How to handle enharmonic equivalents ('same sound, different name'), where notes with different names in different keys sound the same?

I noticed that during model training and generation, the audio is split into multiple independent segments, and each segment is decoded with a greedy strategy. In real music, however, there are enharmonic equivalents ('same sound, different name'): notes that are named differently in different keys but sound the same, such as C♯ and D♭. Splitting the audio into independent segments ignores this information, because one segment cannot see the others. As a result, the generated MIDI sounds very authentic but contains many incorrect notes. Is there a way to solve this?
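To make the term concrete, here is a minimal sketch of what 'same sound, different name' means at the MIDI level. It is not MT3 code: the helper `name_to_midi` and its lookup tables are hypothetical and written only for illustration, assuming the common convention that C4 is MIDI note 60.

```python
# Hypothetical helper, not part of MT3: map a spelled note name to a MIDI number.
# Assumes the common convention C4 = 60 (so octave -1 starts at MIDI note 0).
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {"#": 1, "b": -1}


def name_to_midi(name: str) -> int:
    """Convert a name such as 'C#4' or 'Db4' to a MIDI note number."""
    letter, rest = name[0], name[1:]
    shift = 0
    # Consume any number of sharps/flats that follow the letter.
    while rest and rest[0] in ACCIDENTALS:
        shift += ACCIDENTALS[rest[0]]
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + SEMITONES[letter] + shift


# Differently spelled notes collapse to the same MIDI number.
print(name_to_midi("C#4"), name_to_midi("Db4"))  # 61 61
print(name_to_midi("E#4"), name_to_midi("F4"))   # 65 65
```

If this sketch matches how the output is encoded, two spellings of the same sound are indistinguishable in the MIDI itself, so the key-dependent name would have to come from context outside a single segment.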

magenta / mt3, issue #120