@asigalov61 Hey, thanks for your interest in our work. I appreciate you attaching the transcription results. To be honest, I still think that it doesn’t work well outside of the dataset.
Regarding audio pre-processing, it could be seen as a test-time augmentation method, so we might expect a slight boost in performance.

As for joining split notes, the model uses a tie token, and of course, post-processing to connect these notes is already in place. Nonetheless, there are still instances where the notes sound disjointed or where the decoder occasionally falls silent. I suspect (though it's just a hunch) that providing extended context could improve this. In fact, the recent MR-MT3 paper reported a slight improvement in performance. There are many papers on long-context transformers (e.g., Transformer-XL), and the most recent one is probably TransformerFAM.

I've also been interested in your project for some time. Thanks for sharing your thoughts!
Best, Sungkyun
@mimbres Thank you for your detailed response.
Yes, training on more data/timbres of different instruments and combinations of instruments should definitely help IMHO. The drawback would be the compute resources needed to train such a model.
Anyway, I will be following your research and this year's ISMIR, so I wanted to wish you success in your presentation and your work :)
And thank you for your interest in my project as well. If you have any questions about it, do not hesitate to contact me :)
Most sincerely,
Alex.
@asigalov61
Unfortunately, I will not be able to attend ISMIR this year! 🙁 (I missed the submission deadline and had to submit this work to another conference, MLSP in London...)
@mimbres Sorry to hear that... Wish you luck anyway! Your model/implementation is very nice, so I hope it will get noticed regardless :)
@mimbres Hey guys!
First of all, I wanted to thank you for sharing this work and for making the nice HF Space demo/Google Colab. It works great :)
This is definitely an improvement over the baseline Google model, so congrats on the results. In particular, your YPTF.MoE+Multi (noPS) model does very well on out-of-dataset samples. I am attaching the results if you care to take a look :)
transcription_results.zip
Now, the main reason I am writing is that I wanted to make a couple of suggestions that I think you may find useful:
1) Even the Google baseline MT3 model transcribes pretty well if the audio is very similar to the training data. So I wanted to suggest thinking about integrating audio pre-processing that would make the input sound more like the training data. I think downsampling coupled with high-/low-pass filters could work here, and some other methods could also help. Since you used Slakh, which was rendered with specific SoundFont banks, you should easily be able to create a pre-processing filter that mimics the sound and frequency range of the Slakh instruments (see the first sketch after this list).
2) I think you could also implement post-processing for the resulting MIDI output, such as joining the neighboring/overlapping durations of the notes of each transcribed instrument. Music transcription models often split individual (longer) notes into smaller ones, so putting them back together should really help, and it can easily be done after the MIDI output is generated by merging durations (see the second sketch after this list).
3) Last but not least, I also wanted to suggest adding some basic form of alignment to bars for the transcribed output. I think it would make the output even more useful and nice (see the third sketch after this list) :)
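To make suggestion 1) more concrete, here is a minimal sketch of the kind of pre-processing I mean. It assumes librosa, scipy and soundfile are available, and the 16 kHz target rate and the 60 Hz / 7.5 kHz cutoffs are only illustrative values, not numbers tuned for Slakh:

```python
# Sketch of suggestion 1): resample + band-limit the input so it sits closer
# to the rendered (Slakh-like) training timbre. All parameter values are
# illustrative assumptions, not tuned settings.
import librosa
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def preprocess_for_transcription(in_path, out_path, target_sr=16000,
                                 hp_cutoff=60.0, lp_cutoff=7500.0):
    # Load as mono and resample to the rate the model expects.
    y, sr = librosa.load(in_path, sr=target_sr, mono=True)

    # High-pass removes low-frequency rumble, low-pass tames very bright
    # content, roughly narrowing the spectrum toward SoundFont-rendered audio.
    sos_hp = butter(4, hp_cutoff, btype="highpass", fs=sr, output="sos")
    sos_lp = butter(4, lp_cutoff, btype="lowpass", fs=sr, output="sos")
    y = sosfiltfilt(sos_lp, sosfiltfilt(sos_hp, y))

    sf.write(out_path, y, sr)
    return out_path
```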
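For suggestion 2), here is a minimal sketch of the MIDI clean-up I mean, using pretty_midi; the 50 ms gap threshold is an arbitrary example value:

```python
# Sketch of suggestion 2): per instrument, merge notes of the same pitch that
# overlap or are separated by only a tiny gap, to undo the splitting of long notes.
import pretty_midi

def merge_split_notes(in_path, out_path, max_gap=0.05):
    pm = pretty_midi.PrettyMIDI(in_path)
    for inst in pm.instruments:
        merged = []
        # Handle one pitch at a time so only fragments of the same note are joined.
        for pitch in sorted({n.pitch for n in inst.notes}):
            notes = sorted((n for n in inst.notes if n.pitch == pitch),
                           key=lambda n: n.start)
            for n in notes:
                prev = merged[-1] if merged and merged[-1].pitch == pitch else None
                if prev is not None and n.start - prev.end <= max_gap:
                    # Extend the previous note instead of keeping the fragment.
                    prev.end = max(prev.end, n.end)
                else:
                    merged.append(n)
        inst.notes = sorted(merged, key=lambda n: (n.start, n.pitch))
    pm.write(out_path)
    return out_path
```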
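And for suggestion 3), a very rough sketch of snapping notes to a grid; the fixed 120 BPM tempo and 16th-note grid are pure assumptions for illustration, since proper bar alignment would need tempo/beat estimation first:

```python
# Sketch of suggestion 3): quantize note onsets/offsets to a fixed grid.
# The tempo and grid resolution here are assumed, not detected.
import pretty_midi

def quantize_to_grid(pm: pretty_midi.PrettyMIDI, bpm=120.0, divisions_per_beat=4):
    step = 60.0 / bpm / divisions_per_beat  # grid step in seconds (16ths at 120 BPM)
    for inst in pm.instruments:
        for n in inst.notes:
            n.start = round(n.start / step) * step
            # Keep at least one grid step of duration after quantization.
            n.end = max(n.start + step, round(n.end / step) * step)
    return pm
```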
Anyway, thank you again for your work and for the demos.
Sincerely,
Alex.