user04f8 / Harmonixr


Proposal Feedback #1

Closed ryeduru closed 5 days ago

ryeduru commented 2 weeks ago

Hey, great job on the proposal! I really like how you've structured your approach to using Transformer models for music generation—it’s a creative and impactful application. A few points to strengthen your proposal further:

The Lakh MIDI Dataset is a strong starting point, but it could be helpful to address its potential limitations. For example, are all genres adequately represented, or will you need to supplement the dataset with additional MIDI files? Addressing this would show how well your model can generalize across diverse musical styles.

Your data cleaning steps, especially standardization and quantization, are on point. Just a suggestion: a brief note on how you'll handle noisy or malformed MIDI files would clarify how you're maintaining data quality.
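To make that concrete, here's a rough sketch of the kind of cleaning pass I have in mind, assuming pretty_midi as the parser and a sixteenth-note grid; both are placeholder choices rather than anything your proposal specifies:

```python
import warnings

import pretty_midi  # assumed dependency; any MIDI parser with error reporting works

GRID_BEATS = 0.25  # sixteenth-note grid; an illustrative choice, not a recommendation

def load_midi_or_skip(path: str):
    """Parse a MIDI file, returning None for malformed or empty files."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # pretty_midi warns on quirky files
            pm = pretty_midi.PrettyMIDI(path)
    except Exception:  # corrupt headers, bad track chunks, etc.
        return None
    if not any(inst.notes for inst in pm.instruments):
        return None  # no usable note content
    return pm

def quantize_in_place(pm: pretty_midi.PrettyMIDI, grid_beats: float = GRID_BEATS) -> None:
    """Snap note onsets and offsets to the nearest grid position."""
    step = pm.resolution * grid_beats  # grid size in MIDI ticks
    for inst in pm.instruments:
        for note in inst.notes:
            start = round(pm.time_to_tick(note.start) / step) * step
            end = round(pm.time_to_tick(note.end) / step) * step
            if end <= start:
                end = start + step  # avoid zero-length notes after snapping
            note.start = pm.tick_to_time(int(start))
            note.end = pm.tick_to_time(int(end))
```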

For the model architecture, I see you're leveraging the Transformer, which is a great fit for sequence-based tasks like music generation. Consider mentioning why a specific variant (e.g., encoder-decoder vs. decoder-only) might be preferred in your context; a short explanation of how you plan to experiment with different setups would show that you've thought through the potential performance differences.
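For reference, a decoder-only variant can be sketched with standard torch.nn building blocks; every size below is a placeholder, and the real vocabulary depends on whatever MIDI tokenization you settle on:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real values depend on the tokenization and compute budget.
VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, MAX_LEN = 512, 256, 8, 6, 1024

class DecoderOnlyMusicModel(nn.Module):
    """GPT-style decoder-only Transformer over MIDI event tokens."""

    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL, batch_first=True,
        )
        # An encoder stack plus a causal mask is the standard way to build
        # decoder-only behavior from torch.nn components.
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        causal = torch.triu(  # True above the diagonal = future positions masked out
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device),
            diagonal=1,
        )
        return self.head(self.blocks(x, mask=causal))
```

An encoder-decoder variant would instead condition generation on a separate input sequence, which is exactly the kind of difference an ablation study can evaluate.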

Lastly, incorporating Reinforcement Learning from Human Feedback (RLHF) is a fantastic idea! It would help to outline the kind of feedback criteria you plan to use—whether it’s about overall musical coherence or specific elements like harmony or rhythm. Establishing this framework early on will make future fine-tuning more straightforward.

Overall, this is a well-structured and promising proposal. Just a few clarifications and additional details, and it’ll be in great shape!

user04f8 commented 5 days ago

Hey! Regarding limitations of the dataset, I want to clarify that the goal is deliberately restricted to the specific musical style we wish to train on. Broadening beyond that (e.g., outside 12-TET) would be feature creep beyond the scope of this project, and it runs up against the limitations of MIDI itself (save for some MPE extensions that aren't broadly supported and for which hardly any data exists anyway). We will consider other datasets, and as much data as possible is ideal, but generalizing across diverse musical styles is emphatically not the goal or scope of this project; fine-tuning to a specific niche is. Moreover, the Lakh dataset is so broad (covering fairly diverse styles across centuries) and so large that unless I rent thousands of dollars of compute, I can't leverage its full scale to begin with.
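To be concrete, working at a tractable scale just means something like the following, where the local path and subset size are placeholders:

```python
import pathlib
import random

LAKH_ROOT = pathlib.Path("data/lmd_full")  # hypothetical local copy of Lakh
SUBSET_SIZE = 20_000  # placeholder; whatever the compute budget allows

all_files = sorted(LAKH_ROOT.rglob("*.mid"))
random.seed(0)  # fixed seed so the subset is reproducible across runs
subset = random.sample(all_files, k=min(SUBSET_SIZE, len(all_files)))
# A style filter for the target niche would go here, before or after sampling.
```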

We will test multiple variations of the architecture quantitatively via standard hyperparameter optimization and ablation studies. We are not aware of existing work that justifies and empirically validates architectural decisions at the level of encoder-decoder vs. decoder-only for this task, so the choice between them will be grounded empirically in our model's actual performance on validation data.
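As a rough sketch of what that looks like (Optuna is one option for the search; train_and_validate is a stand-in for our actual training loop, returning validation loss):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # The architectural decision is treated as just another hyperparameter.
    arch = trial.suggest_categorical("arch", ["decoder_only", "encoder_decoder"])
    n_layers = trial.suggest_int("n_layers", 2, 8)
    d_model = trial.suggest_categorical("d_model", [128, 256, 512])
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    # Placeholder: trains one configuration and returns its validation loss.
    return train_and_validate(arch=arch, n_layers=n_layers, d_model=d_model, lr=lr)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # best-performing configuration on validation data
```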

RLHF is a future goal I'd like the design to be modular enough to support, but it is well, WELL beyond what's feasible in the couple of months available for this project. Realizing this, I've elected to remove it from the project proposal for now, since work like what you mention won't result in a better product within the time we have to build something. My original reason for including it was to avoid technical debt and keep future scope in mind, which is probably true regardless of what we ultimately do with this project beyond the end of the CS506 course.

user04f8 commented 4 days ago

See https://github.com/user04f8/Harmonixr/pull/4