jhuang448 / LyricsAlignment-MTL

MIT License

List of notes and their order to finetune the model #1

Closed littlebeanhp closed 1 year ago

littlebeanhp commented 1 year ago

Thank you for submitting this solid work. We are fine-tuning your model on Vietnamese, a language with almost the same phone and character tokens. However, we don't have the list of ground-truth notes (D2 to C6) from your paper's description, and if they came from a list, we don't know its order.

So I'm raising this issue to ask for the note list you used in this model repo (47 notes). I look forward to your reply, as our deadline is coming near... :(

jhuang448 commented 1 year ago

Hi, thank you for your interest in my work. I do not quite get your question, but let me clarify:

The ground-truth notes in the DALI dataset are in Hz, and I converted them to MIDI note numbers using this function. If you are asking about the relationship between MIDI numbers, notes, and frequencies, here is a random page I found by googling.

MIDI numbers are integers, so I can convert them to a pianoroll-like representation. In the dataloader, I load the pianoroll and offset it by 38, so that 0 corresponds to D2, 1 corresponds to D#2, and so on.
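In code, the Hz-to-MIDI conversion plus the offset looks roughly like this (a sketch using the standard MIDI tuning convention, not the exact repository code; `hz_to_midi` and `midi_to_class` are illustrative names):

```python
import math

A4_MIDI, A4_HZ = 69, 440.0  # standard tuning reference: A4 = 440 Hz = MIDI 69
D2_MIDI = 38                # the note that becomes class 0 after the offset

def hz_to_midi(freq_hz):
    # Round to the nearest semitone on the MIDI scale.
    return int(round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ)))

def midi_to_class(midi_number):
    # Offset by 38 so that D2 -> 0, D#2 -> 1, ...
    return midi_number - D2_MIDI

print(midi_to_class(hz_to_midi(73.42)))  # D2 is about 73.42 Hz -> class 0
```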

You will need to obtain the pitch ground truth for your dataset. If you do not have that, you might try a pitch-tracking algorithm to generate the 'ground truth', for example running crepe and quantizing the estimated frequencies to MIDI numbers.
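As a sketch of that quantization step: assuming you already have frame-wise f0 estimates (crepe's `predict` returns per-frame frequencies and confidences), converting them to integer MIDI numbers could look like this. The `-1` marker for unvoiced frames is an assumption for illustration, not something the repo prescribes:

```python
import numpy as np

# Hypothetical frame-wise f0 estimates in Hz; in practice these could come
# from crepe.predict. 0.0 marks an unvoiced frame here.
freqs = np.array([220.0, 440.0, 0.0, 261.63])

def quantize_to_midi(freqs, fmin=1e-3):
    """Quantize frequencies (Hz) to integer MIDI numbers; -1 for unvoiced."""
    midi = np.full(freqs.shape, -1, dtype=int)
    voiced = freqs > fmin
    midi[voiced] = np.round(69 + 12 * np.log2(freqs[voiced] / 440.0)).astype(int)
    return midi

print(quantize_to_midi(freqs))  # A3, A4, unvoiced, C4 -> [57 69 -1 60]
```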

Please let me know if you have further questions.

BTW, just curious, I recently received several inquiries regarding my implementation, are you working on a course project or something?

Jiawen

littlebeanhp commented 1 year ago

Your reply contains all the answers that I need, and it is very much on point even though you said you hadn't caught what I asked 😄

As I don't have much music knowledge, I thought my note list from D2-C6 was somehow different from yours: D2-D#2-E2-F2-... -B5-C6.

With this list, our note ground truth consists of 47 tokens + 1 space token --> 48 tokens total. In the end, all the notes and pitches are turned into numpy/torch arrays anyway, but I'm afraid that mismatched note ground truths between the pretrained model and the new fine-tuning data would trick the model.

Your reply clarified that your work uses the same note system. Maybe this information is a bit confusing in your paper, where you say: "The target pitch range is D2-C6, therefore Npitch is 47 (with one additional class for silence)". This statement can be read as Npitch=47 with the "S" (silence) token included.

And finally, I want to ask one last question about the "silence" note. In your last reply, 38 is subtracted from the note numbers, which means D2:0, D#2:1, ... This is valuable information, but what is the position of the "silence" token? Is it at the end of the list (index 48), or labeled with another integer (-100 or something)?

Thank you so much again for answering me. This is all a bit new in my research career; I hope you understand :)

And btw, we are not working on a course project. We are taking part in a big challenge in the Vietnamese AI community and found your solid work along the way. We modified the whole phoneme dictionary for Vietnamese songs, and we also generated all the notes frame by frame (C3, D3, C2, B3, ...). The last step is simply to apply the note ground truth in the training pipeline. Maybe the order of the note list is not that important, since the loss function can easily force the model to learn from the new ground truth, but we just want to make sure, to avoid getting our model dumped :D

jhuang448 commented 1 year ago

You are right. I double-checked my code and I think there is a typo in my paper: the highest included note (C6) should actually be B5, as the maximum MIDI number allowed is 83. So there are 46 MIDI classes plus 1 silence class = 47 classes. The silence class index is 46 (counting from 0). Thank you for pointing that out.
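To make the final layout concrete, here is a small sketch enumerating the 47 classes implied above (my reconstruction from this thread, not code from the repo; note naming follows the common convention where MIDI 60 = C4):

```python
# MIDI 38 (D2) .. MIDI 83 (B5) map to classes 0..45; silence is class 46.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def midi_to_name(m):
    # MIDI 60 = C4 under this convention.
    return NOTE_NAMES[m % 12] + str(m // 12 - 1)

classes = [midi_to_name(m) for m in range(38, 84)] + ['silence']
print(len(classes), classes[0], classes[45], classes[46])  # 47 D2 B5 silence
```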

I agree it would not matter much, as the model will learn to adapt to the new classes. I just want to mention that if you are using your own note list, please make sure you use a resolution equal to or finer than a semitone. I mean, between C3 and D3 there is a C#3, so it is better to assign a class to C#3 than to quantize it to C3 or D3, because the semitone is a musically meaningful unit.

Given that you already have the frame-level note annotations, you could build the pianoroll representation directly from it.
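For instance, if the frame-level annotations are already mapped to class indices (with silence = 46, as above), a one-hot pianoroll could be built along these lines (a minimal sketch with assumed shapes, not the repository's dataloader):

```python
import numpy as np

NUM_CLASSES = 47  # 46 semitones D2..B5 + 1 silence class

def frames_to_pianoroll(frame_classes):
    """One note class per frame -> one-hot array of shape (num_frames, 47)."""
    frame_classes = np.asarray(frame_classes)
    roll = np.zeros((len(frame_classes), NUM_CLASSES), dtype=np.float32)
    roll[np.arange(len(frame_classes)), frame_classes] = 1.0
    return roll

roll = frames_to_pianoroll([0, 0, 12, 46])  # D2, D2, D3, silence
print(roll.shape)  # (4, 47)
```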

Good luck with the challenge! And I am happy to see to what extent my model can generalize to Vietnamese songs!