Open vmenan opened 2 months ago
Hello @vmenan,
I think collate_fn() is an appropriate place to extend.
What input format do you use: plain text, TSV, or a Hugging Face dataset? If you could provide some dummy samples of your multi-source input data, I might be able to help you further :)
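As a rough illustration of that idea (hypothetical names and example format, not JoeyNMT's actual API), a collate_fn could tag each example with a source id while padding, so the loss can later be split or weighted per dataset:

```python
import torch

def multi_source_collate_fn(batch):
    """Collate examples from mixed sources, keeping a per-example
    source id so the loss can be weighted per dataset later.

    Each example is assumed (hypothetically) to be a dict like
    {"src_ids": [...], "trg_ids": [...], "source": 0}.
    """
    pad_id = 0  # assumed padding index

    def pad(seqs):
        # Pad every sequence in the list to the longest one.
        max_len = max(len(s) for s in seqs)
        return torch.tensor(
            [s + [pad_id] * (max_len - len(s)) for s in seqs]
        )

    return {
        "src": pad([ex["src_ids"] for ex in batch]),
        "trg": pad([ex["trg_ids"] for ex in batch]),
        "source": torch.tensor([ex["source"] for ex in batch]),
    }
```

The `"source"` tensor lets you mask the loss per dataset inside one mixed batch, as an alternative to feeding three separate batches.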
Hi @may-
Thank you so much for your reply, and I apologize for the delay. I will look into collate_fn(). Let me describe the task for you in detail.
I created a Hugging Face dataset class for an English-to-German translation task. For example, say I have 3 different English-to-German datasets (subtitles, parliament, and medical data), named A, B, and C, each with 100K data points.
What I was thinking of doing was to override the training manager class so that I can change the batch for-loop to iterate over zip(A, B, C). This way I get 3 batches at once: I pass them through the model, get the loss for each batch, take a weighted sum of the losses, and finally take an optimizer step. Once one batch from each dataset has passed through the model, I consider that one training step.
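The step I have in mind could be sketched roughly like this (hypothetical names; the model, loss function, and weights are placeholders, not JoeyNMT's actual classes):

```python
import torch

def train_step(model, loss_fn, optimizer, batches, weights):
    """One training step over one batch from each dataset:
    forward each batch, weight the per-dataset losses, then
    take a single optimizer step on the weighted sum.
    """
    optimizer.zero_grad()
    total_loss = 0.0
    for batch, weight in zip(batches, weights):
        out = model(batch["src"])
        loss = loss_fn(out, batch["trg"])
        total_loss = total_loss + weight * loss
    total_loss.backward()   # one backward pass over the weighted sum
    optimizer.step()        # one step per (A, B, C) triple of batches
    return total_loss.item()

# Caveat: plain zip(A, B, C) stops at the shortest loader; if the
# datasets ever differ in size, wrap the shorter ones with
# itertools.cycle or re-create their iterators when exhausted.
```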
The code is beautifully written in JoeyNMT, but I believe I may need to make some changes to achieve this. Do you think there is a better way to approach this? Thank you so much for your support!
Hi @vmenan,
Ah, ok, now I understand your project better. So you'd like to compute the loss separately for each dataset, right? Then you do indeed need to change the training manager class.
I can think of three scenarios:
I cannot say which approach is better; it depends on your goal.
Hi! I came across this library very recently and I am loving it! In my current research I am trying to implement knowledge distillation, which requires multiple datasets to be passed in; a single step is counted once one batch from each dataset has gone through the model. I am struggling a bit to extend the current JoeyNMT to achieve this. It would be wonderful if I could get some help with this.
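For the distillation objective itself, a common formulation (nothing JoeyNMT ships; the teacher/student names here are placeholders) is a KL divergence between temperature-softened teacher and student distributions, mixed with the usual cross-entropy against the gold labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Soft-label distillation loss (Hinton-style):
    alpha * KL(teacher || student) at temperature T
    + (1 - alpha) * cross-entropy against the gold targets.
    """
    # Soften both distributions with the temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 rescales the soft-loss gradients back to the hard-loss scale.
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce
```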