First of all, thank you all for sharing your work and the models. Is it possible to take nomic-bert-2048 and continue training it on the MLM objective on additional data (e.g. biomedical text)? If so, what would be the recommended way to approach that?

So far I have looked at using the Hugging Face Trainer API, starting from the released nomic-bert-2048 weights, and modifying the pretokenize.py code to convert my data into the format that was used to train the model. I haven't attempted to use the training code in this repository because it seems to require starting from scratch, which I was hoping to avoid if possible.
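Concretely, the Trainer-based approach I had in mind looks roughly like the sketch below. Everything in it is my own guess, not something from this repository: the dataset file, output directory, and hyperparameters are placeholders, and the 0.30 masking rate is only my assumption about matching the original pretraining setup. Please correct me if any of this is off.

```python
# Rough sketch of continued MLM pretraining on nomic-bert-2048 with the
# Hugging Face Trainer. All names and values below are placeholders or
# assumptions on my part, not taken from the Nomic repo.

MODEL_ID = "nomic-ai/nomic-bert-2048"

# Illustrative hyperparameters only; mlm_probability=0.30 is my guess at
# matching the original pretraining setup.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 1,
    "mlm_probability": 0.30,
    "max_length": 2048,
}

if __name__ == "__main__":
    # Heavy imports kept inside the entry point so the config above can
    # be inspected without transformers/datasets installed.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # The checkpoint ships custom modeling code, hence trust_remote_code.
    model = AutoModelForMaskedLM.from_pretrained(
        MODEL_ID, trust_remote_code=True
    )

    # "biomed.txt" is a hypothetical file of raw biomedical text, one
    # document per line.
    ds = load_dataset("text", data_files={"train": "biomed.txt"})["train"]
    ds = ds.map(
        lambda batch: tokenizer(
            batch["text"], truncation=True, max_length=HPARAMS["max_length"]
        ),
        batched=True,
        remove_columns=["text"],
    )

    # Dynamic masking in the collator, instead of the offline
    # pretokenize.py pipeline, for this experiment.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=HPARAMS["mlm_probability"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="nomic-bert-2048-biomed",  # hypothetical
            learning_rate=HPARAMS["learning_rate"],
            per_device_train_batch_size=HPARAMS["per_device_train_batch_size"],
            num_train_epochs=HPARAMS["num_train_epochs"],
        ),
        train_dataset=ds,
        data_collator=collator,
    )
    trainer.train()
```

Does this look like a reasonable direction, or is there something about the model's custom code or data format that would make this break?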
Thanks in advance for guidance!