nomic-ai / contrastors

Train Models Contrastively in PyTorch
Apache License 2.0

Regarding Further Training nomic-bert-2048 #13

Closed: pjchungmd closed this issue 3 months ago

pjchungmd commented 4 months ago

First of all, thank you all for sharing your work and the models. Is it possible to take nomic-bert-2048 and continue training it on the MLM objective with additional data (e.g., biomedical text)? If so, what would be the recommended way to approach that?

So far, I have looked at using the Hugging Face Trainer API, starting from the released nomic-bert-2048 weights, and modifying the pretokenize.py code to convert the data into the format that was used to train the model (roughly along the lines of the sketch below). I haven't attempted to use the training code in this repository because it seems to require starting from scratch, which I was hoping to avoid if possible.
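To make that concrete, here is a rough, untested sketch of what I had in mind. The corpus file, hyperparameters, and masking rate are placeholders of my own, not the original Nomic recipe:

```python
# Rough, untested sketch: continue MLM pretraining of nomic-bert-2048
# with the Hugging Face Trainer. Corpus file, hyperparameters, and
# masking rate below are placeholders, not the original recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Per the model card, nomic-bert-2048 uses the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "nomic-ai/nomic-bert-2048", trust_remote_code=True
)

# Placeholder corpus: one document per line of plain text.
raw = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

def tokenize(batch):
    # Truncate to the model's 2048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# 0.15 is the Trainer-ecosystem default; adjust if you want to match
# whatever masking rate the original pretraining used.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nomic-bert-2048-biomed",
    per_device_train_batch_size=4,   # placeholder; tune for your hardware
    gradient_accumulation_steps=8,
    learning_rate=1e-4,              # placeholder
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```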

Thanks in advance for guidance!

zanussbaum commented 4 months ago

I haven't tested it, but you might be able to resume from a checkpoint and train from nomic-bert: https://github.com/nomic-ai/contrastors/blob/main/src/contrastors/configs/train/mlm.yaml#L15
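Something roughly like this; the key name below is illustrative, so check what the linked line in mlm.yaml actually expects:

```yaml
# Illustrative only: see src/contrastors/configs/train/mlm.yaml for the
# real field names. The idea is to start from the released weights
# instead of a random init.
model_args:
  pretrained: nomic-ai/nomic-bert-2048  # hypothetical key name
```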