nomic-ai / contrastors

Train Models Contrastively in Pytorch
Apache License 2.0

Unable to save models in mlm pretraining #15

Closed by sandeep-krutrim 3 months ago

sandeep-krutrim commented 3 months ago

I followed the steps to prepare the data, push it to the Hugging Face Hub, and run MLM pretraining. However, the model fails to save after 1 epoch of training. I get the following error:

    main(config, args.dtype)
  File "/disk1/sandeep/contrastors/src/contrastors/train.py", line 56, in main
    trainer.train()
    torch.save(sampler.state_dict(), f"{output_dir}/sampler.pt")
AttributeError: 'DistributedSampler' object has no attribute 'state_dict'

I am following the standard steps described in the repo without any modifications. Please help.

zanussbaum commented 3 months ago

Thanks for catching this! This is a bug on my part. I don't believe we need to explicitly save the sampler, since we also save the random state and reset it. A quick fix is to comment out these lines: https://github.com/nomic-ai/contrastors/blob/a52d8cacaa5b98f81623671612d8c1ff046eb824/src/contrastors/trainers/base.py#L277:L278 and replace them with pass

I will fix this shortly!
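As an alternative to commenting the lines out, the save could be guarded so that it only runs when the sampler actually exposes state_dict(). The sketch below is a hypothetical helper, not the code in the repo or the fix that was eventually merged: save_sampler_state, its save_fn parameter, and the pickle-based default are assumptions made so the example runs without PyTorch. In the trainer, torch.save could presumably be passed as save_fn instead.

```python
import pickle
from pathlib import Path


def _pickle_save(obj, path):
    """Default writer (assumption): plain pickle, so the sketch needs no PyTorch."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def save_sampler_state(sampler, output_dir, save_fn=_pickle_save):
    """Save the sampler state only if the sampler provides state_dict().

    Hypothetical helper. Returns the path written, or None when the sampler
    has no state_dict() (as with the DistributedSampler in the traceback).
    """
    state_dict_fn = getattr(sampler, "state_dict", None)
    if not callable(state_dict_fn):
        # Nothing to save; skip instead of raising AttributeError.
        return None
    path = Path(output_dir) / "sampler.pt"
    save_fn(state_dict_fn(), path)
    return path
```

With this guard, a sampler without state_dict() is skipped silently, while a custom sampler that does implement it would still be checkpointed.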

zanussbaum commented 3 months ago

This should be fixed in #24.