neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Torch.save() for large training Dataset #56

Open stelmath opened 1 year ago

stelmath commented 1 year ago

Hello,

I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs and it takes 4 hours to process them before the actual model training even starts. So I used the --cache_data option, but when the code reaches torch.save(self.examples, cache_fn) it takes forever and uses all my RAM (60 GB is not enough on a VM; on my 32 GB RAM PC it swaps to HDD and then takes forever to complete). Would you know of an alternative way to save the Dataset and reuse it before training, so the loading overhead can be skipped?
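One possible workaround (this is only a sketch, not an existing awesome-align option; the helper names save_sharded / load_sharded are hypothetical) would be to write the cached examples out in shards, so that no single torch.save call has to serialize the entire 10M-pair list at once:

```python
import os
import torch

def save_sharded(examples, cache_dir, shard_size=500_000):
    """Save a long list of preprocessed examples as several smaller .pt files."""
    os.makedirs(cache_dir, exist_ok=True)
    for i in range(0, len(examples), shard_size):
        shard = examples[i:i + shard_size]
        torch.save(shard, os.path.join(cache_dir, f"shard_{i // shard_size:05d}.pt"))

def load_sharded(cache_dir):
    """Reload the shards in order and concatenate them back into one list."""
    examples = []
    for name in sorted(os.listdir(cache_dir)):
        if name.endswith(".pt"):
            examples.extend(torch.load(os.path.join(cache_dir, name)))
    return examples
```

This keeps peak memory during saving closer to one shard rather than the whole dataset, at the cost of a directory of files instead of a single cache file.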

Many thanks

zdou0830 commented 1 year ago

Hello, thanks! I think you can subsample the data, since finetuning BERT for word alignment doesn't require a lot of parallel data.
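A minimal sketch of that subsampling suggestion (assuming the usual awesome-align input format of one "source ||| target" pair per line; the file names and the size of 200k pairs are just examples):

```python
import random

def subsample(in_path, out_path, n_keep=200_000, seed=42):
    """Keep a random subset of the parallel data to cut preprocessing time."""
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.seed(seed)
    sample = random.sample(lines, min(n_keep, len(lines)))
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(sample)

subsample("train.src-tgt", "train.sub.src-tgt")
```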