seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.
MIT License
389 stars, 60 forks

Regression text dataset #31

Closed by ElanaPearl 3 years ago

ElanaPearl commented 3 years ago

Accessing rows from the HuggingFace data_loader with the "csv" loader seemed to be slowing down MTR training. I made some changes that should fix this. The bigger the dataset, the bigger the improvement should be: with 1k compounds you can't tell the difference between the old and new methods, but with 1M compounds we get 160% more iterations/s. I haven't benchmarked larger datasets since I'm doing this on a small machine, but we should test this tomorrow with the full dataset.
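For readers following along, here is a minimal sketch of one common way to avoid per-row loader overhead: parse the CSV once and serve rows from plain Python lists. This is not necessarily what the PR does. The class name `InMemoryRegressionDataset`, the helper `iterations_per_second`, and the column names `smiles`, `logp`, and `mw` are all hypothetical, and the sketch assumes each row holds one text field plus several float targets.

```python
import csv
import io
import time


class InMemoryRegressionDataset:
    """Hypothetical map-style dataset: parse the CSV once, index lists afterwards."""

    def __init__(self, csv_file, text_column="smiles", label_columns=None):
        # csv_file is any open text file object; column names are assumptions.
        reader = csv.DictReader(csv_file)
        fieldnames = reader.fieldnames or []
        if label_columns is None:
            # Treat every column other than the text column as a float target.
            label_columns = [c for c in fieldnames if c != text_column]
        self.texts = []
        self.labels = []
        for row in reader:
            self.texts.append(row[text_column])
            self.labels.append([float(row[c]) for c in label_columns])

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Row access is a pair of list lookups, with no parsing at access time.
        return {"text": self.texts[idx], "labels": self.labels[idx]}


def iterations_per_second(dataset, n_iters=10_000):
    """Rough row-access throughput for a non-empty dataset (not a full training benchmark)."""
    start = time.perf_counter()
    for i in range(n_iters):
        dataset[i % len(dataset)]
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


# Tiny made-up sample; the values are illustrative only.
sample = io.StringIO("smiles,logp,mw\nCCO,-0.31,46.07\nc1ccccc1,2.13,78.11\n")
ds = InMemoryRegressionDataset(sample)
rate = iterations_per_second(ds, n_iters=1_000)
```

Running `iterations_per_second` on the old and new dataset objects with the same `n_iters` gives a quick like-for-like comparison of row access alone. It keeps the whole file in memory, so it only fits datasets that are small enough for the machine.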

What's in the PR:

This makes a few assumptions (that are easy to stick with):