seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.
MIT License

Details about curating pubchem dataset #55

Open taew0361 opened 2 years ago

taew0361 commented 2 years ago

Thank you for publishing this great work! I have a question about the PubChem dataset used as the pretraining set.

This arXiv paper briefly mentions that the 77M PubChem dataset is curated down to the 10M PubChem dataset.

Could you explain in a bit more detail how the 77M PubChem dataset was curated?

e.g., SMILES with nonbonded components are removed
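To make the example concrete, here is a minimal sketch of the kind of filter I have in mind. It is only my guess at one possible curation step, not a description of what the paper actually did. The function names `has_nonbonded_components` and `drop_nonbonded` and the sample strings are hypothetical, and I am assuming that "nonbonding" refers to the `.` symbol that SMILES uses to separate disconnected components.

```python
from typing import Iterable, List


def has_nonbonded_components(smiles: str) -> bool:
    """Return True if the SMILES string contains the "." separator.

    Assumption: "." is treated as marking disconnected components
    (salts, mixtures). This is a plain string check, so rare cases where
    ring-closure digits reconnect the parts (e.g. "C1.C1") are also flagged.
    """
    return "." in smiles


def drop_nonbonded(smiles_list: Iterable[str]) -> List[str]:
    """Keep only non-empty SMILES strings without a "." separator."""
    return [s for s in smiles_list if s and not has_nonbonded_components(s)]


# Hypothetical sample, not taken from the PubChem dataset.
examples = ["CCO", "[Na+].[Cl-]", "c1ccccc1", "CC(=O)O.CN"]
kept = drop_nonbonded(examples)
print(kept)
```

If the actual curation used other criteria (for example deduplication, length limits, or random subsampling), it would be great to know which ones and in what order.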