Working with a .csv file directly #6

mehedihasandesu commented 1 month ago

Hi, I am trying to apply your model to my own dataset, a .csv file containing SMILES strings for molecules along with their corresponding HOMO-LUMO values. My goal is to use the model as is, following the procedures outlined in the OGB utility scripts. However, I am having difficulty integrating my .csv file into the pipeline, particularly the data preprocessing: splitting the data and performing the conversions and embeddings needed to train the model. I am a beginner in machine learning, so I would greatly appreciate any guidance or suggestions for processing my data and running the model. Thank you very much.

shamim-hussain commented 1 month ago

You basically need to implement a class similar to this one and use it as your dataset:

https://github.com/shamim-hussain/egt_pytorch/blob/9e66956a5fdc6f6e8a865863d029468380bb63e5/lib/data/pcqm4mv2/data.py#L8C1-L49C21

Notice that in the record_tokens function we create a unique identifier for each molecule, and in read_data we specify how to get the graph data (nodes, edges, target). Notice also how we use the smiles2graph function to convert SMILES into graphs.
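As a rough illustration of that pattern for a .csv file, here is a minimal, standalone sketch. It is not the repository's code: the class name `CsvMoleculeDataset`, the column names `smiles` and `homolumo`, the 80/10/10 split, and the constructor arguments are all assumptions made for the example. Only the `record_tokens` / `read_data` idea and the use of a SMILES-to-graph converter come from the description above.

```python
import csv
import random


class CsvMoleculeDataset:
    """Hypothetical CSV-backed dataset; NOT the egt_pytorch base class."""

    def __init__(self, csv_file, to_graph, split="training",
                 smiles_col="smiles", target_col="homolumo",
                 fractions=(0.8, 0.1), seed=0):
        # csv_file is any open text file (or file-like object) with a header row.
        rows = list(csv.DictReader(csv_file))
        self.smiles = [row[smiles_col] for row in rows]
        self.targets = [float(row[target_col]) for row in rows]
        # to_graph converts one SMILES string into a dict of graph arrays.
        self.to_graph = to_graph

        # Shuffle row indices with a fixed seed so that the training,
        # validation and test instances all agree on the same partition.
        indices = list(range(len(rows)))
        random.Random(seed).shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_valid = int(fractions[1] * len(indices))
        splits = {
            "training": indices[:n_train],
            "validation": indices[n_train:n_train + n_valid],
            "test": indices[n_train + n_valid:],
        }
        self._tokens = splits[split]

    @property
    def record_tokens(self):
        # One unique identifier per molecule: here, simply the CSV row index.
        return self._tokens

    def read_data(self, token):
        # Build the graph for one molecule and attach its regression target.
        graph = dict(self.to_graph(self.smiles[token]))
        graph["target"] = self.targets[token]
        return graph

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        return self.read_data(self._tokens[index])
```

In real use, `to_graph` could be `smiles2graph` from `ogb.utils`, and `csv_file` an open handle such as `open("my_data.csv", newline="")` (a hypothetical file name). To plug into the actual pipeline, the same logic would need to live in a subclass of the repository's own base class, following the linked file.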

Then, if you subclass your dataset class with mixins like GraphDataset, SVDEncodingsGraphDataset, StructuralDataset, etc., additional features will be attached automatically.
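For readers new to mixins, the general Python mechanism looks roughly like the toy sketch below. All class names here (`BaseRecords`, `NodeCountMixin`, `OutDegreeMixin`, `MyGraphRecords`) are made up for illustration and are unrelated to the repository's mixins; the point is only that each mixin wraps `__getitem__`, calls `super()`, and adds its own keys to the returned record.

```python
class BaseRecords:
    """Hypothetical base dataset: returns bare graph records."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        # Return a copy so mixins can add keys without touching stored data.
        return dict(self.records[index])


class NodeCountMixin:
    """Hypothetical mixin: attaches the number of nodes to each record."""

    def __getitem__(self, index):
        item = super().__getitem__(index)
        item["num_nodes"] = len(item["node_features"])
        return item


class OutDegreeMixin:
    """Hypothetical mixin: attaches the out-degree of every node."""

    def __getitem__(self, index):
        item = super().__getitem__(index)
        degree = [0] * len(item["node_features"])
        for source, _target in item["edges"]:
            degree[source] += 1
        item["out_degree"] = degree
        return item


# Mixins are listed before the base class so their __getitem__ runs first
# and delegates down the method resolution order via super().
class MyGraphRecords(OutDegreeMixin, NodeCountMixin, BaseRecords):
    pass
```

With this arrangement the subclass body stays empty: indexing a `MyGraphRecords` instance returns the base record plus whatever each mixin attaches.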

mehedihasandesu commented 1 month ago

It's working now. Thank you so much.

shamim-hussain commented 1 month ago

You are welcome.
