Simple/Basic tutorial - Githubissues

Jaceyjc commented 2 years ago

Hello,

This may be a very simple question, but I could not find the answer after digging through the Graphormer documentation. Suppose I have a CSV training file containing tens of thousands of SMILES strings and their corresponding properties, say just 2 properties. For example, the training .csv file would look like this:

SMILES, Property 1, Property 2
SMILE1, some_number11, some_number21 
SMILE2, some_number12, some_number22
SMILE3, some_number13, some_number23
...
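For reference, a file in this layout can be read into a list of SMILES strings and a list of per-molecule target vectors with the standard library alone. This is only a minimal sketch: the sample rows, the `SAMPLE_CSV` name, and the `load_smiles_csv` helper are made up for illustration and are not part of Graphormer or fairseq.

```python
import csv
import io

# Illustrative stand-in for the training file; a real file would be opened
# with open("train.csv", newline="") instead (file name is an assumption).
SAMPLE_CSV = """SMILES, Property 1, Property 2
CCO, 0.12, 1.50
c1ccccc1, 0.98, 2.25
CC(=O)O, 0.45, 0.75
"""


def load_smiles_csv(handle):
    """Return (header, smiles, targets); targets[i] holds the floats of row i."""
    # skipinitialspace drops the blank that follows each comma in this layout.
    reader = csv.reader(handle, skipinitialspace=True)
    header = next(reader)  # first row holds the column names
    smiles, targets = [], []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        smiles.append(row[0].strip())
        targets.append([float(value) for value in row[1:]])
    return header, smiles, targets


header, smiles, targets = load_smiles_csv(io.StringIO(SAMPLE_CSV))
print(header)
print(smiles)
print(targets)
```

Keeping the SMILES and the targets in parallel lists makes it straightforward to index both with the same train/valid/test indices later on.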

At this point, I only know that I will have to invoke the command fairseq-train --user-dir path/to/graphormer with many of the options described on the fairseq page (https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-train) and in Graphormer's documentation. On the website, I could only roughly make sense of the OC20 (Train a New Model) example, which uses LMDB files. So for the very simple case mentioned above (one huge training CSV file and a separate test CSV file), how do I go about using Graphormer?

I would appreciate it if someone could give me some pointers or direct me to the part of the documentation that I might have overlooked.

Thank you.

Best, Jacey

zhengsx commented 2 years ago

We're working on making customized datasets easier to use, and we are also preparing some related tutorials. Please stay tuned.

Jaceyjc commented 2 years ago

Hi, that would be very helpful - thank you!

I have been exploring the repository and have managed to run the example scripts. One issue I have been struggling with: how do I get (and save) the prediction results for my test set while running the 'evaluate' script? At the moment, I am only getting values for AUC. I tried using the --results-path option from fairseq, but that did not work.

zhengsx commented 2 years ago

Save this variable.
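One way to persist such values, assuming the predictions and labels have already been converted to plain Python sequences at that point in the evaluation script, is to write them to a CSV file. This is a rough sketch only: `save_predictions`, `y_true`, `y_pred`, and the output file name are hypothetical names chosen for illustration, not identifiers confirmed to exist in Graphormer.

```python
import csv
import os
import tempfile


def save_predictions(path, y_true, y_pred):
    """Write one row per test example: index, true label, predicted value."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "y_true", "y_pred"])
        for index, (true_value, predicted) in enumerate(zip(y_true, y_pred)):
            writer.writerow([index, true_value, predicted])


# Made-up values standing in for what the evaluation script would produce.
y_true = [0, 1, 1]
y_pred = [0.1, 0.8, 0.65]

output_path = os.path.join(tempfile.mkdtemp(), "test_predictions.csv")
save_predictions(output_path, y_true, y_pred)

# Read the file back to confirm what was written.
with open(output_path, newline="") as handle:
    saved_rows = list(csv.reader(handle))
print(saved_rows)
```

If the values are tensors rather than lists, they would first need to be moved to the CPU and converted (for example with `.tolist()` in PyTorch) before being passed to a helper like this.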

JiaYuanChng commented 2 years ago

I have a related question, since I have been struggling to get training to converge on my own dataset.

Could you please kindly confirm if the workflow detailed below is correct:

1. Given a dataset similar to the one above (SMILES, Property), first convert the SMILES strings into DGL graphs using some package (DGL-LifeSci, etc.).
2. Generate a data split and identify the indices/ranges of the train/valid/test sets.
3. Reconstruct the dataset as a list of tuples [(DGL1, property1), (DGL2, property2), ... ].
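The workflow described here can be sketched roughly as follows. The SMILES-to-graph conversion is deliberately left as a hypothetical placeholder (`smiles_to_graph`), because the real call depends on the package chosen; `split_indices`, the 80/10/10 fractions, and the toy records are likewise assumptions made for illustration, not Graphormer's own API.

```python
import random


def smiles_to_graph(smiles):
    """Hypothetical placeholder: a real version would return a DGL graph
    built by a package such as DGL-LifeSci."""
    return {"smiles": smiles}


def split_indices(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle range(n) and cut it into train/valid/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


# Toy records: ten linear alkanes with made-up property values.
records = [("C" * k, float(k)) for k in range(1, 11)]

# Steps 1 and 3: convert each SMILES and pair the graph with its property.
dataset = [(smiles_to_graph(s), y) for s, y in records]

# Step 2: index lists that identify the train/valid/test subsets.
train_idx, valid_idx, test_idx = split_indices(len(dataset))
print(len(train_idx), len(valid_idx), len(test_idx))
```

Fixing the seed keeps the split identical across runs, which helps when comparing training runs that fail to converge.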

Have I missed anything?

Thank you.

microsoft / Graphormer

Simple/Basic tutorial #120