slryou41 / reaction-gcnn

Chainer implementation of Graph Neural Networks for the Prediction of Substrate-Specific Organic Reaction Conditions

custom input #3

Open rmrmg opened 2 years ago

rmrmg commented 2 years ago

Hello, I want to train your model on my own data, so I have a couple of questions about the input format. The train.py script defines the input as shown below:

        datafile = 'data/suzuki_type_train_v2.csv'
        class_num = 119
        class_dict = {'M': 28, 'L': 23, 'B': 35, 'S': 10, 'A': 17}
        dataset_filename = 'data.npz'
        labels = ['Yield', 'M', 'L', 'B', 'S', 'A', 'id']
  1. class_num is the out_dim of the MLP. What value should I set here, and how is it derived? Based on the train file, it seems to be connected to class_dict somehow. My first idea was that it should be sum(class_dict.values()), but that gives 113, not 119.

  2. class_dict: as far as I understand, the keys are column names in the csv and each value is the number of unique values in that column. Values in the csv should have the format 'xNUMx', where x is any character and NUM is a number (as I see in dataset/suzuki_data_frame_parser.py). The numbering should be continuous across the different kinds of compounds (e.g. M0M to M9M and then L10L to L19L if we want to include 10 metals and 10 ligands). Please correct me if I am wrong.

  3. How do labels correspond to the columns in the csv file and to class_dict? Based on the code, I found that the csv should have 'Reactant1', 'Reactant2' and 'Product' columns. Are there any other obligatory column names?
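
To make the 'xNUMx' format from question 2 concrete, here is a minimal sketch of how such a label could be turned into an integer index. It is not taken from dataset/suzuki_data_frame_parser.py: the names LABEL_PATTERN and label_to_index are hypothetical, and the sketch assumes that x is a single non-digit character on each side.

```python
import re

# Assumption: a label is one non-digit character, then digits, then one
# non-digit character, e.g. 'M0M' or 'L10L'. This is a guess at the format
# described above, not the repository's actual parsing code.
LABEL_PATTERN = re.compile(r'^\D(\d+)\D$')


def label_to_index(label):
    """Return the integer NUM embedded in an 'xNUMx' label (hypothetical helper)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"label {label!r} does not look like 'xNUMx'")
    return int(match.group(1))


if __name__ == '__main__':
    for example in ['M0M', 'M9M', 'L10L', 'L19L']:
        print(example, label_to_index(example))
```

Whether the real parser also requires the indices to run continuously across columns is exactly the open question above; this sketch only extracts the number.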

koyurion commented 1 year ago

Hi, I think the 119 breaks down as follows: yield: 1; sum(class_dict.values()): 113; and each of the 5 classes should have a null/other label: 5. That gives 1 + 113 + 5 = 119.
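
The arithmetic in this guess can be checked with a short sketch. The class_dict values are copied from the train.py snippet quoted above; the helper guess_class_num and its parameters are hypothetical, and the breakdown itself (one yield output plus one null/other slot per column) is an unconfirmed reading of the code, not something the repository documents.

```python
# Values copied from the train.py snippet quoted in this issue.
class_dict = {'M': 28, 'L': 23, 'B': 35, 'S': 10, 'A': 17}


def guess_class_num(class_dict, yield_outputs=1, null_labels_per_column=1):
    """Hypothetical helper: yield output(s) + all classes + null/other slots.

    Assumes one extra null/other label for every column in class_dict.
    """
    total_classes = sum(class_dict.values())
    null_labels = null_labels_per_column * len(class_dict)
    return yield_outputs + total_classes + null_labels


if __name__ == '__main__':
    print(sum(class_dict.values()))     # 113
    print(guess_class_num(class_dict))  # 119
```

If this reading is right, a custom dataset would need class_num recomputed the same way from its own class_dict.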