lasigeBioTM / BOLSTM

Classifying Relations via Long Short-Term Memory Networks along Biomedical Ontologies
Apache License 2.0

preprocessing not working correctly #3

Open AndreLamurias opened 5 years ago

AndreLamurias commented 5 years ago

At the moment, the preprocessing step is not generating the correct output, and the trained model obtains low performance. In the meantime, I have uploaded the dditrain and dditest files, which you can move to the temp/ directory to train the model: https://drive.google.com/drive/folders/1wKfdeLGm9x4PbmfkYj9Iz8S7jZZz8PUJ?usp=sharing

mjlorenzo305 commented 5 years ago

Thanks Andre @AndreLamurias, I was able to get the preprocessing step to complete. I previously thought it was hanging, but it turns out it just takes a very long time. I set logging to the INFO level to help indicate that progress was occurring (slowly) during the get_ddi_sdp_instances steps.
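Roughly what I changed, as a minimal sketch using only the standard logging module (the loop and its names are placeholders for the actual per-document work in get_ddi_sdp_instances):

```python
import logging

# Emit timestamped progress messages while the (slow) instance
# extraction runs, so it is clear the process is not hanging.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

docs = ["doc1", "doc2", "doc3"]  # placeholder for the corpus documents
for i, doc_id in enumerate(docs, 1):
    logging.info("extracting SDP instances from %s (%d/%d)", doc_id, i, len(docs))
    # ... per-document processing would go here ...
```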

Once preprocessing completed, I verified that the dditrain numpy arrays contained data, which I then used to train the model. As you noted above, training the full model produced low performance (the model converges at around .45 F1 on the test set after 40 epochs).
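For reference, a minimal version of that sanity check (the file name under temp/ is a placeholder; substitute whichever array the preprocessing step actually wrote):

```python
import numpy as np

# Confirm the preprocessed array exists and is non-empty before training.
arr = np.load("temp/dditrain_x_words.npy", allow_pickle=True)
print(arr.shape, arr.dtype)
assert arr.size > 0, "preprocessing produced an empty array"
```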

I'll try out your pre-processed dataset above and re-train. I'll let you know. Thanks, Mario

mjlorenzo305 commented 5 years ago

@AndreLamurias

I tried the provided preprocessed files by placing them in temp/ (and moving my previously generated files out of the way). After invoking the train process, it fails as follows:

```
Traceback (most recent call last):
  File "src/train_rnn.py", line 832, in <module>
    main()
  File "src/train_rnn.py", line 570, in main
    train(sys.argv[3], sys.argv[4:], train_inputs, id_to_index)
  File "src/train_rnn.py", line 397, in train
    inputs, w2v_layer, wn_index = prepare_inputs(channels, train_inputs, list_order, id_to_index)
  File "src/train_rnn.py", line 349, in prepare_inputs
    X_ids_left = preprocess_ids(X_subpaths_train[0], id_to_index, max_ancestors_length)
  File "src/train_rnn.py", line 204, in preprocess_ids
    idxs = [id_to_index[d.replace("_", ":")] for d in seq if d and d.startswith("CHEBI")]
  File "src/train_rnn.py", line 204, in <listcomp>
    idxs = [id_to_index[d.replace("_", ":")] for d in seq if d and d.startswith("CHEBI")]
KeyError: 'CHEBI:32134'
```

I ran it using the following command:

```
python src/train_rnn.py train temp/dditrain full_model words wordnet common_ancestors concat_ancestors
```

AndreLamurias commented 5 years ago

This is due to different versions of the ChEBI ontology: the ID of that compound was updated after we generated those files. I will open another issue so that the "alt_id" field is also considered.

For future reference, we used this version of the chebi ontology: ftp://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel158/
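A minimal sketch of one way to take alt_id into account when reading the OBO file (build_alt_id_map is a hypothetical helper, not code from this repository):

```python
# Map each "alt_id" in a ChEBI OBO file to the primary "id" of its term,
# so a stale ID such as CHEBI:32134 can be resolved to its current
# primary ID before looking it up in id_to_index.
def build_alt_id_map(obo_path):
    alt_to_primary = {}
    current_id = None
    with open(obo_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "[Term]":
                current_id = None
            elif line.startswith("id: "):            # primary ID of the term
                current_id = line[len("id: "):]
            elif line.startswith("alt_id: ") and current_id:
                alt_to_primary[line[len("alt_id: "):]] = current_id
    return alt_to_primary
```

With such a map, the lookup in preprocess_ids could fall back to the primary ID, e.g. d = alt_to_primary.get(d, d), before indexing into id_to_index.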

mjlorenzo305 commented 5 years ago

Thanks @AndreLamurias, I was able to complete the training using the above-mentioned version of the ChEBI OBO file along with the provided set of preprocessed data (numpy arrays).

Model performance improved, with val_f1 reaching .60, but that is still not as high as I expected after 100 epochs. Convergence occurs at around 30 epochs.

Any thoughts or ideas on what other parameter tuning is required?

Thanks, Mario

Here is the summary for the 100th epoch:

```
Epoch 100/100
predicted not false: 1372/1537
[[5945  133  180   74    9]
 [ 214  268   27    8    0]
 [ 212   15  383   27    2]
 [ 118    8   30  158    1]
 [  17    0    6    1   42]]
VAL_f1: 0.604 VAL_p: 0.653 VAL_r: 0.564
Epoch 00100: val_loss did not improve from 0.39073
```
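For reference, these numbers are consistent with macro-averaging per-class precision, recall, and F1 over the four interaction types. A sketch reproducing them from the matrix above, assuming rows are gold labels, columns are predictions, and index 0 is the negative (no-DDI) class:

```python
import numpy as np

cm = np.array([[5945, 133, 180,  74,   9],
               [ 214, 268,  27,   8,   0],
               [ 212,  15, 383,  27,   2],
               [ 118,   8,  30, 158,   1],
               [  17,   0,   6,   1,  42]])

# "predicted not false": cm[:, 1:].sum() == 1372, cm[1:, :].sum() == 1537
tp = np.diag(cm)[1:]                # correct predictions per DDI type
p = tp / cm[:, 1:].sum(axis=0)      # per-class precision
r = tp / cm[1:, :].sum(axis=1)      # per-class recall
f1 = 2 * p * r / (p + r)            # per-class F1

print(f"VAL_f1: {f1.mean():.3f} VAL_p: {p.mean():.3f} VAL_r: {r.mean():.3f}")
# -> VAL_f1: 0.604 VAL_p: 0.653 VAL_r: 0.564
```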

mjlorenzo305 commented 5 years ago

Following up on my last comment: it looks like I confused the DDI detection task with the DDI classification task. The model I trained above was for DDI classification, and therefore the val_f1 matches (or is slightly better than) the performance reported in the BOLSTM paper. (Correct me if I am mistaken.)

AndreLamurias commented 5 years ago

@mjlorenzo305 Yes, those scores are for the DDI classification task.