molecule-one / megan

Code for "Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits"
MIT License

Nice work - are you aware of the USPTO data leak? (+ training issue on HPC) #4

Closed linminhtoo closed 3 years ago

linminhtoo commented 3 years ago

Hi authors, this is very nice work. I enjoyed reading the paper as it was well explained. In particular, your top-20 and top-50 accuracies are very high, beating works like GLN, which is impressive. Also, thanks for being one of the few repos that provide proper documentation + an env.yaml for easy install.

However (and as much as I hate to break it to you), I'm not sure if you are aware of the USPTO data leak. Please refer to https://github.com/uta-smile/RetroXpert, a work concurrent with yours, for details on this issue.

In short, the 1st mapped atom (id: 1) in the product SMILES (and reactant SMILES) is usually the reaction centre. So, if you directly use the original atom mapping in the USPTO dataset to set the order of generation/edits for the molecular graphs, I think there's a subtle but important data leak that makes it easier for the model to generate the correct reactants: the model implicitly learns that the first atom of most products (and consequently, reactants) is the reaction centre and thus needs some sort of modification, rather than having to learn which atom really is the reaction centre. The authors of RetroXpert have been alerted to this leak and have implemented a corrected canonicalization pipeline to ensure that the atom-map ordering is truly canonical.
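For concreteness, the fix amounts to renumbering the product's atom maps with RDKit's canonical ranking (computed with the old maps stripped, so the ranking cannot depend on the leaky numbering) and then carrying the old-to-new mapping over to the reactant side. A rough sketch of the idea, not their actual canonicalize_products.py (the function name and return convention here are mine):

    from rdkit import Chem

    def recanonicalize_atom_maps(mapped_product_smiles):
        # Returns (re-mapped product SMILES, {old_map: new_map}); the same dictionary
        # should then be applied to the reactant SMILES to keep the mapping consistent.
        mol = Chem.MolFromSmiles(mapped_product_smiles)
        old_maps = [atom.GetAtomMapNum() for atom in mol.GetAtoms()]
        # Strip the map numbers first so the canonical ranking cannot depend on the
        # original (leaky) numbering.
        for atom in mol.GetAtoms():
            atom.SetAtomMapNum(0)
        ranks = Chem.CanonicalRankAtoms(mol, breakTies=True)
        remap = {}
        for atom, old, rank in zip(mol.GetAtoms(), old_maps, ranks):
            new = int(rank) + 1  # canonical map numbers start at 1
            atom.SetAtomMapNum(new)
            remap[old] = new
        return Chem.MolToSmiles(mol), remap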

I have re-trained RetroXpert on that truly-canonical USPTO-50K and their top-1 accuracy is significantly lower at ~45% (and so are the remaining top-K accuracies). This means their work is no longer SOTA.

I took a brief look at MEGAN's data scripts but I didn't see any canonicalization step. I am interested in whether you have already corrected for this issue, or whether you have already re-run/are re-running the experiments. Thanks.

mikolajsacha commented 3 years ago

Hi, thank you for your kind comments. We are also aware of the data leak problem in the RetroXpert paper. We have checked for it in our model and we believe that it does not apply to MEGAN.

To train MEGAN, we construct ground truth paths of graph edits, which rely on the reaction mapping between substrates and product. However, during evaluation, the model does not use the mapping information - there is simply no such feature in the input representation. Moreover, as MEGAN is a Graph Convolutional Network, its inference does not depend on the order of its nodes; that is, it performs exactly the same calculations for the input SMILES "CCO" and "OCC", etc.
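To make the last point concrete, here is a quick check with RdKit (just an illustration, not part of the MEGAN code):

    from rdkit import Chem

    # "CCO" and "OCC" describe the same molecular graph, so a model that consumes the
    # graph rather than the string sees identical input for either ordering.
    a = Chem.MolFromSmiles("CCO")
    b = Chem.MolFromSmiles("OCC")
    assert Chem.MolToSmiles(a) == Chem.MolToSmiles(b)  # both canonicalize to "CCO"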

There is a little tweak in the evaluation code which might seem like a data leak, but we believe it is not. At https://github.com/molecule-one/megan/blob/8cba1016d80dd84120c85348fec861cb4f91304d/bin/eval.py#L142 we make sure that the order of atoms in the SMILES is the same as the order of the reaction mapping numbers. This tweak is needed because of a bug in the RdKit version that we use, which makes canonical SMILES that contain stereochemical information depend on the order of atoms in the RdKit Mol object. Because of this bug, we could generate a correct prediction but falsely count it as incorrect, since the "canonical" SMILES would differ between the MEGAN prediction and the ground truth target. The tweak works around this bug but does not affect the input representation of MEGAN (the model does not know the order of atoms in the graph). We are investigating a newer version of RdKit, and it seems that this issue has since been solved; once we confirm it, we will update the code and re-run the evaluation using only unmapped canonical product/substrate SMILES as the model input.
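Roughly, the tweak boils down to something like the following (a simplified sketch, not a verbatim excerpt from eval.py):

    from rdkit import Chem

    def order_atoms_by_map(mapped_smiles):
        # Put the atoms of the Mol in atom-map-number order before writing the canonical
        # SMILES, so prediction and target are canonicalized from the same atom order and
        # the stereochemistry-related canonicalization bug cannot make them differ.
        mol = Chem.MolFromSmiles(mapped_smiles)
        order = [a.GetIdx() for a in sorted(mol.GetAtoms(), key=lambda a: a.GetAtomMapNum())]
        mol = Chem.RenumberAtoms(mol, order)
        return Chem.MolToSmiles(mol)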

linminhtoo commented 3 years ago

Thanks for the detailed reply and explanation. I think it makes sense that MEGAN doesn't suffer from the data leak, but I'll continue to think about it just to make sure.

Sidetracking from the main issue (happy to open a separate issue), I want to re-train MEGAN on my own version of USPTO-50K but I'm having some difficulty. I have my own train.csv/valid.csv/test.csv files already prepared (an approx. 80/10/10 split, but not the same as the one GLN used).

I ran into "FileNotFoundError: Dataset has not been downloaded and/or generated. Please run "acquire" method to be able to load it!", so I looked at your acquire.py code and I see that you have to make some x.tsv/metadata.tsv and other files. I could try to understand it on my own, but I figured it'd be much faster to ask you directly. Can you guide me on how to train on my 3 csv files?

edit: can I just rename my 3 files to raw_train.csv/raw_val.csv/raw_test.csv, pretend they were downloaded from GLN's dropbox, and then run python bin/acquire.py uspto_50k? The problem is, I no longer have the reaction 'id' and 'reaction_type_id' in my csv files. I could monkey patch those lines (lines 82/83 of uspto_50k.py), but I'm not sure if it will break later parts of the pipeline.

mikolajsacha commented 3 years ago

For a quick solution, you can replace raw_train.csv/raw_val.csv/raw_test.csv with your own reactions and use them as if they were uspto_50k. id can be any string as long as it is unique for each reaction; you can generate it, for instance, as an ascending integer. reaction_type_id is used only for uspto_50k_rt (the "reaction type given" variant). In your case, if you do not have such information, you can simply set this column to 0 for all reactions.
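Something along these lines should be enough to prepare the files (a sketch; the input filenames are placeholders for your three csv files):

    import pandas as pd

    for split in ("train", "val", "test"):
        df = pd.read_csv(f"my_{split}.csv")              # placeholder names for your own files
        df["id"] = [f"rxn_{i}" for i in range(len(df))]  # any unique value per reaction works
        df["reaction_type_id"] = 0                       # only used by the uspto_50k_rt variant
        df.to_csv(f"raw_{split}.csv", index=False)       # drop-in replacement for the GLN files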

If you want to implement this more "properly", you need to create a new class that inherits from Dataset, similar to https://github.com/molecule-one/megan/blob/8cba1016d80dd84120c85348fec861cb4f91304d/src/datasets/uspto_50k.py. It must save the data as x.tsv, metadata.tsv and split.tsv. You can probably deduce what each of these files should contain from the code of the acquire method. After writing such a class, you must add it to the dictionary at https://github.com/molecule-one/megan/blob/8cba1016d80dd84120c85348fec861cb4f91304d/src/config.py#L25. Then you can acquire/featurize/train/evaluate using the same commands as for uspto_50k, replacing it with the key of your dataset in this dictionary.
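As a very rough skeleton of such a class (the base-class interface, attribute names, filenames and tsv columns below are placeholders; please deduce the real ones from uspto_50k.py and its acquire method):

    import os
    import pandas as pd

    from src.datasets import Dataset  # check uspto_50k.py for the exact import and base-class API


    class MyDataset(Dataset):
        key = "my_dataset"  # also register this key in the dict in src/config.py

        def acquire(self):
            frames = []
            for split in ("train", "valid", "test"):
                df = pd.read_csv(f"my_{split}.csv")  # placeholder filenames
                df["split"] = split
                frames.append(df)
            data = pd.concat(frames, ignore_index=True)

            out_dir = self.feat_dir  # placeholder for wherever the base class expects the tsv files
            os.makedirs(out_dir, exist_ok=True)
            data[["reaction_smiles"]].to_csv(os.path.join(out_dir, "x.tsv"), sep="\t", index=False)
            data[["id", "reaction_type_id"]].to_csv(os.path.join(out_dir, "metadata.tsv"),
                                                    sep="\t", index=False)
            pd.get_dummies(data["split"]).to_csv(os.path.join(out_dir, "split.tsv"),
                                                 sep="\t", index=False)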

linminhtoo commented 3 years ago

Fantastic, thanks for the pointers. The model has been training for a while and I haven't gotten any errors so far. I'll leave this issue open for a day or two and then close it.

linminhtoo commented 3 years ago

Hello again, just wondering how long training is supposed to take with the latest configs. I remember reading 10 hours on USPTO-50K in the repository but 16 hours in the paper; I suppose the configs have changed since then? I see that the early stop patience used to be 8 but is now 16.

Currently I'm at epoch 56 and it just got a new best val top-1 of 49.02% (train top-1 86.63%), which means training will run for at least 16 more epochs before it ends (the LR is 1.25e-8). One epoch takes ~20 mins (give or take) on 1x RTX2080Ti, so it's been about 20 hours. I'm wondering if the patience of 16 is really necessary, since it seems like the run would take at least 25 hours (20 hours so far + 16 more epochs).

Also, can you confirm how long evaluation should take on a test set of 5k reactions with beam_size=50?

mikolajsacha commented 3 years ago

In our case the training took 35 epochs, which was about 10 hours on a single GTX1070. Maybe the longer training stems from the different dataset and random weight initialization. You can safely interrupt the training with a keyboard interrupt, as weights are saved every epoch. A patience of 16 is very conservative and could probably be lowered to 8 or even 6.

Evaluation should take about an hour on 5k samples and beam size 50.

linminhtoo commented 3 years ago

Hi mikolaj,

I restarted a separate training session with an LR patience of 4 and an early stop patience of 8. I also increased the LR factor from 0.05 to 0.1, to follow the values given in the paper.

I now face 2 problems:

  1. I have 39713 reactions in train and 4989 in validation. However, I see from the logs: "Training on chunk of 39643 training samples and 4980 valid samples". This is problematic for me as I (ideally) need the model to train on all 39713 reactions. Can I do something (like relaxing some filter) to restore the 39643 back to 39713 (and similarly for validation)? I see from the logs that:

    2021-04-29 05:50:03,714 - src.feat.megan_graph - INFO - Concatenating metadata
    2021-04-29 05:50:03,725 - src.feat.megan_graph - INFO - Saving found actions
    2021-04-29 05:50:06,290 - src.feat.megan_graph - INFO - Found 54 reaction actions
    2021-04-29 05:50:06,291 - src.feat.megan_graph - INFO - Number of steps: max: 16, avg: 4.234206575084597
    2021-04-29 05:50:06,291 - src.feat.megan_graph - INFO - Saving featurized data
    2021-04-29 05:50:10,654 - src.feat.megan_graph - INFO - Saved 44623/44702 reactions (99.8232741264373%)
    2021-04-29 05:50:10,655 - src.feat.megan_graph - INFO - Saved 44623 paths (avg. 1.0 paths per reaction)
    2021-04-29 05:50:10,655 - src.feat.megan_graph - INFO - Saving featurization metadata

    which suggests that setting max_step higher than 16 might fix it? Failing that, can I still get top-50 predictions for all 39713 training samples during inference?

  2. An annoying CUDA out of memory problem: the second training run seemed to be going fine, but halfway through I encountered an annoying CUDA issue. I'm not sure if it's a GPU problem or some subtle code issue, but I tried different nodes on my HPC and the error still happens. I restarted the training again, but the error happened again during the first few batches of the first epoch.

Traceback (most recent call last):
  File "bin/train.py", line 356, in <module>
    dispatch_utils.dispatch_configurable_command(train_megan)
  File "/home/linmin001/megan_0/src/utils/dispatch_utils.py", line 187, in dispatch_configurable_command
    os.path.join(save_path, "stderr.txt"))(save_path)
  File "/home/linmin001/megan_0/src/utils/dispatch_utils.py", line 119, in func_wrapper
    func(*args, **kwargs)
  File "/home/linmin001/.conda/envs/megan/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/linmin001/.conda/envs/megan/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/linmin001/.conda/envs/megan/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "bin/train.py", line 316, in train_megan
    lr_step=warmup_lr_step if epoch_i < megan_warmup_epochs else 0.0)
  File "bin/train.py", line 223, in run_epoch
    raise e
  File "bin/train.py", line 212, in run_epoch
    batch_metrics = run_batch(all_ind, train)
  File "bin/train.py", line 136, in run_batch
    loss = -torch.log2(actions + ~target_one + eps) * target_one * weight
RuntimeError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 10.76 GiB total capacity; 1.56 GiB already allocated; 20.75 MiB free; 159.17 MiB cached)
  In call to configurable 'train_megan' (<function train_megan at 0x2aab38336510>)

As you can see, the error message is not helpful: the GPU clearly has enough memory left, but it somehow cannot be allocated. I am using 8 cores with 8 GB each, so 64 GB of RAM, which should be more than enough. It is extremely strange because the first time I could run the same code to epoch 50+, as I mentioned earlier :( I also re-ran the data pre-processing steps.

Have you ever encountered this while running experiments? I tried to follow all the advice from https://github.com/pytorch/pytorch/issues/16417, but none of it seems to solve the problem. If this is your first time seeing such an issue, I guess it's probably some GPU hardware issue on my end.

EDIT: sorry, this might be quite overwhelming. I would really appreciate your help!

mikolajsacha commented 3 years ago

Hi

Ad 1. The smaller number of samples featurized by MEGAN can be caused by two factors: 1) RdKit could not read the SMILES of some samples (which should not be the case for USPTO-50k, which is 100% parsable by RdKit), or 2) we remove from the train/valid sets reactions that have more than the maximum (16) generation steps. You can try increasing this hyperparameter, but from our experience it is not worth it in terms of Top K on USPTO-50k, as reactions with more than 16 steps are still very unlikely to be generated correctly during evaluation. This hyperparameter does not affect the reactions selected for evaluation, which still runs on your whole dataset, and the model will try to generate reactions for all input molecules. You can even set a different "max_gen_steps" during evaluation in the bin/eval.py script.

Ad 2. The error suggests that you exceeded the GPU RAM, which seems very unlikely if you have 64 GB. We trained on a GPU with 8 GB of RAM and did not get any such error. Your error message shows 10.76 GiB total capacity, which suggests that the process in fact has less GPU RAM available; maybe this indicates some other problem? You can also try lowering the batch size, although it is already small (4). Sorry for not being too helpful here, but we did not encounter such problems on our side.
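If it helps with debugging, you can also print what PyTorch actually sees right before training starts, for example (memory_reserved is named memory_cached in older PyTorch versions):

    import os
    import torch

    # Show which GPU the process got and how much of its memory PyTorch is using.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    if torch.cuda.is_available():
        dev = torch.cuda.current_device()
        props = torch.cuda.get_device_properties(dev)
        gib = 1024 ** 3
        print(f"device {dev}: {props.name}, total {props.total_memory / gib:.2f} GiB")
        print(f"allocated {torch.cuda.memory_allocated(dev) / gib:.2f} GiB, "
              f"reserved {torch.cuda.memory_reserved(dev) / gib:.2f} GiB")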

I hope you will be able to solve the remaining issues, and I wish you the best in your research.

linminhtoo commented 3 years ago

Thanks for the explanation! Really appreciate it. Re: issue 2, it was most likely just some temporary issue with my compute cluster.

linminhtoo commented 3 years ago

Hello, just wanted to update that the root cause was in env.sh. If you're on a cluster, line 29 (export CUDA_VISIBLE_DEVICES=0) can be very problematic. Only after a second look through the codebase did I figure out that this could be causing the problem. If your job lands on a node with multiple GPUs, where other jobs are also running and using those GPUs, hard-coding GPU 0 can cause conflicts in GPU usage with cryptic errors like the one I encountered. Commenting this line out solved the issue. You might wish to update the README :)
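For anyone else hitting this on a shared cluster: the safer pattern is to only pin GPU 0 when the scheduler has not already chosen a device for you, e.g. in Python (the env.sh fix itself is simply commenting the export out):

    import os

    # Respect whatever GPU the scheduler assigned; only fall back to GPU 0 when nothing
    # is set (e.g. on a local single-GPU machine), instead of overriding it.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")
    print("Using CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])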

Also, for configs/uspto_50k.gin, I reduced the lr_patience, lr_factor and start_epoch according to your reply and what is in the paper.

This gave me a best top-1 validation accuracy of 48.92% at epoch 31, and training early-stopped after 40 epochs, taking about 14 hours on 1x NVIDIA RTX2080Ti. I think this is reasonably close to your results.

My dataset has 39.7k/5k/5k train/valid/test reactions, and just in case, I also re-canonicalized all the atom mappings using the script from RetroXpert (see https://github.com/uta-smile/RetroXpert/blob/canonical_product/canonicalize_products.py). This did not affect the results, which confirms that MEGAN is unaffected by the data leak.

Thanks for your help so far and I am going to close this issue.