How to filter out uncommon templates

kmaziarz commented 12 months ago

I'm trying to set a cutoff on the minimum number of occurrences of a local template for it to be included in the model. As far as I understand, this is done by passing the --min-template-n (e.g. --min-template-n 5) flag in Run_preprocessing.py. However, I'm confused about the fact that the uncommon templates seem to stick around, e.g. when I launch Train.py, it prints the parameters of LocalRetro, and AtomTemplate_n / BondTemplate_n that are displayed are the counts of all templates (i.e. these numbers don't depend on min_template_n). Shouldn't the uncommon templates be removed and affect the values of {Atom,Bond}Template_n passed to the model class?

shuan4638 commented 12 months ago

Hi,

The number of {Atom,Bond}Template_n is decided by the template files extracted from the dataset (see line 35, 36 at https://github.com/kaist-amsg/LocalRetro/blob/main/scripts/utils.py). Because these files are generated by Extract_from_train_data.py, the number of classes cannot be changed by changing --min-template-n flag in Run_preprocessing.py. Instead, the --min-template-n flag should be added and specified in Extract_from_train_data.py in order to changed the number of template classes.

kmaziarz commented 11 months ago

I see; so this is a new feature you added just now in your latest commit? Before that commit, what was the effect of using the --min-template-n flag in Run_preprocessing.py?

shuan4638 commented 11 months ago

Yes, I tested and added this new feature after knowing this issue.

The --min-template-n flag in Run_preprocessing.py was used to mask the reactions that have reaction tempaltes with low popularity in the dataset (see line 188). I found many of these reactions are likely to have wrong atom-mapping or chemically unreasonable reaction, and learning from these noisy reactions may harm the model's performance.

I did not set --min-template-n in Extract_from_train_data.py before because I encouraged people actaully looking through the reaction tempaltes with low popularity before filtering them out. Some of them may be chemically correct and precious reactions that we don't want to miss.

kmaziarz commented 11 months ago

So what would happen if one were to set --min-template-n only in Run_preprocessing.py? The rare template classes would be included in the model, but there would be no training samples where these are used (because they got filtered out), so the model would technically have these classes in its classification layer, but it will likely learn to never use them?

shuan4638 commented 11 months ago

That's a good point, that would only be a huge waste of computational resources. Then maybe the best way is to change the way of counting {Atom,Bond}Template_n in scripts/utils.py so the number of classes would automatically fit the number existing in the preprocessed files. I will think about how to fix it and update here.

Thanks!

shuan4638 commented 11 months ago

I thought about this issue and I think the best way is defining the --min-template-n flag in data/default_config.json and the modify the function get_configurein scripts/utils.py to load the number of template depending on the defined number.

Since this change may require a large amount of code modification, I won't change this at the moment. But I'll consider this feature in the next version of LocalRetro, which we are currently developing.

kmaziarz commented 11 months ago

I don't fully understand: your commit 3b282b6 added an option to filter out uncommon templates in Extract_from_train_data.py. Will using this new feature not prevent the downstream model from having unnecessary output classes?

In other words, could you elaborate on what usecase you addressed in your commit vs what usecase is still not supported?

shuan4638 commented 11 months ago

Will using this new feature not prevent the downstream model from having unnecessary output classes?

Yes. After adding the option in Extract_from_train_data.py, the issue can be solved directly by exporting fewer templates in atom_templates.csv and bond_templates.csv (the number of output classes in LocalRetro depends on the number of templates recorded in these two files).

In other words, could you elaborate on what usecase you addressed in your commit vs what usecase is still not supported?

The commit should solve the issue (filtering out uncommon templates) in any cases. I am suggesting extra changes because this change (setting higher --min-template-n in Extract_from_train_data.py) will erase the information for templates with fewer popularity in the output template file, making it difficult to track what templates were removed.

kmaziarz commented 11 months ago

I see. I agree being able to vary the template set at model training time (as opposed to data preprocessing time) would be nice to have, but it's not crucial. Thanks!

kaist-amsg / LocalRetro

How to filter out uncommon templates #17