Add args for prediction type - Githubissues

sustainable-processes / ORDerly

Chemical reaction data & benchmarks. Extraction and cleaning of data from Open Reaction Database (ORD)

MIT License

67 stars 8 forks

Add args for prediction type #2

Open dswigh opened 1 year ago

dswigh commented 1 year ago

Add 2 new args:

1) prediction_type (or something like that): e.g. yield_prediction, only_mapped_reaction, condition_prediction

If the user only wants the mapped reaction strings, we should bypass the sanity checks for the reaction conditions, which results in a larger dataset to work with. Likewise, for yield prediction we would remove reactions without yields, etc.

2) Data_set: only_uspto, all_available

For benchmarking purposes, it would be great to have one option that always generates the same dataset (e.g. only USPTO data), and another option that includes all data currently stored in ORD.
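A minimal sketch of how the two proposed args could look with `argparse`. The flag names and choices are taken from the proposal above; the defaults, the `build_parser` and `checks_to_run` helpers, and the exact mapping from prediction type to checks are assumptions for illustration, not ORDerly's actual interface.

```python
import argparse

# Choices taken from the proposal; everything else here is hypothetical.
PREDICTION_TYPES = ("yield_prediction", "only_mapped_reaction", "condition_prediction")
DATA_SETS = ("only_uspto", "all_available")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two proposed arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the proposed cleaning args")
    parser.add_argument(
        "--prediction_type",
        choices=PREDICTION_TYPES,
        default="condition_prediction",  # assumed default
        help="Task the cleaned dataset is intended for.",
    )
    parser.add_argument(
        "--data_set",
        choices=DATA_SETS,
        default="only_uspto",  # assumed default: the reproducible option
        help="only_uspto gives a fixed dataset; all_available uses everything.",
    )
    return parser


def checks_to_run(prediction_type: str) -> dict:
    """Map a prediction type to the filters that would apply (assumed mapping)."""
    return {
        # Skip condition sanity checks when only mapped reactions are wanted.
        "check_conditions": prediction_type != "only_mapped_reaction",
        # Drop reactions without yields only for yield prediction.
        "require_yield": prediction_type == "yield_prediction",
    }
```

For example, `build_parser().parse_args(["--prediction_type", "only_mapped_reaction"])` would select the mode in which the condition checks are skipped.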

dswigh commented 1 year ago

Instead of having a 'prediction type', let's create two flat-file benchmarks, both extracting only USPTO data: one with default settings, which removes/handles reactions with uncommon molecules, and another with all the arg settings set to 0.
This has been implemented!

dswigh commented 1 year ago

When creating flat files for benchmarking, we should create train/val/test splits (80/10/10), splitting the data in 3 different ways: random, temporal (by grant date), and by reaction class, both by super-class (very hard) and by sub-class (medium difficulty).
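The three split strategies could be sketched roughly as follows, using only the standard library. This assumes each reaction is a dict; the `grant_date` key, the class key names, and all function names are hypothetical. Note that the class-based split here assigns whole classes to a split, so the 80/10/10 ratio applies to the number of classes rather than the number of reactions.

```python
import random
from collections import defaultdict


def _cut_points(n: int) -> tuple:
    """Return the indices that end the train and val portions (80/10/10)."""
    n_train = n * 80 // 100
    n_val = n * 10 // 100
    return n_train, n_train + n_val


def random_split(records, seed: int = 0):
    """Shuffle with a fixed seed so the split is reproducible."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    a, b = _cut_points(len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def temporal_split(records, date_key: str = "grant_date"):
    """Oldest reactions go to train, newest to test.

    Assumes dates sort chronologically, e.g. ISO 8601 strings.
    """
    ordered = sorted(records, key=lambda r: r[date_key])
    a, b = _cut_points(len(ordered))
    return ordered[:a], ordered[a:b], ordered[b:]


def class_split(records, class_key: str, seed: int = 0):
    """Assign whole reaction classes to one split, so no class is shared."""
    groups = defaultdict(list)
    for record in records:
        groups[record[class_key]].append(record)
    classes = sorted(groups)
    random.Random(seed).shuffle(classes)
    a, b = _cut_points(len(classes))

    def gather(selected):
        return [r for c in selected for r in groups[c]]

    return gather(classes[:a]), gather(classes[a:b]), gather(classes[b:])
```

Calling `class_split` once with a super-class key and once with a sub-class key would give the two difficulty levels mentioned above.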