1) prediction_type (or something like that): e.g. yield prediction, only_mapped_reaction, condition_prediction
If the user only wants the mapped reaction strings, we should bypass the sanity checks on the reaction conditions, which ultimately gives a larger dataset to work with. Likewise for yield prediction (there we only remove reactions without yields), etc.
2) Data_set: only_uspto, all_available
For benchmarking purposes, it would be great to have one option that always generates the same dataset (e.g. only USPTO data), and another option that simply includes all data currently available.
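A minimal sketch of how the two proposed args could look, assuming an argparse-style CLI. The flag names and choice values come from the suggestions above (with "yield prediction" written as `yield_prediction`); the record keys `mapped_rxn`, `yield` and `conditions_ok` are hypothetical placeholders, not fields of the existing code.

```python
import argparse

# Hypothetical option values, taken from the names floated in this proposal.
PREDICTION_TYPES = ("yield_prediction", "only_mapped_reaction", "condition_prediction")
DATA_SETS = ("only_uspto", "all_available")


def build_parser():
    """Build a parser exposing the two proposed args."""
    parser = argparse.ArgumentParser(description="Sketch of the two proposed args")
    parser.add_argument("--prediction_type", choices=PREDICTION_TYPES,
                        default="condition_prediction")
    parser.add_argument("--data_set", choices=DATA_SETS, default="only_uspto")
    return parser


def keep_reaction(rxn, prediction_type):
    """Decide whether a reaction record (a dict) survives filtering.

    The keys used here are assumptions made for illustration only.
    """
    if prediction_type == "only_mapped_reaction":
        # Bypass the condition sanity checks: a mapped string is enough.
        return bool(rxn.get("mapped_rxn"))
    if prediction_type == "yield_prediction":
        # Only drop reactions that have no yield recorded.
        return rxn.get("yield") is not None
    # condition_prediction: stand-in for the existing sanity checks.
    return bool(rxn.get("conditions_ok"))
```

Using `choices` means an unsupported value fails at parse time rather than somewhere deep inside the cleaning step.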
Instead of having a 'prediction type', let's create two flat-file benchmarks, both extracting only USPTO data: one with the default settings, which remove/handle reactions with uncommon molecules, and another with all the arg settings set to 0.
When creating flat files for benchmarking, we should create train/val/test splits (80/10/10), splitting the data in 3 different ways: random, temporal (by grant date), and by rxn class (both by super-class (very hard) and by sub-class (medium difficulty)).
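The three splitting strategies could be sketched roughly as follows, assuming each reaction is a dict. The keys `grant_date` and `rxn_class` are hypothetical names, and the class-based split here holds out whole classes, so the 80/10/10 ratio applies to the number of classes rather than the number of reactions.

```python
import random


def split_points(n, fractions=(0.8, 0.1, 0.1)):
    """Return the two cut indices for a train/val/test split of n items."""
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return n_train, n_train + n_val


def random_split(records, seed=0):
    """Shuffle with a fixed seed so the benchmark is reproducible."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    a, b = split_points(len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def temporal_split(records, date_key="grant_date"):
    """Oldest reactions go to train, newest to test.

    Assumes dates are ISO strings (YYYY-MM-DD), which sort chronologically.
    """
    ordered = sorted(records, key=lambda r: r[date_key])
    a, b = split_points(len(ordered))
    return ordered[:a], ordered[a:b], ordered[b:]


def class_split(records, class_key="rxn_class", seed=0):
    """Hold out whole reaction classes, so val/test classes are unseen in train."""
    classes = sorted({r[class_key] for r in records})
    random.Random(seed).shuffle(classes)
    a, b = split_points(len(classes))
    train_c, val_c = set(classes[:a]), set(classes[a:b])
    train = [r for r in records if r[class_key] in train_c]
    val = [r for r in records if r[class_key] in val_c]
    test = [r for r in records
            if r[class_key] not in train_c and r[class_key] not in val_c]
    return train, val, test
```

Passing a super-class label as `class_key` would give the harder split, and a sub-class label the medium one.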
Add 2 new args:
1) prediction_type (or something like that): e.g. yield prediction, only_mapped_reaction, condition_prediction
2) Data_set: only_uspto, all_available