The Jupyter Notebook used to convert the datasets is located at bdeadman/surf/surf2ord_troubleshooting.ipynb. The surf2ord.py script has been modified to output data into the latest ord-schema version and preferred style.
Notes:
Provenance data was not found in the Minisci dataset, so the provenance is assumed to also be @alexarnimueller.
In the borylation dataset, several rows had catalyst_1 defined only by its CAS number. For all but one of these, I have found a SMILES string to approximate the catalyst.
surf2ord now assigns each reagent/catalyst/reactant/solvent to a separate input instead of collecting them together by role. Inputs with multiple components would be used when the components are known to be added as a solution or mixture.
rxn_type has been converted to the REACTION_TYPE type instead of NAME (this makes it compatible with ord-schema >0.3.38).
CAS numbers have been converted to the CAS_NUMBER type instead of NAME.
The Isolated analysis type in SURF has been defined as the WEIGHT analysis type in ORD.
dataset_name and dataset_description options have been added to the surf2ord function. This ensures the output dataset passes validation in ord-schema >0.3.86. Placeholder text is included by default so that it can be edited afterwards.
Fixed the code so it no longer multiplies fractional yields by 100 * 100.
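For reviewers, here is a minimal sketch of two of the behaviours noted above: one input per component, and scaling a fractional yield exactly once. It is plain Python and does not use ord-schema. The function names, the `(role, identifier)` tuple layout and the `role_N` key format are illustrative assumptions, not the actual code in surf2ord.py.

```python
from typing import Dict, List, Tuple


def one_input_per_component(
    components: List[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Give every component its own input key instead of grouping by role.

    `components` is a list of (role, identifier) pairs. The key format
    "<role>_<n>" is a hypothetical naming scheme chosen for this sketch.
    """
    counters: Dict[str, int] = {}
    inputs: Dict[str, List[str]] = {}
    for role, identifier in components:
        counters[role] = counters.get(role, 0) + 1
        # One component per input; a multi-component input would only be
        # used for a known solution or mixture.
        inputs[f"{role}_{counters[role]}"] = [identifier]
    return inputs


def fractional_yield_to_percent(fraction: float) -> float:
    """Convert a fractional yield (0.0 to 1.0) to a percentage, scaling once."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a fractional yield in [0, 1], got {fraction}")
    return fraction * 100.0


if __name__ == "__main__":
    row = [("catalyst", "[Pd]"), ("solvent", "CO"), ("solvent", "O")]
    print(one_input_per_component(row))
    print(fractional_yield_to_percent(0.25))
```

With this layout, two solvents in the same row end up as `solvent_1` and `solvent_2` rather than as two components of a single solvent input.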
Borylation and Minisci datasets from the SURF publication (ChemRxiv, 2024, DOI: 10.26434/chemrxiv-2023-nfq7h-v2). These are reactions which have been collected from the literature and summarised in SURF format by @alexarnimueller.