open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Added the two datasets from the surf publication (#195) #199

Open bdeadman opened 1 month ago

bdeadman commented 1 month ago

Borylation and Minisci datasets from the SURF publication (ChemRxiv, 2024, DOI: 10.26434/chemrxiv-2023-nfq7h-v2). These reactions were collected from the literature and summarised in SURF format by @alexarnimueller.

The Jupyter Notebook used to convert the datasets is located at bdeadman/surf/surf2ord_troubleshooting.ipynb. The surf2ord.py script has been modified to output data into the latest ord-schema version and preferred style.

Notes:

github-actions[bot] commented 1 month ago
Change summary:
Filename                    Added  Removed  Changed
data/borylation_ord.pbtxt   0      0        0
data/minisci_ord.pbtxt      0      0        0
Total                       0      0        0
github-actions[bot] commented 1 month ago
Change summary:
Filename                                                     Added  Removed  Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz   1111   0        0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz   130    0        0
Total                                                        1241   0        0
bdeadman commented 1 month ago

@connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

skearnes commented 1 month ago

> @connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

Yes, that's expected; we don't protect any branches except for main by requiring approvals.

skearnes commented 1 month ago

@qai222 do you have time to take a look at these for correctness?

bdeadman commented 3 weeks ago

For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.

This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

qai222 commented 3 weeks ago

> For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.
>
> This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

Yeah, the paper says "Yields were determined by gas chromatography and are based on moles of B2pin2." in the Table 2 caption. I agree we should report it as it is.
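
For anyone reviewing similar entries, here is a minimal sketch of the arithmetic discussed above: a percent yield computed relative to moles of B2pin2, plus a check that flags values above 100% without altering them. `YieldRecord`, `gc_yield_percent`, and `flag_over_100` are hypothetical names made up for this illustration (they are not part of ord-schema or surf2ord.py), and the mole amounts are invented, not taken from the paper or the SURF file. The sketch does not decide whether a given >100% value is a calibration error; it only surfaces it for review.

```python
from dataclasses import dataclass


@dataclass
class YieldRecord:
    """Hypothetical stand-in for one reported yield; not an ord-schema type."""
    reaction_id: str
    percent_yield: float
    analysis: str  # e.g. "GC"


def gc_yield_percent(mol_product: float, mol_b2pin2: float) -> float:
    """Percent yield relative to moles of B2pin2 (the basis named in the caption)."""
    if mol_b2pin2 <= 0:
        raise ValueError("mol_b2pin2 must be positive")
    return 100.0 * mol_product / mol_b2pin2


def flag_over_100(records):
    """Return the records whose yield exceeds 100%, leaving the values unchanged."""
    return [r for r in records if r.percent_yield > 100.0]


# Illustrative numbers only; they are not taken from the paper or the SURF file.
records = [
    YieldRecord("rxn-a", gc_yield_percent(0.45, 0.50), "GC"),
    YieldRecord("rxn-b", gc_yield_percent(0.55, 0.50), "GC"),
]
flagged = flag_over_100(records)
print([r.reaction_id for r in flagged])
```

Flagging rather than clipping keeps the stored value identical to what the source reports, which matches the "report it as it is written" position taken in this thread.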