open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Add USPTO-480K dataset from https://doi.org/10.1039/C8SC04228D #61

Closed skearnes closed 3 years ago

skearnes commented 3 years ago

Here's the notebook I used to create these: uspto-480k.zip. Only the reaction SMILES are included; the validations won't pass until https://github.com/open-reaction-database/ord-schema/pull/559 is merged.

The data in https://github.com/connorcoley/rexgen_direct is licensed under GPL-3.0, and @connorcoley has consented to re-license for inclusion in the ORD.

skearnes commented 3 years ago

Updated to use only reaction SMILES and not add in made-up stuff to make the validations happy; PTAL.

connorcoley commented 3 years ago

The information after the reaction SMILES is an annotation of bond changes during the reaction; it is not reaction information in the extended SMILES format. Everything after the space should just be discarded and the type changed to plain old REACTION_SMILES. The description should also be updated to reflect that these aren't REACTION_CXSMILES.
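As a rough illustration of the suggested fix, here is a minimal sketch in plain Python that keeps only the text before the first space. The function name `strip_bond_annotation`, the example line, and the `record` dict are hypothetical and only for illustration; the annotation shown is a placeholder, not necessarily the exact format used in the rexgen_direct files, and this is not the code from the notebook.

```python
def strip_bond_annotation(line: str) -> str:
    """Return only the reaction SMILES, dropping anything after the first whitespace."""
    # Split once on whitespace: the first token is the reaction SMILES,
    # the remainder (if any) is the bond-change annotation to discard.
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty line: no reaction SMILES found")
    return parts[0]


# Made-up example line: atom-mapped reaction SMILES, a space, then a
# placeholder bond-change annotation.
example = "[CH3:1][Br:2].[OH2:3]>>[CH3:1][OH:3] 1-2;1-3"

# Hypothetical record mirroring the suggestion: plain REACTION_SMILES,
# not REACTION_CXSMILES, with the annotation removed.
record = {"type": "REACTION_SMILES", "value": strip_bond_annotation(example)}
print(record["value"])
```

A line that has no annotation passes through unchanged, so the same function can be applied to every row regardless of whether the trailing field is present.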

skearnes commented 3 years ago

Thanks; fixed.