open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International
219 stars 54 forks source link

uploading pbtxt file generated from AstraZeneca ELN #120

Closed bznan closed 2 years ago

bznan commented 2 years ago

A pbtxt file including 750 reactions generated from AstraZeneca by using web editor. The updated file solved one missing SMILES string error.

skearnes commented 2 years ago

Thanks @bznan, the tests are passing now!

connorcoley commented 2 years ago

Thanks, @bznan ! I've looked over the dataset based on the enumerated reactions. Would it be possible for you to share the template pbtxt and .csv used to generate the dataset?

Can you add a brief name and description for this dataset? You can do this on the dataset page view (be sure to save). The name should mention the reaction class and the description should mention the publication/section the dataset corresponds to.

I'm also wondering where these 750 reactions come from relative to the publication -- it looks like the datasets in the paper are from Ahneman et al., Perera et al., and then ELN data. Are these reactions from the ELN or from a previous publication? Just wondering if there's a description of the reactions that I can use to check things over. It looks like information about vessel, setup, stirring, organization, etc. are all unspecified. I can definitely see how that might be the case if this came from ELNs

bznan commented 2 years ago

@connorcoley Yes sure, I can put the template pbtxt and csv file here, or send you through email. I will change the name and description for this dataset and upload it again. This dataset was generated from AstraZeneca ELN, you could find the raw data file, which is a .XML file in the attached Github directory. https://github.com/msaebi1993/yield-rxn/tree/model_5.3.0/jupyter_notebooks/RawDataPreperation

bznan commented 2 years ago

@connorcoley I Just rechecked the ELN raw data, I think a lot of things you mentioned including vessel, setup, stirring are in the comment section which should be generated by hand, and this current pbtxt version including all the useable reaction information in our previous yield prediction work. I could certainly start to generate the other information if you think it's necessary.

connorcoley commented 2 years ago

Thanks!

If the vessel, setup, stirring, etc. are all currently stored as unstructured text and it would be a manual procedure to go through every single one to structure it, I might suggest just trying to place it in the procedure_details field so the information is not lost. Would that be possible?

It looks like the raw format is the UDM! @skearnes and I were recently discussing the prospects of writing an automated translator, so I would be interested in learning more about your experience. Ideally, we could have a lossless 1:1 mapping if the data we are structuring is the same

bznan commented 2 years ago

@connorcoley I have updated the reaction type, created date, and put all the text in the procedure_details. I would say there is still a lot of information that could be generated including Simple ID Reaction ID Sample REF for each reaction component. By far it's kind of hard for me to use a template and a spreadsheet to do that I think the mapping you mentioned could be a good solution. The data structure is kind of the same, except for the unstructured text note. And also there is a lot of data curation work needed for this dataset, as you can see in the last part of the dataset the reagents and reactants were not clearly classified, and also the yield information is kind of messy.

connorcoley commented 2 years ago

Thanks, @bznan ! The instructured text note is a great addition, thanks for doing that.

It looks like the reaction pbtxt template is missing the dollar signs around Note but the enumerated dataset looks okay; was this pbtxt not the final version used?

Also, apologies for being vague when suggesting that you include the template and spreadsheet -- these can be attached to GitHub comments (or pull request comments) as a .zip, but we don't actually want/need them in the pull request itself. It looks like the GitHub actions are failing because of this (i.e., it's trying to parse your template pbtxt as a Dataset). Let's go ahead and remove those files, commit, and push and we can check if the tests are passing.

bznan commented 2 years ago

@connorcoley I first tried using dollar signs but failed, probably because the data type of the input. I found it worked when I directly copy and paste the note in the web editor, so I put in the 'Note' and replace them all by using python script.

connorcoley commented 2 years ago

It looks like there is (only?) one ValidationError stopping us from merging this in, which is the string used for the created date. You've got strings like "Timestamp('2008-07-01 00:00:00')" which can't be parsed. If you simply convert these to strings like "2008-07-01" we should be all set!

connorcoley commented 2 years ago

Terrific! Everything looks good! I'll merge this one in and then create the PR to merge it into the official "main" branch. Thanks a bunch for the contribution

bznan commented 2 years ago

@connorcoley Hi, sorry to inform you. I found there is a problem related to catalyst amount. Currently, the catalyst amount and the metal amount are the same, but actually, they are not. I have corrected the problems, but how can I update that on Github? Should I open a new request? So sorry for that.

connorcoley commented 2 years ago

Ah, can you submit a fresh PR then? I'll delete the old branch