mwang87 / SMART_NMR

Repository to help develop SMART
https://smart.ucsd.edu/
MIT License

Validation issue #159

Closed ChungmaruQ closed 4 years ago

ChungmaruQ commented 4 years ago

Problem: some compounds have differently formatted SMILES in the SMART 2.0 and 2.1 JSON files.

Why: the metadata in the SMART 2.1 JSON file was completely regenerated and re-validated. Example (Name, SMILES):

In 2.0: Turkesterone(Calcd_CDCl3), CC(C)(O)CCC(O)C(C)(O)C1CCC2(O)C3=CC(=O)C4CC(O)C(O)CC4(C)C3C(O)CC12C

In 2.1: Turkesterone, CC(C)(O)CCC@@HC@(O)[C@H]1CC[C@@]2(O)C3=CC(=O)[C@@H]4CC@@HC@@HC[C@]4(C)[C@H]3C@HC[C@]12C

Both entries have the same embedding, however, so they get the same cosine score in the results.

Solution: use the first block of the InChIKey for validation.

In 2.0: Turkesterone(Calcd_CDCl3), WSBAGDDNVWTLOM

In 2.1: Turkesterone, WSBAGDDNVWTLOM
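The InChIKey comparison proposed here could look roughly like the sketch below. It assumes each record carries an InChIKey string; the function names are hypothetical and not taken from the SMART code base. The first block (14 characters) encodes the connectivity of the molecule, so a planar SMILES and a stereo-annotated SMILES of the same compound should share it. Only the first block in the example comes from this thread; the second blocks are placeholders for illustration.

```python
def inchikey_first_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey, upper-cased."""
    return inchikey.strip().split("-")[0].upper()


def same_compound(key_a: str, key_b: str) -> bool:
    """Treat two records as the same compound if their first blocks match."""
    return inchikey_first_block(key_a) == inchikey_first_block(key_b)


# First block taken from this thread; the parts after the first "-" are
# placeholders only, NOT the real second blocks of these entries.
key_v20 = "WSBAGDDNVWTLOM-UHFFFAOYSA-N"
key_v21 = "WSBAGDDNVWTLOM-AAAAAAAASA-N"

print(same_compound(key_v20, key_v21))
```

Matching on the first block alone deliberately ignores stereochemistry, which is what makes the 2.0 and 2.1 entries comparable; it also means that stereoisomers are counted as the same compound.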

From now on, new data will simply be appended to the current JSON file, so this problem will not happen again.

FAQ

  1. Why was the metadata in SMART 2.1 reconstructed?

    The metadata used in SMART 2.0 came from web-scraped data from Chen, but it had many encoding issues. So I generated new data from many new sources and used it to build the SMART 2.1 JSON file.

  2. Why are there so many duplicates in SMART 2.1? There are 6 kinds of swinholide A in the DB.

    Each swinholide A entry has a different NMR spectrum, because the entries come from different solvents, different literature sources, or planar versus stereo structures of swinholide A.

  3. Some compounds are missing in the new SMART 2.1.

    There are two ways a compound can go missing. First, new data can enter the results and push previous hits out of the top 100. Second, the compound may really be missing from the DB. Some compounds (most of them from Marinlit) have disappeared from SMART 2.1, including ilamycin (sorry Raphael, my fault). But there are many replacements, so it should not matter much.
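The first case in answer 3 can be illustrated with a toy ranking. This is only a sketch: the compound names, scores, and the `top_k` helper are made up, and `k=3` stands in for the top 100.

```python
import heapq


def top_k(scores: dict, k: int) -> list:
    """Return the k names with the highest score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical cosine scores for one query against the old and new DB.
old_db = {"compound_a": 0.95, "compound_b": 0.90, "compound_c": 0.85}
new_db = {**old_db, "compound_d": 0.93, "compound_e": 0.91}

before = top_k(old_db, k=3)
after = top_k(new_db, k=3)

# Hits that were in the old top-k but no longer make the cut.
pushed_out = [name for name in before if name not in after]
print(pushed_out)
```

The scores of the old entries are unchanged; they drop out only because better-scoring new entries take their places.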

If you have any questions, I'll do my best to answer.

RaphaelR87 commented 4 years ago
  1. What does 'n.a.' mean? Not available?
RaphaelR87 commented 4 years ago

Re 3: there seems to be a third possibility. Compounds appear to be missing, but their label was deleted or somehow changed to 'unnamed', e.g. unnamed_ilamycinC1.

RaphaelR87 commented 4 years ago
  1. What does the 'source' GNPS mean? I assume you mean that the SMILES are derived from GNPS, right? It seems to be a little misleading to me as the other two (ACDlabs/JEOL) are the only real sources in terms of HSQC spectra
ChungmaruQ commented 4 years ago

Labels such as N.A or unnamed come from the individual databases (npatlas, marinlit, npass) or from me. It is hard to match every compound's 'Synonym' with its SMILES, so there is no plan to fix this. Users who want to know the synonym can download the results and look up the molecule's details themselves.

I counted the entries labeled unnamed or n.a.: there are 8,461. In other words, 95,888 compounds are okay.
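A count like this could be reproduced with a few lines of Python. The sketch below is an assumption throughout: it supposes the JSON file is a list of records with a `Name` field, which this thread does not confirm, and the helper names and sample records are hypothetical.

```python
import json


def is_placeholder(name) -> bool:
    """True if the name is missing, a form of 'n.a.', or starts with 'unnamed'."""
    if name is None:
        return True
    text = str(name).strip().lower()
    return text in ("", "n.a.", "n.a", "n a") or text.startswith("unnamed")


def count_placeholders(records) -> int:
    """Count records whose 'Name' field (assumed key) is a placeholder."""
    return sum(1 for rec in records if is_placeholder(rec.get("Name")))


# Tiny stand-in for the real file; in practice one would json.load() it.
sample = json.loads(
    '[{"Name": "Turkesterone"}, {"Name": "unnamed_ilamycinC1"},'
    ' {"Name": "N.A."}, {"Name": "n.a"}]'
)
print(count_placeholders(sample), "of", len(sample))
```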

ChungmaruQ commented 4 years ago
  1. What does the 'source' GNPS mean? I assume you mean that the SMILES are derived from GNPS, right? It seems to be a little misleading to me as the other two (ACDlabs/JEOL) are the only real sources in terms of HSQC spectra

You're correct. But it also serves a purpose for the grant proposal: it shows the connection between GNPS and SMART. By the way, if we don't plan to provide NMR spectra, is it useful to keep the 'from' column in the results? I think removing the 'from' column from the results would be clearer and would avoid the ambiguity.

Fixed

ChungmaruQ commented 4 years ago

Fixed update link: https://www.dropbox.com/s/ds2u2173ie3unov/DB_07022020_SM2.1%28100K%29_forRR.json?dl=0

changelog:

RaphaelR87 commented 4 years ago

Awesome, thanks :) And for the future, please let me know how I can help with these things.

mwang87 commented 4 years ago

Deployed to production at smart.ucsd.edu. Check when you have time. I cut an official release as well. Closing issue.