Closed ChungmaruQ closed 4 years ago
Regarding 3: there seems to be a third option. Compounds seem to be missing, but their label was deleted or somehow changed to 'unnamed'.
Labels such as N.A, unnamed, or something else come from the individual databases (npatlas, marinlit, npass) or from me. It is hard to match every compound's 'Synonym' with its SMILES, so there is no plan to fix this issue. If users want to know the synonym, they can download the result and look up the molecule's details themselves. I checked the number of unnamed or n.a. entries, and it is 8,461. In other words, 95,888 compounds are okay.
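For anyone who wants to reproduce a count like the one above on a downloaded result, here is a minimal sketch. It assumes the records are a list of dicts with a `Name` field and that placeholder labels look like 'N.A', 'n.a.' or 'unnamed'; the field name and the exact label set are assumptions, not the actual schema of the SMART json file.

```python
from collections import Counter

# Assumed placeholder labels, compared after lowercasing and dropping trailing dots.
PLACEHOLDER_NAMES = {"n.a", "unnamed"}


def is_placeholder(name):
    """Return True when a compound name is missing or a placeholder label."""
    if name is None:
        return True
    normalized = name.strip().lower().rstrip(".")
    return normalized == "" or normalized in PLACEHOLDER_NAMES


def count_placeholders(records, name_field="Name"):
    """Count named vs. placeholder entries; `name_field` is a hypothetical key."""
    counts = Counter()
    for record in records:
        key = "placeholder" if is_placeholder(record.get(name_field)) else "named"
        counts[key] += 1
    return counts


sample = [{"Name": "Turkesterone"}, {"Name": "N.A"}, {"Name": "unnamed"}, {"Name": "n.a."}]
print(count_placeholders(sample))
```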
- What does the 'source' GNPS mean? I assume you mean that the SMILES are derived from GNPS, right? It seems to be a little misleading to me as the other two (ACDlabs/JEOL) are the only real sources in terms of HSQC spectra
You're correct. But it also serves a purpose for the 'grant proposal': it shows the connection between GNPS and SMART. By the way, if we don't plan to provide NMR spectra, is it useful to provide the 'from' column in the result? I think it would be clearer to remove the 'from' column from the result to avoid ambiguity.
Fixed
Fixed update link: https://www.dropbox.com/s/ds2u2173ie3unov/DB_07022020_SM2.1%28100K%29_forRR.json?dl=0
changelog:
awesome thanks :) And for the future, please let me know how I can help with those things
Deployed to production at smart.ucsd.edu. Check when you have time. I cut an official release as well. Closing issue.
Problem: some compounds have different styles of SMILES in the json files of 2.0 and 2.1.
Reason: the metadata in SMART 2.1's json file was completely regenerated and re-validated. Example (Name, SMILES):
In 2.0: Turkesterone(Calcd_CDCl3), CC(C)(O)CCC(O)C(C)(O)C1CCC2(O)C3=CC(=O)C4CC(O)C(O)CC4(C)C3C(O)CC12C
In 2.1: Turkesterone, CC(C)(O)CC[C@@H](O)[C@](C)(O)[C@H]1CC[C@@]2(O)C3=CC(=O)[C@@H]4C[C@@H](O)[C@@H](O)C[C@]4(C)[C@H]3[C@H](O)C[C@]12C
However, both entries have the same embeddings, so they show the same cosine score in the result.
Solution: use the first block of the inchi_key for validation.
In 2.0: Turkesterone(Calcd_CDCl3), WSBAGDDNVWTLOM
In 2.1: Turkesterone, WSBAGDDNVWTLOM
In the future, new data will simply be stacked on top of the current json file, so this problem will not happen again.
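The first-block check described above could be sketched roughly as follows. The first 14 characters of an InChIKey hash the molecular connectivity only, while stereochemistry is encoded in the second block, which is why a flat SMILES and a stereo-annotated SMILES of the same skeleton agree on it. The record layout (`Name`, `inchi_key`) and the function names are assumptions for illustration, not the actual SMART code.

```python
def first_block(inchikey):
    """Return the 14-character connectivity block of an InChIKey.

    Accepts either a full key (three dash-separated blocks) or just the first block.
    """
    return inchikey.strip().upper().split("-")[0]


def match_by_first_block(old_records, new_records, key_field="inchi_key"):
    """Pair up record names from two lists whose InChIKey first blocks agree.

    `key_field` and the "Name" key are hypothetical field names.
    """
    index = {}
    for record in new_records:
        index.setdefault(first_block(record[key_field]), []).append(record)

    pairs = []
    for old in old_records:
        for new in index.get(first_block(old[key_field]), []):
            pairs.append((old["Name"], new["Name"]))
    return pairs


# The two Turkesterone entries quoted above, keyed by their first block.
old_db = [{"Name": "Turkesterone(Calcd_CDCl3)", "inchi_key": "WSBAGDDNVWTLOM"}]
new_db = [{"Name": "Turkesterone", "inchi_key": "WSBAGDDNVWTLOM"}]
print(match_by_first_block(old_db, new_db))
```

Because the first block ignores stereochemistry, this kind of matching also treats stereoisomers as the same compound, which may or may not be what you want for a given check.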
FAQ
Why was the metadata in SMART 2.1 reconstructed?
Why are there so many duplicates in SMART 2.1? There are 6 kinds of swinholide A in the DB
Some compounds are missing in new SMART 2.1
And if you have any questions, I'll do my best to answer.