raver8 / ML_chemical


31 duplicate chemical records (by InChIKey) #11

Closed jschrier closed 1 year ago

jschrier commented 1 year ago

I've added a check for duplicated InChIKeys in the chemical dictionary entries.

The new merge_chemical_dictionaries.nb includes checks for duplicate InChIKeys both within a dictionary and between dictionaries.
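For readers without Mathematica, the within-dictionary check can be approximated by the following minimal Python sketch. It is not the notebook's code. The record layout (a list of objects with an "InChIKey" field), the placeholder keys, and the function name are assumptions for illustration only; the real JSON schema is not shown in this thread.

```python
from collections import Counter

def duplicate_inchikeys(records):
    """Return {InChIKey: count} for every key that occurs more than once."""
    counts = Counter(rec["InChIKey"] for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical records with placeholder keys (not real InChIKeys).
example_records = [
    {"InChIKey": "KEY-A", "name": "compound A"},
    {"InChIKey": "KEY-B", "name": "compound B"},
    {"InChIKey": "KEY-A", "name": "compound A, second entry"},
    {"InChIKey": "KEY-A", "name": "compound A, third entry"},
]

within_dups = duplicate_inchikeys(example_records)
print(within_dups)       # per-key counts, analogous to the first association
print(len(within_dups))  # number of duplicated keys, analogous to the per-file totals
```

To run this against a real file, the records would first be loaded with `json.load`, adjusted to whatever structure the dictionary files actually use.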

I've found 31 duplicated InChIKeys. @raver8 is the worst offender.

 <|"../chemical_dictionaries/chem_dictionary_records.json" -> \
<|"SPPWGCYEYAMHDT-UHFFFAOYSA-N" -> 2, 
   "SEGLCEQVOFDUPX-UHFFFAOYSA-N" -> 3, 
   "JVTAAEKCZFNVCJ-UHFFFAOYSA-N" -> 2, 
   "QPCDCPDFJACHGM-UHFFFAOYSA-N" -> 2, 
   "BDAGIHXWWSANSR-UHFFFAOYSA-N" -> 4, 
   "QTBSBXVTEAMEQO-UHFFFAOYSA-N" -> 2, 
   "XBDQKXXYIPTUBI-UHFFFAOYSA-N" -> 2, 
   "FERIUCNNQQJTOY-UHFFFAOYSA-N" -> 2, 
   "KRKNYBCHXYNGOX-UHFFFAOYSA-N" -> 2, 
   "OFOBLEOULBTSOW-UHFFFAOYSA-N" -> 2, 
   "BWLBGMIXKSTLSX-UHFFFAOYSA-N" -> 2, 
   "DHMQDGOQFOQNFH-UHFFFAOYSA-N" -> 2, 
   "KBPLFHHGFOOTCA-UHFFFAOYSA-N" -> 2, 
   "LQNUZADURLCDLV-UHFFFAOYSA-N" -> 2, 
   "DVARTQFDIMZBAA-UHFFFAOYSA-O" -> 2, 
   "KZPDJENXRHZMGL-UHFFFAOYSA-N" -> 2, 
   "UQLASUALFZVGHF-RNPORBBMSA-N" -> 2, 
   "UQLASUALFZVGHF-KYJUHHDHSA-N" -> 2, 
   "LNCOHMATSYURAP-UHFFFAOYSA-N" -> 2, 
   "NGSFWBMYFKHRBD-UHFFFAOYSA-M" -> 2, 
   "URDCARMUOSMFFI-UHFFFAOYSA-N" -> 2, 
   "RAEOEMDZDMCHJA-UHFFFAOYSA-N" -> 2, 
   "SNRUBQQJIBEYMU-UHFFFAOYSA-N" -> 2, 
   "QEHSHNRQHFAQEH-UHFFFAOYSA-N" -> 2, 
   "BNKPOTVRINVBDS-UHFFFAOYSA-N" -> 2, 
   "STCOOQWBFONSKY-UHFFFAOYSA-N" -> 2|>, 
 "../chemical_dictionaries/chem_dictionary_records_OV.json" -> \
<|"VRZYWIAVUGQHKB-UHFFFAOYSA-N" -> 2, 
   "HIELYQQIYOVIAZ-UHFFFAOYSA-N" -> 2|>, 
 "../chemical_dictionaries/chem_dictionary_records_RM.json" -> \
<|"SNRUBQQJIBEYMU-UHFFFAOYSA-N" -> 2, 
   "STCOOQWBFONSKY-UHFFFAOYSA-N" -> 2, 
   "QSFRTQJEHPGZSO-UHFFFAOYSA-N" -> 2|>|>

<|"../chemical_dictionaries/chem_dictionary_records.json" -> 26, 
 "../chemical_dictionaries/chem_dictionary_records_OV.json" -> 2, 
 "../chemical_dictionaries/chem_dictionary_records_RM.json" -> 3|>

ACTION: Merge these intelligently (make sure that synonyms are correct and we have the most expansive set of synonyms) after solving the previous issue
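One possible way to do the mechanical part of this merge is sketched below in Python: records sharing an InChIKey are collapsed into one entry whose synonym list is the union of all synonyms seen, in first-seen order. The "synonyms" field and the function name are hypothetical, since the dictionary schema is not shown here. Checking that the synonyms are actually correct still needs a human.

```python
def merge_by_inchikey(records):
    """Collapse records that share an InChIKey, keeping the union of synonyms."""
    merged = {}
    for rec in records:
        key = rec["InChIKey"]
        # The first record seen for a key creates the merged entry.
        entry = merged.setdefault(key, {"InChIKey": key, "synonyms": []})
        for synonym in rec.get("synonyms", []):
            if synonym not in entry["synonyms"]:  # exact-match de-duplication
                entry["synonyms"].append(synonym)
    return list(merged.values())

# Hypothetical records with a placeholder key (not a real InChIKey).
records_to_merge = [
    {"InChIKey": "KEY-A", "synonyms": ["name one", "name two"]},
    {"InChIKey": "KEY-B", "synonyms": ["other compound"]},
    {"InChIKey": "KEY-A", "synonyms": ["name two", "name three"]},
]

merged_records = merge_by_inchikey(records_to_merge)
for entry in merged_records:
    print(entry["InChIKey"], entry["synonyms"])
```

The comparison is case-sensitive and exact, so "Ethanol" and "ethanol" would both be kept. Whether to normalize case or whitespace is a decision for whoever curates the dictionary.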

oliviavanden commented 1 year ago

All of the duplicate InChIKeys within a dictionary have been resolved.

jschrier commented 1 year ago

There are still 11 duplicates "between" dictionaries (i.e., entries that have the same InChIKey in one dictionary and in another dictionary).
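The "between" check can be illustrated with a short Python sketch that intersects the InChIKey sets of every pair of dictionaries. This is an illustration under assumed names and an assumed record layout, not the Mathematica notebook's implementation.

```python
from itertools import combinations

def shared_inchikeys(dictionaries):
    """Map each pair of dictionary names to the InChIKeys that appear in both."""
    key_sets = {
        name: {rec["InChIKey"] for rec in records}
        for name, records in dictionaries.items()
    }
    shared = {}
    for first, second in combinations(sorted(key_sets), 2):
        common = key_sets[first] & key_sets[second]
        if common:
            shared[(first, second)] = sorted(common)
    return shared

# Hypothetical dictionaries with placeholder keys (not real InChIKeys).
example_dictionaries = {
    "combined": [{"InChIKey": "KEY-A"}, {"InChIKey": "KEY-B"}],
    "OV": [{"InChIKey": "KEY-B"}, {"InChIKey": "KEY-C"}],
    "RM": [{"InChIKey": "KEY-A"}, {"InChIKey": "KEY-D"}],
}

between_dups = shared_inchikeys(example_dictionaries)
for pair, keys in between_dups.items():
    print(pair, keys)
```

Because each dictionary is reduced to a set first, repeats inside a single file do not affect this result. Only keys present in two different files are reported.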

oliviavanden commented 1 year ago

That's my bad for closing it accidentally. I apologize!

jschrier commented 1 year ago

No need for apologies!

On Fri, Oct 13, 2023, 07:30 oliviavanden @.***> wrote:

That's my bad for closing it accidentally. I apologize!

oliviavanden commented 1 year ago

The InChIKey duplicates are being whittled down. One remaining issue: because the RM and OV chemical dictionary records are still in the review branch, they still register as duplicates in the Mathematica script. Should I delete the OV and RM chemical dictionary records from the review branch as well?

jschrier commented 1 year ago

Should the "review" branch be merged into "main"?

Then, when you switch to main, you should only have one file present...

oliviavanden commented 1 year ago

I merged review into main, and when I switch to main only one file is present. However, when I switch to review, three files are present. I'm not sure whether that is what is causing my problem. When I open the Mathematica script, it reports 2 repeated InChIKeys between dictionaries. When I check which molecules are repeated, only one copy of each appears in the combined chemical dictionary record, but the same molecule is also present in the OV or RM records. Essentially, the Mathematica script reports repeats even though the combined record contains none; the only repeats it finds are between the records.

oliviavanden commented 1 year ago

Many of these duplicate InChIKeys were simply duplicate molecules. The only place I got stuck was where the repeated entries had different SMILES. However, they all corresponded to the same molecule.
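The same molecule can be written as several valid SMILES strings, so entries can share an InChIKey while their SMILES differ. As a rough Python sketch, such cases can be flagged for review by grouping records by InChIKey and collecting the distinct SMILES. The "SMILES" field and the function name are assumptions about the record layout. The ethanol example uses two equivalent SMILES for the same compound.

```python
from collections import defaultdict

def inchikeys_with_multiple_smiles(records):
    """Return {InChIKey: [distinct SMILES]} where more than one SMILES is recorded."""
    smiles_by_key = defaultdict(set)
    for rec in records:
        smiles_by_key[rec["InChIKey"]].add(rec["SMILES"])
    return {
        key: sorted(smiles)
        for key, smiles in smiles_by_key.items()
        if len(smiles) > 1
    }

# Ethanol written two ways; both strings describe the same molecule.
smiles_records = [
    {"InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "SMILES": "CCO"},
    {"InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "SMILES": "OCC"},
    {"InChIKey": "KEY-B", "SMILES": "C"},  # placeholder key, single SMILES
]

smiles_conflicts = inchikeys_with_multiple_smiles(smiles_records)
print(smiles_conflicts)
```

This only compares strings. Choosing which SMILES to keep, for example by canonicalizing them with a cheminformatics toolkit, is a separate step that this sketch does not attempt.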