raver8 / ML_chemical


31 duplicate chemical records (by InChIKey) #11

Closed jschrier closed 10 months ago

jschrier commented 11 months ago

I've added a check for duplicated InChIKeys in the chemical dictionary entries.

The new merge_chemical_dictionaries.nb checks for duplicate InChIKeys both within a dictionary and between dictionaries.
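For anyone who wants to reproduce the within-dictionary check outside the notebook, here is a minimal Python sketch of the same idea. It is not the notebook's code. It assumes each dictionary file loads as a list of records with an `InChIKey` field; that schema and the placeholder keys below are assumptions for illustration.

```python
from collections import Counter

def duplicate_inchikeys(records):
    """Return {InChIKey: count} for every key that occurs more than once."""
    counts = Counter(rec["InChIKey"] for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical records with placeholder keys (the real JSON schema may differ).
records = [
    {"InChIKey": "KEY-A", "name": "first entry"},
    {"InChIKey": "KEY-B", "name": "second entry"},
    {"InChIKey": "KEY-A", "name": "first entry again"},
]

print(duplicate_inchikeys(records))
```

The length of the returned mapping is the number of duplicated keys in that file, which is the kind of per-file count reported in this issue.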

I've found 31 instances of duplication. @raver8 is the worst offender.

 <|"../chemical_dictionaries/chem_dictionary_records.json" -> \
<|"SPPWGCYEYAMHDT-UHFFFAOYSA-N" -> 2, 
   "SEGLCEQVOFDUPX-UHFFFAOYSA-N" -> 3, 
   "JVTAAEKCZFNVCJ-UHFFFAOYSA-N" -> 2, 
   "QPCDCPDFJACHGM-UHFFFAOYSA-N" -> 2, 
   "BDAGIHXWWSANSR-UHFFFAOYSA-N" -> 4, 
   "QTBSBXVTEAMEQO-UHFFFAOYSA-N" -> 2, 
   "XBDQKXXYIPTUBI-UHFFFAOYSA-N" -> 2, 
   "FERIUCNNQQJTOY-UHFFFAOYSA-N" -> 2, 
   "KRKNYBCHXYNGOX-UHFFFAOYSA-N" -> 2, 
   "OFOBLEOULBTSOW-UHFFFAOYSA-N" -> 2, 
   "BWLBGMIXKSTLSX-UHFFFAOYSA-N" -> 2, 
   "DHMQDGOQFOQNFH-UHFFFAOYSA-N" -> 2, 
   "KBPLFHHGFOOTCA-UHFFFAOYSA-N" -> 2, 
   "LQNUZADURLCDLV-UHFFFAOYSA-N" -> 2, 
   "DVARTQFDIMZBAA-UHFFFAOYSA-O" -> 2, 
   "KZPDJENXRHZMGL-UHFFFAOYSA-N" -> 2, 
   "UQLASUALFZVGHF-RNPORBBMSA-N" -> 2, 
   "UQLASUALFZVGHF-KYJUHHDHSA-N" -> 2, 
   "LNCOHMATSYURAP-UHFFFAOYSA-N" -> 2, 
   "NGSFWBMYFKHRBD-UHFFFAOYSA-M" -> 2, 
   "URDCARMUOSMFFI-UHFFFAOYSA-N" -> 2, 
   "RAEOEMDZDMCHJA-UHFFFAOYSA-N" -> 2, 
   "SNRUBQQJIBEYMU-UHFFFAOYSA-N" -> 2, 
   "QEHSHNRQHFAQEH-UHFFFAOYSA-N" -> 2, 
   "BNKPOTVRINVBDS-UHFFFAOYSA-N" -> 2, 
   "STCOOQWBFONSKY-UHFFFAOYSA-N" -> 2|>, 
 "../chemical_dictionaries/chem_dictionary_records_OV.json" -> \
<|"VRZYWIAVUGQHKB-UHFFFAOYSA-N" -> 2, 
   "HIELYQQIYOVIAZ-UHFFFAOYSA-N" -> 2|>, 
 "../chemical_dictionaries/chem_dictionary_records_RM.json" -> \
<|"SNRUBQQJIBEYMU-UHFFFAOYSA-N" -> 2, 
   "STCOOQWBFONSKY-UHFFFAOYSA-N" -> 2, 
   "QSFRTQJEHPGZSO-UHFFFAOYSA-N" -> 2|>|>

<|"../chemical_dictionaries/chem_dictionary_records.json" -> 26, 
 "../chemical_dictionaries/chem_dictionary_records_OV.json" -> 2, 
 "../chemical_dictionaries/chem_dictionary_records_RM.json" -> 3|>

ACTION: Merge these intelligently (make sure that synonyms are correct and we have the most expansive set of synonyms) after solving the previous issue
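One possible way to do the mechanical part of this merge is sketched below in Python. It keeps the first record seen for each InChIKey and takes the union of the synonym lists, preserving order. The `InChIKey` and `synonyms` field names and the example records are assumptions, not the repository's actual schema, and checking that the synonyms are correct still needs a human.

```python
def merge_by_inchikey(records):
    """Collapse records sharing an InChIKey, taking the union of their synonyms."""
    merged = {}
    for rec in records:
        key = rec["InChIKey"]
        if key not in merged:
            # Copy the record so the input list is left untouched.
            merged[key] = {**rec, "synonyms": list(rec.get("synonyms", []))}
        else:
            seen = merged[key]["synonyms"]
            for synonym in rec.get("synonyms", []):
                if synonym not in seen:
                    seen.append(synonym)
    return list(merged.values())

# Hypothetical records with placeholder keys.
example = [
    {"InChIKey": "KEY-A", "synonyms": ["ethanol", "ethyl alcohol"]},
    {"InChIKey": "KEY-B", "synonyms": ["water"]},
    {"InChIKey": "KEY-A", "synonyms": ["ethyl alcohol", "EtOH"]},
]

merged_example = merge_by_inchikey(example)
print(merged_example)
```

Fields other than `synonyms` come from the first record only, so any disagreement between duplicates in those fields would need separate handling.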

oliviavanden commented 11 months ago

All duplicate InChIKeys within a dictionary have been resolved.

jschrier commented 11 months ago

There are still 11 duplicates "between" dictionaries (i.e., entries whose InChIKey appears in one dictionary and also in another).
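To make the "between" case concrete, here is a rough Python sketch that intersects the InChIKey sets of each pair of files. It is an illustration rather than the notebook's logic; the file names, the `InChIKey` field, and the placeholder keys are all assumptions.

```python
from itertools import combinations

def shared_inchikeys(dictionaries):
    """Map each pair of file names to the sorted InChIKeys they have in common."""
    key_sets = {
        name: {rec["InChIKey"] for rec in records}
        for name, records in dictionaries.items()
    }
    shared = {}
    for a, b in combinations(sorted(key_sets), 2):
        common = key_sets[a] & key_sets[b]
        if common:
            shared[(a, b)] = sorted(common)
    return shared

# Hypothetical file names and placeholder keys.
dictionaries = {
    "main.json": [{"InChIKey": "KEY-A"}, {"InChIKey": "KEY-B"}],
    "ov.json": [{"InChIKey": "KEY-A"}],
    "rm.json": [{"InChIKey": "KEY-C"}],
}

print(shared_inchikeys(dictionaries))
```

A key reported here can be unique within every individual file and still count as a between-dictionary duplicate, because it is present in two files at once.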

oliviavanden commented 11 months ago

That's my bad for closing it accidentally. I apologize!

jschrier commented 11 months ago

No need for apologies!

On Fri, Oct 13, 2023, 07:30 oliviavanden @.***> wrote:

That's my bad for closing it accidentally. I apologize!


oliviavanden commented 10 months ago

The InChIKey duplicates are being pared down. One issue is that the RM and OV chemical dictionary records are still in the review branch, so they still register as duplicates in the Mathematica script. Should I delete the OV and RM chemical dictionary records from review as well?

jschrier commented 10 months ago

Should the "review" branch be merged into "main"?

Then, when you switch to main, you should only have one file present...

oliviavanden commented 10 months ago

I merged review into main, and when I switch to main only one file is present. When I switch to review, however, three files are present. I'm not sure whether that is what is causing my problem: the Mathematica script reports 2 repeated InChIKeys between dictionaries, but when I check which molecules are repeated, each one appears only once in the combined chemical dictionary record. It is also present in the OV or RM file. In other words, the script reports repeats even though the combined record contains none; the only duplicates it finds are between the records.

oliviavanden commented 10 months ago

Many of these duplicate InChIKeys were simply duplicate molecules. The only place I got stuck was where the repeated entries had different SMILES strings, although they all corresponded to the same molecule.
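Different SMILES strings for one InChIKey are not necessarily an error, since a molecule can be written as several valid SMILES unless the strings are canonicalized. A small Python sketch for listing those cases for manual review follows. The `InChIKey` and `SMILES` field names and the placeholder key are assumptions; the two SMILES shown are both ordinary ways of writing ethanol.

```python
from collections import defaultdict

def conflicting_smiles(records):
    """Return {InChIKey: sorted SMILES} for keys recorded with more than one SMILES string."""
    by_key = defaultdict(set)
    for rec in records:
        by_key[rec["InChIKey"]].add(rec["SMILES"])
    return {key: sorted(smiles) for key, smiles in by_key.items() if len(smiles) > 1}

# Hypothetical records; "KEY-ETHANOL" is a placeholder, not a real InChIKey.
example_records = [
    {"InChIKey": "KEY-ETHANOL", "SMILES": "CCO"},
    {"InChIKey": "KEY-ETHANOL", "SMILES": "OCC"},
    {"InChIKey": "KEY-WATER", "SMILES": "O"},
]

print(conflicting_smiles(example_records))
```

This only compares the strings as text. Deciding which SMILES to keep, or canonicalizing them with a cheminformatics toolkit, is left as a separate step.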