raver8 / ML_chemical


Duplicate names across different chemical dictionary entries #12

Closed jschrier closed 1 year ago

jschrier commented 1 year ago

As suggested by @oliviavanden, I implemented a check for whether two different entries share any entry in their names lists. Each "name" is potentially an identifier for an extraction record, so names must be unique; otherwise it is unclear which entry a name refers to.
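For anyone who wants to reproduce this kind of check outside the notebook, here is a minimal Python sketch. It is not the code that produced the output in this comment. It assumes each dictionary file is a JSON list of records and that each record carries a "names" list; the function names `find_duplicate_names` and `entries_without_names` are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_names(records_by_file):
    """Return {name: set of (file, index)} for names listed by more than one entry.

    records_by_file maps a file path to its list of records (assumed layout).
    """
    owners = defaultdict(set)
    for path, records in records_by_file.items():
        for index, record in enumerate(records):
            # set() so a name repeated inside one entry is counted only once
            for name in set(record.get("names") or []):
                owners[name].add((path, index))
    return {name: where for name, where in owners.items() if len(where) > 1}


def entries_without_names(records_by_file):
    """Return (file, index) pairs for entries whose names list is missing or empty."""
    return [
        (path, index)
        for path, records in records_by_file.items()
        for index, record in enumerate(records)
        if not record.get("names")
    ]
```

To run it on real files, load each one with `json.load` and pass a dict of `{path: records}`. Keeping the `(file, index)` pairs makes it easy to see whether a clash is within one file or across files.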

There are entries in ../chemical_dictionaries/chem_dictionary_records.json that have no name specified!

There are also about a dozen cases in which the same name is used by entries in different dictionary files. I suspect these might resolve automatically once open issue #11 (same-inchi-across-two-different-dictionaries) is resolved.

<|"../chemical_dictionaries/chem_dictionary_records.json" -> {"No name specified"}, 
 "../chemical_dictionaries/chem_dictionary_records_OV.json" -> {}, 
 "../chemical_dictionaries/chem_dictionary_records_RM.json" -> {}|>

{"n-Dodecane", "112-40-3", "Dihexyl", "111-87-5",
 "2,2'-Oxybis(N,N-dioctylacetamide)", "342794-43-8", "No name specified",
 "Dicyclohexano-18-crown-6", "16069-36-6", "Dicyclohexyl-18-crown-6",
 "cis-Dicyclohexano-18-crown-6", "18-Crown-6",
 "N,N'-dibutyl-N,N'-di(1-methylheptyl)-diglycolamide"}

oliviavanden commented 1 year ago

That's really interesting that some have no name specified! I'll look into the issue and see what's going on with it!

jschrier commented 1 year ago

It may also be the case that the "name" dictionary key is incorrect for an entry (analogous to the InChI error we found yesterday).

oliviavanden commented 1 year ago

Now that everything is combined into one document, some names turn out to have different SMILES. For one molecule, the entries carried the same information but different SMILES strings. I have to look into this further.

oliviavanden commented 1 year ago

Most of the duplicate names simply went along with duplicate InChIKeys and referred to the same molecule. The only issue I ran into was with 18-crown-6 and potassium acetate 18-crown-6.
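One way to separate the harmless duplicates from the real conflicts is to compare the InChIKeys of all records that share a name. The sketch below is only an illustration and not the procedure actually used here. It assumes the combined document is a list of records with a "names" list and an "inchikey" field; both field names and the function `classify_shared_names` are hypothetical.

```python
from collections import defaultdict


def classify_shared_names(records):
    """Split names used by two or more records into (redundant, conflicting).

    redundant:   every record using the name has the same InChIKey
    conflicting: records using the name disagree on the InChIKey
    """
    keys_by_name = defaultdict(list)  # name -> one InChIKey per record using it
    for record in records:
        # set() so a name repeated inside one record is counted only once
        for name in set(record.get("names") or []):
            keys_by_name[name].append(record.get("inchikey"))

    redundant, conflicting = [], []
    for name, keys in keys_by_name.items():
        if len(keys) < 2:
            continue  # name is used by a single record, nothing to resolve
        if len(set(keys)) == 1:
            redundant.append(name)
        else:
            conflicting.append(name)
    return sorted(redundant), sorted(conflicting)
```

Names in the first list can be merged once the duplicate InChIKey records are merged; names in the second list need a manual look. A record with no InChIKey is treated as having the key `None`, so it shows up as a conflict against any record that does have one.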