Compound name formatting is broken for some GlyGen entries

jessicalumian commented 2 years ago

The names of many compounds have formatting issues. It seems like all of these compounds were submitted by GlyGen. There are a lot of compound entries, so it's hard to tell whether all GlyGen entries are broken or just some.

If possible, we should remove the broken names before we release this round of data.

karlcz commented 2 years ago

Don't the compound names come from the CV table prep script that we provide?

I see similar values in the actual compound CV term table accumulated in the submission system. E.g. this link will serve you some raw CSV content: https://app.nih-cfde.org/ermrest/catalog/registry/attribute/CFDE:compound/name::regexp::WURCS/id,name?limit=20&accept=csv
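As a rough illustration of that check, here is a minimal Python sketch that rebuilds the same query URL and picks the WURCS-style names out of the returned CSV. The helper names `build_url` and `wurcs_like` are made up for this example, the sample rows are placeholders rather than real registry records, and I'm assuming the CSV header is `id,name` as the projection in the URL suggests.

```python
import csv
import io
import re
from urllib.parse import quote

# Base path copied from the link above; everything else here is illustrative.
BASE = "https://app.nih-cfde.org/ermrest/catalog/registry/attribute/CFDE:compound"


def build_url(pattern="WURCS", limit=20):
    """Rebuild the query URL: filter on name by regexp, project id and name."""
    return f"{BASE}/name::regexp::{quote(pattern)}/id,name?limit={limit}&accept=csv"


def wurcs_like(csv_text, pattern="WURCS"):
    """Return (id, name) pairs whose name matches the pattern.

    Assumes the CSV has 'id' and 'name' columns (hypothetical header).
    """
    rx = re.compile(pattern)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["id"], row["name"]) for row in reader if rx.search(row["name"])]


# Placeholder rows, NOT real compound records.
SAMPLE = "id,name\nEXAMPLE:1,WURCS=2.0/placeholder\nEXAMPLE:2,plain name\n"

print(build_url())
print(wurcs_like(SAMPLE))
```

To run it against the live table, the text passed to `wurcs_like` could come from something like `urllib.request.urlopen(build_url()).read().decode()`, assuming the endpoint is reachable without authentication.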

The display is data-driven and uses these actual term records in the C2M2 submissions. It's not clear to me that there is any issue here unless a better conversion from the source ontology is desired, and I am afraid that would require revising the script and preparing all affected submissions over again with improved compound.tsv content. The results would likely be unsatisfactory if we continue to get ugly compound names in the submission system/release constituents, as there is no mechanism to indicate which value is "correct"...

abradyIGS commented 2 years ago

This is not broken formatting; it's just how PubChem names their compounds.

https://pubchem.ncbi.nlm.nih.gov/#query=10008718

nih-cfde / dashboard

Compound name formatting is broken for some GlyGen entries #132