glygener / glygen.cfde.generator

Java program for the generation of CFDE metadata files from GlyGen data.
GNU General Public License v3.0

Virus protein accessions without Ensembl mapping #19

Open jeet-vora opened 2 years ago

jeet-vora commented 2 years ago

Hi Jessica and Arthur, in GlyGen we have protein and glycan data for medically important virus species such as SARS-CoV and HCV. We are planning to submit the data for these species; however, the proteins do not have Ensembl mappings, as viruses do not have chromosomes.

Do you have any suggestions on how we can tackle this in order to submit the data? Thanks. We are also adding mouse and rat data, so if any issue arises we will bring it to your attention. Maybe in a few days we can get together on a call to discuss these issues, including the ones reported by Rene.

ReneRanzinger commented 2 years ago

Arthur Brady commented:

You could submit UniProt accessions, if they exist, and describe the proteins just as proteins, not genes. We do not in principle support draft genomic data because of its basic instability, although if there’s some entity or organization governing COVID gene nomenclature, using data from such a source might be a possibility. Ensembl only provides IDs for genes for selected model organisms (although there are a large number of them) – for genes from organisms not represented in Ensembl, we can import IDs from other spaces (as we have done for GlyTouCan IDs not present in PubChem), but we would still need some sort of ID-issuing authority to have created stable identifiers for the genetic objects in question. Until/unless that’s done, we won’t be able to integrate draft (or anonymous) data alongside stable identifiers, for obvious reasons.

nsuvarnaiari commented 1 year ago

Hi @jeet-vora and @ReneRanzinger

I think this is still an open issue. As Arthur mentioned in his comment, if you know a reliable, stable authoritative source for viral genes (COVID and HCV), we can import those IDs to include in our controlled vocabulary so that you can start using them in your next submission (next year). Let us know.

Thanks, Suvvi @jonathancrabtree @mgiglio99

jeet-vora commented 1 year ago

Hi @nsuvarnaiari

There is no gene nomenclature resource for viruses as of now. We use UniProt for virus genes, as it is currently the best curated resource for virus proteins and genes. Another option is to use NCBI Gene. In GlyGen, the protein- and gene-related information comes from UniProt, and most entries can be mapped to an Ensembl Gene ID and/or an NCBI GeneID.

@ReneRanzinger @jonathancrabtree @mgiglio99

nsuvarnaiari commented 1 year ago

Hi @jeet-vora

NCBI Gene sounds like a good option if that works for you. Will you be able to provide us with a list of NCBI Gene IDs for your set of viral genes?

Thanks, Suvvi @jonathancrabtree @mgiglio99