nih-cfde / helpdesk-collaboration


HELP-406 UniProtKB ac to Ensembl gene mapping issue #3

Open jeremywalter opened 1 year ago

jeremywalter commented 1 year ago

During the last submission, we were unable to add gene information because the validation script found an error in collection_gene.tsv. On investigation, we found that the file failed because of discrepancies in the UniProtKB accession (ac) to Ensembl gene mapping. Below are the reasons it failed and possible solutions.

Two issues surfaced -
1) Invalid UniProtKB ac to Ensembl gene mapping - GlyGen and CFDE use different, out-of-sync versions of the UniProtKB data, which caused this issue; it will likely recur in the next submission.

2) Proteins having no Ensembl gene mapping - A few thousand proteins have no corresponding gene ID in Ensembl.

Suggested solutions -

For 1) The UniProtKB ac to Ensembl gene mapping should be annotated automatically by CFDE rather than by the DCC. This resolves the synchronization issue that arises when the DCC uses an older data version. If protein_gene is created automatically by the script (as mentioned on the page), there won't be any version-related discrepancies.

For 2) If a protein has no UniProtKB ac to Ensembl gene mapping, the script should skip the gene mapping check for that entry but still validate the protein. To reduce such cases, a secondary UniProtKB ac to gene mapping, e.g. NCBI GeneID, can be used. A few entries could still be left without a gene mapping.
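To make the two failure modes concrete, here is a minimal sketch of how submitted mappings could be sorted into buckets and how a relaxed check might treat unmapped proteins as non-blocking. It is not the CFDE validation script: the function names, the `allow_unmapped` flag, and the in-memory dictionaries are all assumptions for illustration, and the identifiers are placeholders rather than real UniProtKB or Ensembl records.

```python
from typing import Dict, List, Optional, Set


def classify_mappings(
    dcc_map: Dict[str, Optional[str]],
    reference_map: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Sort each submitted accession into one of four buckets.

    dcc_map:       accession -> gene ID submitted by the DCC (None if no mapping)
    reference_map: accession -> gene IDs known to the reference release
    """
    buckets: Dict[str, List[str]] = {
        "valid": [], "mismatch": [], "unmapped": [], "unknown_accession": [],
    }
    for accession, gene in sorted(dcc_map.items()):
        if accession not in reference_map:
            # Accession absent from the reference release (version skew).
            buckets["unknown_accession"].append(accession)
        elif gene is None:
            # Issue 2: the protein has no gene mapping at all.
            buckets["unmapped"].append(accession)
        elif gene in reference_map[accession]:
            buckets["valid"].append(accession)
        else:
            # Issue 1: the submitted gene disagrees with the reference.
            buckets["mismatch"].append(accession)
    return buckets


def blocking_errors(
    buckets: Dict[str, List[str]], allow_unmapped: bool = True
) -> List[str]:
    """Return accessions that should fail validation.

    With allow_unmapped=True (hypothetical relaxed mode), proteins without
    a gene mapping are accepted instead of rejected.
    """
    blocking = buckets["mismatch"] + buckets["unknown_accession"]
    if not allow_unmapped:
        blocking = blocking + buckets["unmapped"]
    return sorted(blocking)


# Placeholder identifiers, not real UniProtKB or Ensembl records.
reference = {"ACC1": {"GENE_A"}, "ACC2": {"GENE_B"}, "ACC3": set()}
submitted = {"ACC1": "GENE_A", "ACC2": "GENE_X", "ACC3": None, "ACC4": "GENE_C"}

buckets = classify_mappings(submitted, reference)
print(buckets)
print(blocking_errors(buckets))
```

If CFDE generated the mapping itself, as proposed in 1), the "mismatch" bucket would be empty by construction, leaving only unknown accessions and unmapped proteins to handle.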

ReneRanzinger commented 1 year ago

This is related to https://github.com/glygener/glygen.cfde.generator/issues/21 and https://github.com/glygener/glygen.cfde.generator/issues/17.

jeet-vora commented 1 year ago

@jonathancrabtree @nsuvarnaiari @mgiglio99 @ReneRanzinger

As per our discussion last week about the UniProtKB ac to Ensembl gene mapping issue, our proposal is below (also in the comment above). We are also sharing our files so you can run them and see the errors.

1) The UniProtKB ac to Ensembl gene mapping should be annotated automatically by CFDE rather than by the DCC. This will resolve the synchronization issue that arises when the DCC uses an older data version.

2) Overall, relax the stringent requirement of UniProtKB ac to Ensembl gene mapping, which will allow the UniProtKB accessions without Ensembl gene mapping to make it to the portal, including virus proteins.

Here is the folder with the GlyGen-generated files for submission. It contains two zip files of TSVs: 1) tsv_unfiltered.zip, which includes the proteins with mapping issues (as well as some genes and glycans), and 2) tsv_unfiltered.zip, in which the proteins with mapping issues have been filtered out (as well as some genes and glycans).

error_cfde_jan_submission.txt contains the erroneous proteins, glycans and genes. CFDE-GlyGen.zip

If you want to generate the TSV files yourself, here is the readme for the code - https://github.com/glygener/glygen.cfde.generator#readme

nsuvarnaiari commented 1 year ago

Hi @jeet-vora @ReneRanzinger

We are planning to update the UniProtKB file (protein.tsv.gz) available on OSF to a newer version downloaded on April 26th. The version currently available on OSF is from Nov 09, 2022. Please let us know which version you would prefer to use for your June submission. If you want us to update to the newer version (April 26), we will start the process this week. If the newer version is going to cause issues at your end and you want to stick with the current version, we will not update it. Please let us know.

Thanks, Suvvi @jonathancrabtree @mgiglio99 @RLC-DCPPC

jeet-vora commented 1 year ago

Hi @nsuvarnaiari,

We would prefer that you update protein.tsv.gz to the newer UniProtKB release.

The newer version will update the accession and mapping space but it won't necessarily solve the UniProtKB to Ensembl gene mapping issue entirely.
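As a rough illustration of what "updating the accession and mapping space" means, the sketch below compares two releases of an accession-to-gene mapping and reports what was added, removed, or remapped. The function name, the dictionary representation, and the identifiers are assumptions for illustration only; they do not reflect the actual layout of protein.tsv.gz.

```python
from typing import Dict, List, Optional


def diff_releases(
    old: Dict[str, Optional[str]], new: Dict[str, Optional[str]]
) -> Dict[str, List[str]]:
    """Compare two accession -> gene mappings from different releases."""
    old_accs, new_accs = set(old), set(new)
    shared = old_accs & new_accs
    return {
        # Accessions that only exist in the newer release.
        "added": sorted(new_accs - old_accs),
        # Accessions dropped (e.g. merged or deleted) in the newer release.
        "removed": sorted(old_accs - new_accs),
        # Accessions present in both releases whose gene mapping changed.
        "remapped": sorted(a for a in shared if old[a] != new[a]),
    }


# Placeholder identifiers, not real UniProtKB or Ensembl records.
older = {"ACC1": "GENE_A", "ACC2": "GENE_B", "ACC3": None}
newer = {"ACC1": "GENE_A", "ACC2": "GENE_Z", "ACC4": "GENE_D"}

changes = diff_releases(older, newer)
print(changes)
```

A DCC that built its files against the older release would hit validation errors for every accession in the "removed" and "remapped" lists, which is why a version update alone narrows the gap without closing it.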

If you can implement suggestions 1 and 2 from the above comment, the severity of the issue will be reduced drastically. We will still have some mapping issues, but they would number a few rather than thousands (the current case), and we can deal with those at our end.

I am not sure if you were able to replicate the errors for the UniProtKB ac to Ensembl gene mapping using the files above. Let us know if you need any help.

@jonathancrabtree @mgiglio99 @RLC-DCPPC @ReneRanzinger