glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Update Usability Domain for Protein Datasets #1091

Closed ubhuiyan closed 6 months ago

ubhuiyan commented 7 months ago

The usability domain for the pig data contains either outdated information or none at all, and needs an update.

kmartinez834 commented 7 months ago

Many of the existing protein datasets (in all species) say "The dataset is derived from 2019-09 UniProt release." in their Usability Domain. I'm assuming this is outdated, but can you confirm? If so, should we remove this sentence since we won't be updating this section with every release?

The same issue applies to RefSeq: the Usability Domains say 2019.

kmartinez834 commented 7 months ago

Update according to the following template: "The {dataset name} dataset contains {organism scientific name} [taxid:{taxid}] UniProtKB canonical accessions...."

And delete the sentence: "The dataset is derived from 2019-09 UniProtKB release."

GLY_000003✔ GLY_000007✔ GLY_000012✔ GLY_000013✔ GLY_000031✔ GLY_000032✔ GLY_000033✔ GLY_000035✔ GLY_000036✔ GLY_000053✔ GLY_000054✔ GLY_000081✔ GLY_000082✔ GLY_000087✔ GLY_000088✔ GLY_000090✔ GLY_000091✔ GLY_000093✔ GLY_000094✔ GLY_000095✔ GLY_000096✔ GLY_000097✔ GLY_000098✔ GLY_000099✔ GLY_000100✔ GLY_000101✔ GLY_000102✔ GLY_000103✔ GLY_000104✔ GLY_000105✔ GLY_000106✔ GLY_000107✔ GLY_000108✔ GLY_000109✔ GLY_000110✔ GLY_000112✔ GLY_000113✔ GLY_000114✔ GLY_000115✔ GLY_000116✔ GLY_000117✔ GLY_000118✔ GLY_000119✔ GLY_000120✔ GLY_000121✔ GLY_000122✔ GLY_000123✔ GLY_000124✔ GLY_000125✔ GLY_000126✔ GLY_000127✔ GLY_000128✔ GLY_000129✔ GLY_000131✔ GLY_000132✔ GLY_000135✔ GLY_000136✔ GLY_000222✔ GLY_000223✔ GLY_000228✔ GLY_000229✔ GLY_000232✔ GLY_000233✔ GLY_000234✔ GLY_000236✔ GLY_000241✔ GLY_000242✔ GLY_000243✔ GLY_000244✔ GLY_000245✔ GLY_000250✔ GLY_000252✔ GLY_000253✔ GLY_000254✔ GLY_000255✔ GLY_000257✔ GLY_000259✔ GLY_000260✔ GLY_000261✔ GLY_000262✔ GLY_000263✔ GLY_000266✔ GLY_000267✔ GLY_000270✔ GLY_000273✔ GLY_000274✔ GLY_000276✔ GLY_000278✔ GLY_000310✔ GLY_000313✔ GLY_000314✔ GLY_000315✔ GLY_000319✔ GLY_000320✔ GLY_000321✔ GLY_000329✔ GLY_000335✔ GLY_000348✔ GLY_000349✔ GLY_000350✔ GLY_000351✔ GLY_000352✔ GLY_000353✔ GLY_000354✔ GLY_000356✔ GLY_000357✔ GLY_000358✔ GLY_000359✔ GLY_000360✔ GLY_000361✔ GLY_000362✔ GLY_000368✔ GLY_000369✔ GLY_000370✔ GLY_000371 ✔ GLY_000372✔ GLY_000373✔ GLY_000374✔ GLY_000375✔ GLY_000376✔ GLY_000377✔ GLY_000378✔ GLY_000379✔ GLY_000380✔ GLY_000381✔ GLY_000382✔ GLY_000383✔ GLY_000384✔ GLY_000385✔ GLY_000386✔ GLY_000390✔ GLY_000391✔ GLY_000395✔ GLY_000396✔ GLY_000397✔ GLY_000398✔ GLY_000399✔ GLY_000400✔ GLY_000401✔ GLY_000457✔ GLY_000458✔ GLY_000464✔ GLY_000466✔ GLY_000468✔ GLY_000469✔ GLY_000523✔ GLY_000524✔ GLY_000530✔ GLY_000597✔ GLY_000598✔ GLY_000599✔ GLY_000640✔ GLY_000646✔ GLY_000742✔ GLY_000751✔ GLY_000759✔ GLY_000829✔ GLY_000830✔ GLY_000831✔ GLY_000835✔ GLY_000838✔ GLY_000840✔ GLY_000844✔ GLY_000846✔ GLY_000848✔ GLY_000856✔ 
GLY_000857✔ GLY_000858✔ GLY_000860✔ GLY_000862✔ GLY_000863✔ GLY_000864✔ GLY_000865✔ GLY_000867✔ GLY_000869✔ GLY_000870✔ GLY_000871✔ GLY_000872✔ GLY_000873✔ GLY_000874✔ GLY_000875✔ GLY_000876✔ GLY_000877✔ GLY_000878✔ GLY_000879✔ GLY_000880✔ GLY_000884✔ GLY_000894✔ GLY_000906✔ GLY_000907✔ GLY_000908✔ GLY_000909✔ GLY_000910✔ GLY_000911✔ GLY_000912✔ GLY_000914✔ GLY_000916✔ GLY_000917✔ GLY_000919✔ GLY_000937✔ GLY_000941✔ GLY_000942✔ GLY_000944✔ GLY_000945✔ GLY_000947✔ GLY_000948✔ GLY_000949✔ GLY_000951✔ GLY_000952✔

Update according to the following template: "The {dataset name} dataset contains {organism scientific name} [taxid:{taxid}] UniProtKB canonical accessions...."

And delete the sentence: "The dataset is derived from NCBI RefSeq Release 96, September 9, 2019." Also, if you see this sentence, delete it: "The RefSeq accessions are The dataset is derived from NCBI RefSeq Release 96, September 9, 2019"

GLY_000021✔ GLY_000022✔ GLY_000133✔ GLY_000134✔ GLY_000235✔ GLY_000249✔ GLY_000256✔ GLY_000264✔ GLY_000275✔ GLY_000387✔ GLY_000388✔ GLY_000389✔ GLY_000392✔ GLY_000393✔ GLY_000394✔ GLY_000405✔ GLY_000406✔ GLY_000407✔ GLY_000437✔ GLY_000554✔ GLY_000613✔ GLY_000614✔ GLY_000615✔ GLY_000645✔ GLY_000758✔ GLY_000839✔ GLY_000845✔ GLY_000852✔ GLY_000904✔ GLY_000913✔ GLY_000918✔ GLY_000932✔ GLY_000946✔
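
The two steps described in this comment (delete the outdated release sentence, then start the Usability Domain with the template) can be sketched in Python. This is only an illustration, not the script actually used for these datasets: the function names, the underscore placeholder names, and the example dataset name are hypothetical, and the template is rendered only up to the point where the comment elides it.

```python
import re

# Sentences the comments above ask to delete, quoted verbatim.
# The garbled RefSeq variant comes first so it is removed as a whole
# before its shorter substring is matched.
OUTDATED_SENTENCES = [
    "The RefSeq accessions are The dataset is derived from NCBI RefSeq Release 96, September 9, 2019",
    "The dataset is derived from NCBI RefSeq Release 96, September 9, 2019.",
    "The dataset is derived from 2019-09 UniProtKB release.",
    "The dataset is derived from 2019-09 UniProt release.",
]

# Opening of the template from the comment; the rest is elided there
# ("....") and is deliberately not reconstructed here.
TEMPLATE_OPENING = (
    "The {dataset_name} dataset contains {organism} "
    "[taxid:{taxid}] UniProtKB canonical accessions"
)


def strip_outdated(text: str) -> str:
    """Remove the outdated release sentences and tidy leftover whitespace."""
    for sentence in OUTDATED_SENTENCES:
        # Match the sentence with or without its trailing period.
        pattern = re.escape(sentence.rstrip(".")) + r"\.?"
        text = re.sub(pattern, "", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def template_opening(dataset_name: str, organism: str, taxid: int) -> str:
    """Fill the opening of the template with the dataset's values."""
    return TEMPLATE_OPENING.format(
        dataset_name=dataset_name, organism=organism, taxid=taxid
    )


# Hypothetical usage; "example" is a placeholder dataset name.
old_text = "Some description. The dataset is derived from 2019-09 UniProtKB release."
print(strip_outdated(old_text))
print(template_opening("example", "Sus scrofa", 9823))
```

Listing the sentences verbatim rather than matching any sentence that mentions 2019 keeps the deletion limited to the wording called out in this thread.
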

Luke-Johnson-5 commented 7 months ago

@kmartinez834

For HCV1a and HCV1b, what scientific name should I use?

I know these are not scientific names, but would these work?

HCV1a - Hepatitis C virus (genotype 1a, isolate H)
HCV1b - Hepatitis C virus (genotype 1b, isolate Japanese)

kmartinez834 commented 7 months ago

Yes, please use the "long_name" from the file generated/misc/species_info.csv:

tax_id,short_name,long_name,common_name,nt_file,is_reference,sort_order
9606,human,Homo sapiens,Human,uniprot-proteome-homo-sapiens.nt,yes,1
10090,mouse,Mus musculus,Mouse,uniprot-proteome-mus-musculus.nt,yes,2
10116,rat,Rattus norvegicus,Rat,uniprot-proteome-rattus-norvegicus.nt,yes,3
63746,hcv1a,Hepatitis C virus (isolate H),HCV-H,uniprot-proteome-hepatitis-c-virus-1a.nt,yes,4
11116,hcv1b,Hepatitis C virus (isolate Japanese),HCV-Japanese,uniprot-proteome-hepatitis-c-virus-1b.nt,yes,5
694009,sarscov1,Severe acute respiratory syndrome-related coronavirus,HCoV-SARS,uniprot-proteome-sars-coronavirus.nt,yes,6
2697049,sarscov2,Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,uniprot-proteome-sars-cov-2.nt,yes,7
7227,fruitfly,Drosophila melanogaster,Fruit fly,uniprot-proteome-drosophila-melanogaster.nt,yes,8
559292,yeast,Saccharomyces cerevisiae S288C,Yeast,uniprot-proteome-saccharomyces-cerevisiae.nt,yes,9
44689,dicty,Dictyostelium discoideum,Cellular slime molds,uniprot-proteome-dictyostelium-discoideum.nt,yes,10
9823,pig,Sus scrofa,Pig,uniprot-proteome-sus_scrofa.nt,yes,11

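
Looking up the "long_name" for a species can be sketched with Python's standard `csv` module. This is a minimal illustration under stated assumptions: the helper name `load_species` is hypothetical, and the sample below embeds a few rows copied from the listing above instead of opening generated/misc/species_info.csv directly.

```python
import csv
import io

# A few rows copied from the species_info.csv listing in this comment.
SAMPLE = """tax_id,short_name,long_name,common_name,nt_file,is_reference,sort_order
9606,human,Homo sapiens,Human,uniprot-proteome-homo-sapiens.nt,yes,1
63746,hcv1a,Hepatitis C virus (isolate H),HCV-H,uniprot-proteome-hepatitis-c-virus-1a.nt,yes,4
11116,hcv1b,Hepatitis C virus (isolate Japanese),HCV-Japanese,uniprot-proteome-hepatitis-c-virus-1b.nt,yes,5
9823,pig,Sus scrofa,Pig,uniprot-proteome-sus_scrofa.nt,yes,11
"""


def load_species(handle):
    """Index the CSV rows by short_name (hypothetical helper)."""
    return {row["short_name"]: row for row in csv.DictReader(handle)}


species = load_species(io.StringIO(SAMPLE))

# The long_name and tax_id are the values the template asks for.
for short_name in ("hcv1a", "hcv1b"):
    row = species[short_name]
    print(f"{short_name}: {row['long_name']} [taxid:{row['tax_id']}]")
```

To read the real file, the same helper could be given an open file handle instead of the in-memory sample.
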
CyrusAY commented 7 months ago

Task completed. @kmartinez834, please close the ticket.