glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

2.6 Dataset Plan #1450

Closed ubhuiyan closed 11 minutes ago

ubhuiyan commented 2 weeks ago

New Datasets / Requires new object in dataset-masterlist.json:

Protein Datasets (check reviewed list for chicken)

arabidopsis_protein_allsequences.fasta
arabidopsis_protein_altnames.csv
arabidopsis_protein_binary_interactions.csv
arabidopsis_protein_canonicalsequences.fasta
arabidopsis_protein_citations_reactome.csv
arabidopsis_protein_citations_refseq.csv
arabidopsis_protein_citations_uniprotkb.csv
arabidopsis_protein_enzyme_annotation_uniprotkb.csv
arabidopsis_protein_function_refseq.csv
arabidopsis_protein_function_uniprotkb.csv
arabidopsis_protein_genelocus.csv
arabidopsis_protein_genenames_refseq.csv
arabidopsis_protein_genenames_uniprotkb.csv
arabidopsis_protein_glycohydrolase.csv
arabidopsis_protein_glycosylation_motifs.csv
arabidopsis_protein_glycosyltransferase.csv
arabidopsis_protein_go_annotation.csv
arabidopsis_protein_info_refseq.csv
arabidopsis_protein_info_uniprotkb.csv
arabidopsis_protein_masterlist.csv
arabidopsis_protein_ncbi_linkouts.csv
arabidopsis_protein_ntdata.nt
arabidopsis_protein_participants_reactome.csv
arabidopsis_protein_participants_rhea.csv
arabidopsis_protein_pathways_reactome.csv
arabidopsis_protein_pro_annotation.csv
arabidopsis_protein_proteinnames_refseq.csv
arabidopsis_protein_ptm_annotation_uniprotkb.csv
arabidopsis_protein_reactions_reactome.csv
arabidopsis_protein_reactions_rhea.csv
arabidopsis_protein_recnames.csv
arabidopsis_protein_sequenceinfo.csv
arabidopsis_protein_signalp_annotation.csv
arabidopsis_protein_signalp_cleavedsequences.fasta
arabidopsis_protein_signalp_fullsequences.fasta
arabidopsis_protein_signalp_peptidesequences.fasta
arabidopsis_protein_site_annotation_uniprotkb.csv
arabidopsis_protein_submittednames.csv
arabidopsis_protein_transcriptlocus.csv
arabidopsis_protein_xref_bgee.csv
arabidopsis_protein_xref_brenda.csv
arabidopsis_protein_xref_cazy.csv
arabidopsis_protein_xref_cdd.csv
arabidopsis_protein_xref_chembl.csv
arabidopsis_protein_xref_geneid.csv
arabidopsis_protein_xref_glyconnect.csv
arabidopsis_protein_xref_intact.csv
arabidopsis_protein_xref_interpro.csv
arabidopsis_protein_xref_kegg.csv
arabidopsis_protein_xref_oglcnac_atlas.csv
arabidopsis_protein_xref_oma.csv
arabidopsis_protein_xref_orthodb.csv
arabidopsis_protein_xref_panther.csv
arabidopsis_protein_xref_pdb.csv
arabidopsis_protein_xref_pfam.csv
arabidopsis_protein_xref_pro.csv
arabidopsis_protein_xref_reactome.csv
arabidopsis_protein_xref_refseq.csv
arabidopsis_protein_xref_rhea.csv
arabidopsis_protein_xref_uniprotkb.csv
arabidopsis_protein_isoform_alignments.aln
arabidopsis_protein_xref_oglcnac_mcw.csv
chicken_protein_xref_oglcnac_mcw.csv
human_protein_disease_alliance_genome.csv

Proteoform Datasets (same thing)

arabidopsis_proteoform_citations_glycation_sites_uniprotkb.csv
arabidopsis_proteoform_citations_glycosylation_sites_oglcnac_atlas.csv
arabidopsis_proteoform_citations_glycosylation_sites_oglcnac_mcw.csv
arabidopsis_proteoform_citations_glycosylation_sites_uniprotkb.csv
arabidopsis_proteoform_citations_glycosylation_sites_literature_mining.csv
arabidopsis_proteoform_citations_glycosylation_sites_glyconnect.csv
arabidopsis_proteoform_citations_phosphorylation_sites_iptmnet.csv
arabidopsis_proteoform_citations_phosphorylation_sites_uniprotkb.csv
arabidopsis_proteoform_glycosylation_sites_glyconnect.csv
arabidopsis_proteoform_glycosylation_sites_literature_mining.csv
arabidopsis_proteoform_glycosylation_sites_literature_mining_manually_verified.csv
arabidopsis_proteoform_glycosylation_sites_oglcnac_atlas.csv
arabidopsis_proteoform_glycosylation_sites_oglcnac_mcw.csv
arabidopsis_proteoform_glycosylation_sites_pdb.csv
arabidopsis_proteoform_glycosylation_sites_uniprotkb.csv
arabidopsis_proteoform_phosphorylation_sites_iptmnet.csv
arabidopsis_proteoform_phosphorylation_sites_uniprotkb.csv
chicken_proteoform_citations_glycosylation_sites_glyconnect.csv

Changing Datasets

katewarner commented 2 weeks ago

@jeet-vora and @kmartinez834 please review the 2.6 Dataset Plan

kmartinez834 commented 1 week ago
katewarner commented 1 week ago

@kmartinez834 @ubhuiyan I made Karina's corrections

jeet-vora commented 1 week ago

@katewarner @ubhuiyan FYI: the source files do not have data for the new organisms, so the files below would be empty. In the future, such files should not be part of the plan.

arabidopsis_proteoform_citations_glycosylation_sites_literature_mining.csv
arabidopsis_proteoform_glycosylation_sites_literature_mining.csv
arabidopsis_proteoform_glycosylation_sites_literature_mining_manually_verified.csv
arabidopsis_proteoform_glycosylation_sites_pdb.csv
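
One way to catch such files before they enter a plan is to check whether a candidate dataset has any data rows at all. The following is only a minimal sketch, assuming the datasets are CSV files with a single header row sitting in a local directory; the function names `is_effectively_empty` and `empty_datasets` are hypothetical and not part of any GlyGen tooling.

```python
import csv
from pathlib import Path


def is_effectively_empty(path: Path) -> bool:
    """Return True if a CSV is missing, zero bytes, or has a header but no data rows."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open(newline="") as handle:
        rows = csv.reader(handle)
        next(rows, None)  # skip the header row (assumed to be present)
        # Blank lines parse as empty lists, so they do not count as data.
        return not any(row for row in rows)


def empty_datasets(directory: Path, names: list[str]) -> list[str]:
    """Return the planned file names that would be empty, in the given order."""
    return [name for name in names if is_effectively_empty(directory / name)]
```

Running `empty_datasets` over the planned file names would list the ones to drop from the plan.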

Regarding the plan item: Changed name of _protein_xref_oglcnac_db.csv to _protein_xref_oglcnac_mcw.csv (all other datasets for this resource use "oglcnac_mcw" - using "db" was unclear if this was associated with atlas or mcw oglcnac)

I would not change it to _protein_xref_oglcnac_mcw.csv. The name _protein_xref_oglcnac_db.csv was chosen on purpose, because the resource has a portal called The O-GlcNAc DB. Instead, the glycosylation data files should be renamed from _oglcnac_mcw.csv to _oglcnac_db.csv. Robel would need to make the changes on his end.
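
The suggested rename can be sketched as a simple filename mapping. This is an illustration only, assuming the `oglcnac_mcw` token always appears as a full underscore-delimited segment of the file name; the helper `to_db_name` is hypothetical and says nothing about how the actual pipeline names its outputs.

```python
import re

# Match "_oglcnac_mcw" only when it is a complete name segment,
# i.e. followed by ".", "_", or the end of the string.
_MCW_SEGMENT = re.compile(r"_oglcnac_mcw(?=\.|_|$)")


def to_db_name(filename: str) -> str:
    """Map an *_oglcnac_mcw* dataset name to the *_oglcnac_db* form; leave other names unchanged."""
    return _MCW_SEGMENT.sub("_oglcnac_db", filename)


planned = [
    "arabidopsis_proteoform_glycosylation_sites_oglcnac_mcw.csv",
    "arabidopsis_proteoform_glycosylation_sites_oglcnac_atlas.csv",
]
renamed = [to_db_name(name) for name in planned]
```

Names that refer to the atlas resource pass through untouched, so the two O-GlcNAc sources stay distinguishable.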