glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

need to define sort keys for columns of expression_tissue and expression_cell_line table. #614

Closed: sujeetvkulkarni closed this issue 8 months ago

sujeetvkulkarni commented 9 months ago

We need to define sort keys for the columns of the expression_tissue and expression_cell_line tables. Both the glycan detail API and the table pagination API should support these sort keys.

sujeetvkulkarni commented 9 months ago

Also,

API : https://api.tst.glygen.org/pagination/page/

For table expression_tissue, if we sort by the keys "start_pos", "uniprot_canonical_ac", tissue -> "namespace", or tissue -> "id", we get an empty "results": [] array.

{
  "record_type": "glycan",
  "table_id": "expression_tissue",
  "record_id": "G17689DH",
  "offset": 1,
  "limit": 20,
  "order": "desc",
  "sort": "uniprot_canonical_ac"
}

{
    "query": {
        "record_type": "glycan",
        "table_id": "expression_tissue",
        "record_id": "G17689DH",
        "offset": 1,
        "limit": 20,
        "order": "desc",
        "sort": "uniprot_canonical_ac"
    },
    "results": []
}

For table expression_cell_line, if we sort by the keys "start_pos", "uniprot_canonical_ac", cell_line -> "namespace", or cell_line -> "id", we get an empty "results": [] array.

{
  "record_type": "glycan",
  "table_id": "expression_cell_line",
  "record_id": "G17689DH",
  "offset": 1,
  "limit": 20,
  "order": "desc",
  "sort": "uniprot_canonical_ac"
}

{
    "query": {
        "record_type": "glycan",
        "table_id": "expression_cell_line",
        "record_id": "G17689DH",
        "offset": 1,
        "limit": 20,
        "order": "desc",
        "sort": "uniprot_canonical_ac"
    },
    "results": []
}
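
For anyone re-running this check, here is a minimal sketch of how the request bodies above could be generated per sort key and how the empty pages could be collected. The helper names `build_page_query` and `empty_sort_keys` are hypothetical, and the sketch deliberately leaves out the HTTP call, since the thread does not show how the request is submitted.

```python
import json


def build_page_query(table_id, record_id, sort, order="desc", offset=1, limit=20):
    """Build a query dict with the same fields as the request bodies above.

    Hypothetical helper: the field names are copied from the thread,
    the function itself is not part of any GlyGen client.
    """
    return {
        "record_type": "glycan",
        "table_id": table_id,
        "record_id": record_id,
        "offset": offset,
        "limit": limit,
        "order": order,
        "sort": sort,
    }


def empty_sort_keys(responses):
    """Given {sort_key: response_dict}, return the sort keys whose page was empty.

    A response counts as empty when "results" is missing or an empty list,
    which is the symptom reported in this ticket.
    """
    return [key for key, resp in responses.items() if not resp.get("results")]


if __name__ == "__main__":
    query = build_page_query("expression_tissue", "G17689DH", "uniprot_canonical_ac")
    print(json.dumps(query, indent=2))
```

Collecting one response per sort key and passing the mapping to `empty_sort_keys` gives a quick list of the keys that still need backend support.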
sujeetvkulkarni commented 9 months ago

The backend needs to return results even when the user sorts by a column that has no values in it.
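
One way to get that behaviour, sketched under assumptions: sort the rows that have a value for the chosen key and append the rows that lack it, instead of dropping them. This is not the GlyGen backend code; `get_path` and `sort_keep_missing` are hypothetical names, and the sketch assumes all present values for a given key share one comparable type.

```python
def get_path(record, dotted_key):
    """Follow a dotted key such as 'tissue.id' through nested dicts.

    Returns None when any part of the path is absent.
    """
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def sort_keep_missing(records, sort_key, order="asc"):
    """Sort records by sort_key, keeping records that lack the key at the end."""
    present = [r for r in records if get_path(r, sort_key) is not None]
    missing = [r for r in records if get_path(r, sort_key) is None]
    present.sort(key=lambda r: get_path(r, sort_key), reverse=(order == "desc"))
    return present + missing
```

With this approach a sort on a column that is empty for every row simply returns the rows in their original order, so the page is never empty just because of the sort key.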

rykahsay commented 9 months ago

Please try now

sujeetvkulkarni commented 9 months ago

API: https://api.tst.glygen.org/glycan/detail/G80966KZ

For table id = expression_tissue, the sort fields "start_pos" and "uniprot_canonical_ac" are working. Can you please tell me what sort keys to use for tissue -> "namespace" and tissue -> "id"?

expression: [
...
{
  "uniprot_canonical_ac": "P01024-1",
  "start_pos": 85,
  "end_pos": 85,
  "residue": "Asn",
  "category": "tissue",
  "tissue": {
    "name": "milk",
    "namespace": "UBERON",
    "id": "0001913",
    "url": "http://purl.obolibrary.org/obo/UBERON_0001913"
  },
  "evidence": [
    {
      "id": "2039",
      "database": "GlyConnect",
      "url": "https://glyconnect.expasy.org/browser/structures/2039"
    },
    {
      "id": "110",
      "database": "GlyConnect",
      "url": "https://glyconnect.expasy.org/browser/proteins/110"
    },
    {
      "id": "32125861",
      "database": "PubMed",
      "url": "https://glygen.org/publication/PubMed/32125861"
    }
  ]
}
...
]
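
As an illustration of what candidate sort keys for nested fields might look like, the sketch below flattens one expression record into dotted names. The dotted convention is an assumption at this point in the thread (a later comment reports that tissue.* keys work), and `dotted_keys` is a hypothetical helper, not part of the API.

```python
def dotted_keys(record, prefix=""):
    """List scalar fields of a record as dotted paths, e.g. 'tissue.namespace'.

    Nested dicts are descended into; list-valued fields such as "evidence"
    are skipped because a list does not give one value to sort on.
    """
    keys = []
    for name, value in record.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            keys.extend(dotted_keys(value, prefix=path + "."))
        elif not isinstance(value, list):
            keys.append(path)
    return keys


if __name__ == "__main__":
    record = {
        "uniprot_canonical_ac": "P01024-1",
        "start_pos": 85,
        "tissue": {"name": "milk", "namespace": "UBERON", "id": "0001913"},
        "evidence": [{"id": "2039", "database": "GlyConnect"}],
    }
    print(dotted_keys(record))
```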

Can you please also give a glycan id where cell_line information is present ("category": "cell_line" in the expression: [] array), and let us know what sort keys to use for cell_line -> "namespace" and cell_line -> "id"?

Also, the backend is sending an expression_tissue: [] array in the https://api.tst.glygen.org/glycan/detail/G80966KZ API response, which the frontend is not using.

rykahsay commented 8 months ago

I couldn't find any glycan with expression in cell_line -- @kmartinez834 can you please verify

sujeetvkulkarni commented 8 months ago

Again getting an empty results array from the pagination API (table id = expression_tissue).

API: https://api.tst.glygen.org/glycan/detail/G80966KZ
API: https://api.tst.glygen.org/pagination/page/

For table id = expression_tissue, the sort fields "start_pos", "uniprot_canonical_ac", and "namespace" return an empty "results": [] array.

kmartinez834 commented 8 months ago

Looks like some of the data isn't making it to the API...

The following proteoform datasets have cell line information associated with glycan ac's:

/data/projects/glygen/generated/datasets/unreviewed/mouse_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/hcv1a_proteoform_glycosylation_sites_literature.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/mouse_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/sarscov1_proteoform_glycosylation_sites_literature.csv
/data/projects/glygen/generated/datasets/unreviewed/fruitfly_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/sarscov2_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/sarscov2_proteoform_glycosylation_sites_unicarbkb.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/fruitfly_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_unicarbkb.csv

And these have tissue information w/ glycan ac's:

/data/projects/glygen/generated/datasets/unreviewed/mouse_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_unicarbkb.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/mouse_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/rat_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv
/data/projects/glygen/generated/datasets/unreviewed/fruitfly_proteoform_glycosylation_sites_oglcnac_atlas.csv
/data/projects/glygen/generated/datasets/unreviewed/sarscov2_proteoform_glycosylation_sites_unicarbkb.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/fruitfly_proteoform_glycosylation_sites_glyconnect.csv
/data/projects/glygen/generated/datasets/unreviewed/human_proteoform_glycosylation_sites_unicarbkb.csv
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","uniprotkb_ac","evidence","genbank_accession_nucleotide_from_paper","genbank_accession_nucleotide_version_from_paper","genbank_accession_protein_version","protein_name_genbank","protein_rec_name_uniprot","tax_id_uniprotkb_ac","organism","strain_uniprotkb_ac","glycosylation_site_in_paper","link_sugar","glycan_composition_in_paper","glycan_composition_format_1","glycan_composition_format_2","core_type","glycoCT","oxford_notation","abundance_from_paper","glycopeptide_sequence","abundance_normalized","predominant_glycan_species","biological_source","source_cell_line_cellosaurus_name","source_cell_line_cellosaurus_id","analyte","mass_glycopeptide","chromatography_glycopeptide","analyzer_glycopeptide","sample_preparation_glycopeptide","glycosidase_treatment_glycopeptide","lectin_characterisation_glycopeptide","fragmentation_glycopeptide","ionization_glycopeptide","notes","entry_version_uniprot","entry_modification_date_refseq","n_sequon","n_sequon_type","start_pos","end_pos","start_aa","end_aa","site_seq"
"P27958-1","448","Asn","G92050GC","N-linked","protein_xref_pubmed","18187336","protein_xref_glygen_ds","GLY_000335","","18187336","AF009606","AF009606.1","AAB66324.1","polyprotein [Hepatitis C virus subtype 1a]","Genome polyprotein","63746","Hepacivirus C","Hepatitis C virus subtype 1a (Isolate H)","66","GlcNAc","Man4","Hex4HexNAc2","HexNAc2Hex4dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","High mannose","","M4","0.61","FNSSGCPER","14.59","high mannose glycans","","CHO","CVCL_0213","glycopeptide","2106.763","HPLC","Q-Tof hybrid","Carboxymethylation | tryptic digest","","","CAD","MALDI","The notation ManX, where X ranges from 4 to 9 in the case of the observed tryptic and chymotryptic glycopeptides, indicates that X mannose residues are attached to the chitobiose core (GlcNAc ?(1-4) GlcNAc) | the differences in the number of mannose residues is caused by the slight differences between the E2 glycoprotein batches that were used. The majority of these sites proved to be occupied by high mannose glycans. The relative abundance was determined from deconvoluted mass spectrum over the mass range containing the glycopeptide ions corresponding to glycopeptides with the same amino acid sequence. See glycopeptide sequence reported.","43509","39982","NSS","NXS","448","448","Asn","Asn","N"

https://api.tst.glygen.org/glycan/detail/G92050GC --> missing cell line (CVCL_0213)

"expression": [{
    "uniprot_canonical_ac": "P01830-1",
    "start_pos": 42,
    "end_pos": 42,
    "residue": "Asn",
    "category": "tissue",
    "tissue": {
        "name": "Synaptosomes",
        "namespace": "OMIT",
        "id": "0014437",
        "url": "http://purl.obolibrary.org/obo/OMIT_0014437"
    },
    "evidence": [{
        "id": "P01830",
        "database": "UniCarbKB"
    }, {
        "id": "34106099",
        "database": "PubMed",
        "url": "https://glygen.org/publication/PubMed/34106099"
    }, {
        "id": "10.1039/D0MO00044B",
        "database": "DOI",
        "url": "https://glygen.org/publication/DOI/10.1039/D0MO00044B"
    }]
}, {
    "uniprot_canonical_ac": "P13638-1",
    "start_pos": 118,
    "end_pos": 118,
    "residue": "Asn",
    "category": "tissue",
    "tissue": {
        "name": "Synaptosomes",
        "namespace": "OMIT",
        "id": "0014437",
        "url": "http://purl.obolibrary.org/obo/OMIT_0014437"
    },
    "evidence": [{
        "id": "P13638",
        "database": "UniCarbKB"
    }, {
        "id": "34106099",
        "database": "PubMed",
        "url": "https://glygen.org/publication/PubMed/34106099"
    }, {
        "id": "10.1039/D0MO00044B",
        "database": "DOI",
        "url": "https://glygen.org/publication/DOI/10.1039/D0MO00044B"
    }]
}, {
    "uniprot_canonical_ac": "P45479-1",
    "start_pos": 197,
    "end_pos": 197,
    "residue": "Asn",
    "category": "tissue",
    "tissue": {
        "name": "Synaptosomes",
        "namespace": "OMIT",
        "id": "0014437",
        "url": "http://purl.obolibrary.org/obo/OMIT_0014437"
    },
    "evidence": [{
        "id": "P45479",
        "database": "UniCarbKB"
    }, {
        "id": "34106099",
        "database": "PubMed",
        "url": "https://glygen.org/publication/PubMed/34106099"
    }, {
        "id": "10.1039/D0MO00044B",
        "database": "DOI",
        "url": "https://glygen.org/publication/DOI/10.1039/D0MO00044B"
    }]
}],
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","protein_id","taxonomy_taxonomy_id","taxonomy_species","structure_id","structure_glytoucan_id","structure_glycan_core","structure_glycan_type","composition_format_numeric","composition_format_condensed","composition_format_byonic","composition_mass_monoisotopic","composition_mass","composition_format_glyconnect","composition_glytoucan_id","source_tissue_name","source_tissue_id","source_cell_line_cellosaurus_name","source_cell_line_cellosaurus_id","source_cell_component_id","source_cell_component_go_id","source_cell_component_name","n_sequon","n_sequon_type","start_pos","end_pos","start_aa","end_aa","site_seq"
"","","","G11457RF","O-linked","protein_xref_glyconnect","357","protein_xref_glyconnect","357","357","10116","Rattus norvegicus","2376","G11457RF","Core 2","O-Linked","2,2,0,1,0,0,0,0,0,0,0,0,0,0","H2N2S1","HexNAc(2)Hex(2)NeuAc(1)","1039.3704","1039.948","Hex:2 HexNAc:2 NeuAc:1","","liver","UBERON:0002107","Zajdela-Hepatoma","CVCL_1D00","123","GO_0009986","Cell Surface","","","","","","",""

https://api.tst.glygen.org/glycan/detail/G11457RF --> missing cell line (CVCL_1D00) and tissue (UBERON:0002107)
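
The comparison above (dataset row versus API record) can be scripted. Below is a minimal sketch, assuming the dataset CSVs use the "saccharide" and "source_cell_line_cellosaurus_id" columns shown in the headers above. The function names are hypothetical, and because the thread does not show how the API encodes a cell line id, the caller is expected to pass in the ids already found in the API response.

```python
import csv
import io


def cell_line_ids(csv_text, glycan_ac):
    """Return the sorted Cellosaurus ids listed for one glycan in a dataset CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = set()
    for row in reader:
        if row.get("saccharide") != glycan_ac:
            continue
        cvcl = (row.get("source_cell_line_cellosaurus_id") or "").strip()
        if cvcl:
            ids.add(cvcl)
    return sorted(ids)


def missing_cell_lines(csv_text, glycan_ac, api_ids):
    """Return dataset cell line ids that are absent from api_ids.

    api_ids: ids collected by the caller from the glycan detail response
    (an assumption; the id format used by the API is not shown in the thread).
    """
    return [cvcl for cvcl in cell_line_ids(csv_text, glycan_ac) if cvcl not in api_ids]
```

Running `missing_cell_lines` per dataset file and glycan gives a list of ids like CVCL_0213 that are in the source data but not yet exposed by the API.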

{
    "table_id": "expression",
    "table_stats": [{
        "field": "total",
        "count": 0
    }, {
        "field": "total_sites",
        "count": 0
    }]
}, {
    "table_id": "expression_tissue",
    "table_stats": [{
        "field": "total",
        "count": 0
    }, {
        "field": "total_sites",
        "count": 0
    }]
}, {
    "table_id": "expression_cell_line",
    "table_stats": [{
                "field": "total",
                "count": 0
            }
rykahsay commented 8 months ago

The following work now:

{
  "record_type": "glycan",
  "table_id": "expression_cell_line",
  "record_id": "G92050GC",
  "offset": 1,
  "limit": 20,
  "order": "desc",
  "sort": "uniprot_canonical_ac"
}
{
  "record_type": "glycan",
  "table_id": "expression_tissue",
  "record_id": "G92050GC",
  "offset": 1,
  "limit": 20,
  "order": "desc",
  "sort": "uniprot_canonical_ac"
}
kmartinez834 commented 8 months ago

Pagination looks good

@rykahsay @ReneRanzinger just to confirm, are we intentionally omitting glycan expression records that don't have a known protein and/or site?

sujeetvkulkarni commented 8 months ago

@rykahsay It is working as expected. However, the sort fields in section_stats -> sort_fields for the expression_tissue table start with cell_line, when they should start with tissue. Sorting by tissue.* works fine; only the names in section_stats -> sort_fields for the expression_tissue table need to change.

sujeetvkulkarni commented 8 months ago

> Pagination looks good
>
> @rykahsay @ReneRanzinger just to confirm, are we intentionally omitting glycan expression records that don't have a known protein and/or site?

@kmartinez834 can you please give us an example of what data is getting filtered out? Is it backend or frontend filtering the data? Is the data with no cell_line or tissue info getting filtered out?

kmartinez834 commented 8 months ago

@sujeetvkulkarni @rykahsay backend filtering --> there are entries without protein/site in the datasets that are not appearing in the API:

$ grep "G57321FI.*CVCL" reviewed/fruitfly_proteoform_glycosylation_sites_glyconnect.csv
"","","","G57321FI","O-linked","protein_xref_glyconnect","416","protein_xref_glyconnect","416","416","7227","Drosophila melanogaster","2305","G57321FI","Core 0","O-Linked","0,1,0,0,0,0,0,0,0,0,0,0,0,0","N1","HexNAc(1)","221.09","221.2103","HexNAc:1","","mucosa","UBERON:0000344","67j25D","CVCL_Z425","","","","","","","","","",""

--> Also, these known sites have cell_line CVCL_6642 (HEK293-F), but are not included in the API:

reviewed/sarscov2_proteoform_glycosylation_sites_glyconnect.csv:
"P0DTC2-1","1076","Thr","G57321FI"
"P0DTC2-1","1077","Thr","G57321FI"
"P0DTC2-1","1097","Ser","G57321FI"
"P0DTC2-1","73","Thr","G57321FI"
"P0DTC2-1","76","Thr","G57321FI"
"P0DTC2-1","803","Ser","G57321FI"
rykahsay commented 8 months ago

check now

kmartinez834 commented 8 months ago

Known sites are now included.

Entry without protein_ac is still missing --> should this be included?

sujeetvkulkarni commented 8 months ago

> @rykahsay It is working as expected. However, the sort fields in section_stats -> sort_fields for the expression_tissue table start with cell_line, when they should start with tissue. Sorting by tissue.* works fine; only the names in section_stats -> sort_fields for the expression_tissue table need to change.

@rykahsay this problem still exists on both beta and tst: https://beta-api.glygen.org/glycan/detail/G92050GC and https://api.tst.glygen.org/glycan/detail/G92050GC

  {
      "table_id": "expression_tissue",
      "table_stats": [
        {
          "field": "total",
          "count": 3
        },
        {
          "field": "total_sites",
          "count": 3
        }
      ],
      "sort_fields": [
        "uniprot_canonical_ac",
        "start_pos",
        "end_pos",
        "residue",
        "category",
        "cell_line.name",
        "cell_line.namespace",
        "cell_line.id",
        "cell_line.url",
        "abundance"
      ]
    },
    {
      "table_id": "expression_cell_line",
      "table_stats": [
        {
          "field": "total",
          "count": 59
        },
        {
          "field": "total_sites",
          "count": 59
        }
      ],
      "sort_fields": [
        "uniprot_canonical_ac",
        "start_pos",
        "end_pos",
        "residue",
        "category",
        "cell_line.name",
        "cell_line.namespace",
        "cell_line.id",
        "cell_line.url",
        "abundance"
      ]
    },
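
The mismatch in the stats above can be caught automatically. The sketch below flags dotted sort fields whose prefix does not match the table they are listed under. The table-to-prefix mapping is an assumption inferred from this thread, and `misnamed_sort_fields` is a hypothetical helper.

```python
# Assumed mapping from table id to the nested object its dotted sort
# fields should refer to (inferred from this thread, not from API docs).
EXPECTED_PREFIX = {
    "expression_tissue": "tissue",
    "expression_cell_line": "cell_line",
}


def misnamed_sort_fields(table):
    """Return dotted sort fields whose prefix does not match the table id.

    table: one entry of the stats list, with "table_id" and "sort_fields".
    Tables without an expected prefix are not checked.
    """
    expected = EXPECTED_PREFIX.get(table.get("table_id"))
    if expected is None:
        return []
    return [
        field
        for field in table.get("sort_fields", [])
        if "." in field and field.split(".", 1)[0] != expected
    ]
```

Applied to the expression_tissue entry above, this would report the four cell_line.* names, while the expression_cell_line entry would pass.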
kmartinez834 commented 8 months ago

@sujeetvkulkarni please check whether the protein and position columns can be empty in the glycan detail #expression and publication detail #expression sections

sujeetvkulkarni commented 8 months ago

> @sujeetvkulkarni please check whether the protein and position columns can be empty in the glycan detail #expression and publication detail #expression sections

2ba0db5effe334798a2b0dfa4b8d999af39938c2 - done.

sujeetvkulkarni commented 8 months ago

@rykahsay you can go ahead and do your changes.

rykahsay commented 8 months ago

Please check

image
rykahsay commented 8 months ago

G92050GC

image
kmartinez834 commented 8 months ago

@sujeetvkulkarni --> glycan expression records with an empty protein and site are not working on the publication page:

image

https://api.tst.glygen.org/publication/detail/

"glycan_expression": [
    {
      "glytoucan_ac": "G57321FI",
      "tissue": {
        "name": "embryo",
        "namespace": "UBERON",
        "id": "0000922",
        "url": "http://purl.obolibrary.org/obo/UBERON_0000922"
      }
    }
sujeetvkulkarni commented 8 months ago

https://tst.glygen.org/publication/PubMed/16897177#Expression done, please check.

sujeetvkulkarni commented 8 months ago

52f5d24faa9849e9dfec0117670e12954063d631

kmartinez834 commented 8 months ago

Looks good