glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Associated Protein number discrepancy on super search and motif list page. #199

Closed by sujeetvkulkarni 1 year ago

sujeetvkulkarni commented 1 year ago

If you search with Glycan Motif ID GGM.000001 on super search, you get 173 associated proteins, but the number of proteins shown on the motif list page (https://www.glygen.org/list-of-motifs/) for GGM.000001 is 151. Can you please look into the discrepancy?

rykahsay commented 1 year ago

@ReneRanzinger, @kmartinez834 --> answer to first question: current export file from Nathan shows only 120 motif IDs

cat downloads/glytoucan/current/export/allmotifs.tsv | awk '{print $1}' | grep -v MotifAccession | sort -u | wc
120 120 1320
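The same count can be sketched in Python for anyone who wants to check it without chaining shell tools. This is a minimal sketch, assuming the export has a header row whose first column is `MotifAccession` (as the `grep -v` above suggests) and that columns are whitespace-separated, which is how `awk` splits by default. The function name `count_unique_motif_ids`, the `OtherColumn` header, and the sample rows are made up for illustration and are not taken from the real file.

```python
import io


def count_unique_motif_ids(lines, header="MotifAccession"):
    """Count distinct first-column values, skipping blank lines and the header row."""
    ids = set()
    for line in lines:
        fields = line.split()  # whitespace split, like awk's default
        if not fields or fields[0] == header:
            continue
        ids.add(fields[0])
    return len(ids)


# Made-up sample rows; for the real export, pass an open file handle instead.
sample = io.StringIO(
    "MotifAccession\tOtherColumn\n"
    "GGM.000001\tx\n"
    "GGM.000001\ty\n"
    "GGM.000002\tz\n"
)
print(count_unique_motif_ids(sample))  # 2
```

Run against the real `allmotifs.tsv`, this should agree with the first number printed by `wc` above, provided the file matches the assumptions stated here.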

rykahsay commented 1 year ago

Here is my explanation --

On the supersearch page: GGM.000001 is in 8990 glycans, which glycosylate 199 non-ambiguous sites, that is, known ranges on 88 proteins. In other words, the association of "motif" to "protein" goes through "site", and the site objects we have are only for non-ambiguous sites. The other 85 proteins are connected to the motif through the "motif-glycan-enzyme/protein" path, which gives 88 + 85 = 173.

On the motif list page: GGM.000001 is in 8990 glycans, which glycosylate a total of 151 protein sequences (88 of these proteins have non-ambiguous sites, and the remaining 63 have ambiguous sites with unknown position). This page does not consider proteins that are associated with the 8990 glycans through enzymes (which I will fix soon).
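The two totals can be reproduced from the three counts quoted in this comment. The sketch below assumes the three groups of proteins do not overlap, which the arithmetic in the thread implies but does not state outright; the variable names are illustrative only.

```python
# Counts quoted in the explanation above for GGM.000001.
proteins_with_known_sites = 88    # reached through non-ambiguous site objects
proteins_with_unknown_sites = 63  # glycosylated, but site position unknown
proteins_via_enzyme_path = 85     # motif -> glycan -> enzyme/protein

# Super search: ambiguous sites have no site object, so those proteins are missed,
# while proteins reached through the enzyme path are included.
super_search_total = proteins_with_known_sites + proteins_via_enzyme_path

# Motif list page: counts glycosylated proteins only and ignores the enzyme path.
motif_list_total = proteins_with_known_sites + proteins_with_unknown_sites

print(super_search_total, motif_list_total)  # 173 151
```

Under that assumption, the two pages disagree because each one leaves out a different group: super search drops the 63 proteins with unknown sites, and the motif list drops the 85 enzyme-path proteins.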

In the next version, maybe we can represent ambiguous sites using start_pos=1 and end_pos=sequence_length.
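One way this proposal could look in code is sketched below. It is only an illustration, not the GlyGen data model: `SiteRange`, `to_site_range`, and the `ambiguous` flag are hypothetical names, and using `None` to mark an unknown position is an assumption. The flag is added here so that a widened range can still be told apart from a site that really spans the whole sequence.

```python
from typing import NamedTuple, Optional


class SiteRange(NamedTuple):
    start_pos: int
    end_pos: int
    ambiguous: bool  # hypothetical marker, not part of the proposal above


def to_site_range(
    start_pos: Optional[int], end_pos: Optional[int], sequence_length: int
) -> SiteRange:
    """Return a site range, widening an unknown position to the whole sequence."""
    if start_pos is None or end_pos is None:
        return SiteRange(1, sequence_length, True)
    return SiteRange(start_pos, end_pos, False)


print(to_site_range(None, None, 350))
print(to_site_range(42, 42, 350))
```

With a representation along these lines, every glycosylated protein would have at least one site object, so a motif-to-protein traversal through "site" would no longer skip proteins whose site position is unknown.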

rykahsay commented 1 year ago

For 2.1 release, this is what we have now:

a) GGM.000001 --> associated with 12848 glycans --> associated with 85 enzymes/proteins
b) GGM.000001 --> associated with 12848 glycans --> associated with 150 glycoproteins
b1) GGM.000001 --> associated with 12848 glycans --> associated with 87 glycoproteins with known glycosylation sites
b2) GGM.000001 --> associated with 12848 glycans --> associated with 105 glycoproteins with unknown glycosylation sites

When you do supersearch, what you get is (a) + (b1), which is 85 + 87 = 172, since there is no representation of unknown sites.

image

sujeetvkulkarni commented 1 year ago

@ReneRanzinger @rykahsay Some questions :

  1. 87 (known sites, b1) + 105 (unknown sites, b2) don't add up to the 235 shown on the motif list page (screenshot below, https://beta.glygen.org/list-of-motifs/).
  2. The site-to-protein edge should show b1 (known sites) + b2 (unknown sites) instead of 87.
Screenshot 2023-06-28 at 9 14 54 PM

ReneRanzinger commented 1 year ago

@rykahsay for 2.1 we will change the motif page so that it shows only the number of proteins that are glycosylated with a glycan that carries that motif. Proteins from the enzyme context or the binding context will not be counted.

rykahsay commented 1 year ago

GGM.000001 --> associated with 12848 glycans --> associated with 150 glycoproteins

Out of these 150 proteins:
45 of them bear glycans at "only known sites" (all glycans associated with these proteins are at known sites)
63 of them bear glycans at "only unknown sites" (all glycans associated with these proteins are at unknown sites)
42 of them bear glycans at both "known and unknown sites" (some glycans associated with these proteins are at known sites and others are at unknown sites)
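This breakdown also explains why the b1 and b2 counts given earlier for the 2.1 release (87 and 105) sum to more than 150: the 42 proteins with both kinds of sites are counted in each. A small check with synthetic protein IDs (placeholders sized to the counts above, not real accessions) makes the overlap explicit.

```python
# Synthetic protein IDs sized to match the three groups reported above.
only_known = {f"K{i}" for i in range(45)}
only_unknown = {f"U{i}" for i in range(63)}
both = {f"B{i}" for i in range(42)}

with_known_sites = only_known | both      # b1
with_unknown_sites = only_unknown | both  # b2
all_glycoproteins = with_known_sites | with_unknown_sites

print(len(with_known_sites), len(with_unknown_sites), len(all_glycoproteins))
# 87 105 150

# Inclusion-exclusion: the overlap must be subtracted once.
assert len(all_glycoproteins) == (
    len(with_known_sites) + len(with_unknown_sites) - len(both)
)
```

Adding b1 and b2 directly gives 192, which overcounts the 150 glycoproteins by exactly the 42 proteins that fall in both groups.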

rykahsay commented 1 year ago

Fixed on beta

image