glycans with same accession number shown twice

kiyoko commented 8 years ago

When I sort by largest mass, the top two structures are the same: G06216SX

[image: 2016-06-30 00 09 01]

aokinobu commented 7 years ago

wurcsrdf is using a different glycosequence URI.

This will be resolved in https://github.com/glytoucan/glytoucan.github.io/issues/55

PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?Seq
?gseq
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s a glycan:saccharide .
  ?s glytoucan:has_primary_id "G06216SX" .
  ?s glycan:has_glycosequence ?gseq .
  ?gseq glycan:has_sequence ?Seq .
  ?gseq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
}

aokinobu commented 7 years ago

A quick way to fix this would be to make sure the results are distinct and to remove the glycosequence URI from the SELECT clause.

If you comment out the above ?gseq it will return one row. @shinmachi does this make sense?
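As a sketch of that suggestion, here is the same query with ?gseq dropped from the projection. It still binds ?gseq inside WHERE for matching, but DISTINCT now only compares ?Seq, so rows that differ only in the glycosequence URI should collapse. This has not been re-run against the endpoint here, so treat the one-row result as expected rather than confirmed.

```sparql
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# ?gseq is used only for matching and is no longer projected,
# so DISTINCT applies to the sequence string alone.
SELECT DISTINCT ?Seq
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s a glycan:saccharide .
  ?s glytoucan:has_primary_id "G06216SX" .
  ?s glycan:has_glycosequence ?gseq .
  ?gseq glycan:has_sequence ?Seq .
  ?gseq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
}
```

Note that this only helps when the two glycosequence URIs carry an identical sequence string; if the strings themselves differ, two rows will still come back.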

aokinobu commented 7 years ago

Reviewed the SELECT for the main_list stanza: it does not select the glycosequence URI.

shinmachi commented 7 years ago

The main_list stanza SPARQL query follows, with the value G06216SX added.

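# rdfs: is used below; declared explicitly in case the endpoint does not predefine it
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>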
PREFIX glycan:  <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan:  <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT ?AccessionNumber ?WURCSLabel ?GlycoCT ?Mass ?MassLabel ?Contributor ?ContributionTime ?MotifNames
WHERE{
  {
    SELECT DISTINCT ?AccessionNumber ?WURCSLabel ?GlycoCT ?Mass ?MassLabel ?Contributor ?ContributionTime ((GROUP_CONCAT (DISTINCT ?MotifName, ', ')) as ?MotifNames)
      FROM <http://rdf.glytoucan.org/core>
      FROM NAMED <http://rdf.glytoucan.org/mass>
      FROM <http://rdf.glytoucan.org/sequence/wurcs>
      FROM <http://rdf.glytoucan.org/sequence/glycoct>
      FROM <http://rdf.glytoucan.org/users>
      FROM <http://rdf.glytoucan.org/ms/carbbank>
      FROM <http://rdf.glytoucan.org/motif>
    WHERE {
      # repository RDF
        VALUES ?AccessionNumber {"G06216SX"}
        # Accession Number
        ?glycan glytoucan:has_primary_id ?AccessionNumber .
        # WURCS
        OPTIONAL{
          ?glycan glycan:has_glycosequence ?wcsSeq .
          ?wcsSeq rdfs:label ?WURCSLabel .
          ?wcsSeq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
        }
        # GlycoCT
        OPTIONAL{
          ?glycan glycan:has_glycosequence ?gctSeq .
          ?gctSeq glycan:has_sequence ?GlycoCT .
          ?gctSeq glycan:in_carbohydrate_format glycan:carbohydrate_format_glycoct .
        }
        # Mass
        # a repeat structure does not have a mass value
        OPTIONAL{
        GRAPH <http://rdf.glytoucan.org/mass>{
          ?glycan glytoucan:has_derivatized_mass ?dmass .
          ?dmass rdfs:label ?MassLabel .
          OPTIONAL{
            ?dmass glytoucan:has_mass ?Mass .
          }
        }}
        # Motif
        OPTIONAL{
          ?glycan glycan:has_motif ?motif .
          ?motif rdfs:label ?MotifName .
        }
        # Contributor
        OPTIONAL{
          ?glycan glycan:has_resource_entry ?res .
          ?res glytoucan:date_registered ?ContributionTime ;
               glytoucan:contributor ?c .
          ?c foaf:name ?Contributor .
        }
    }
    ORDER BY ?AccessionNumber
  }
}
OFFSET 20
LIMIT 20

shinmachi commented 7 years ago

Two types of WURCS and mass label patterns are causing the duplication.

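[image: 2016-08-18 17 24 01] https://cloud.githubusercontent.com/assets/4487545/17766971/daf9d1b2-6568-11e6-98f2-d1fcdd346207.png

[image: 2016-08-18 17 28 26] https://cloud.githubusercontent.com/assets/4487545/17767089/5e6ca2cc-6569-11e6-8f7c-4fc659e5a066.png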
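To see the two patterns without the other columns, a reduced query along these lines may help. It is a cut-down sketch of the main_list query above, keeping only the WURCS and mass label parts; the graph names and predicates are taken from that query, and the output has not been verified here.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# List every WURCS label / mass label combination for one accession number.
# More than one row means these OPTIONAL blocks are what multiplies the result.
SELECT DISTINCT ?WURCSLabel ?MassLabel
FROM <http://rdf.glytoucan.org/core>
FROM NAMED <http://rdf.glytoucan.org/mass>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?glycan glytoucan:has_primary_id "G06216SX" .
  OPTIONAL {
    ?glycan glycan:has_glycosequence ?wcsSeq .
    ?wcsSeq rdfs:label ?WURCSLabel .
    ?wcsSeq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
  }
  OPTIONAL {
    GRAPH <http://rdf.glytoucan.org/mass> {
      ?glycan glytoucan:has_derivatized_mass ?dmass .
      ?dmass rdfs:label ?MassLabel .
    }
  }
}
```

Because the two OPTIONAL blocks are independent, each WURCS label is paired with each mass label, so two of each could yield up to four rows before the other columns are even considered.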

aokinobu commented 7 years ago

OK, this turns out to be another reason to prioritize https://github.com/glytoucan/glytoucan.github.io/issues/55

However, the bottleneck is that we have to redesign the registration workflow, which will require a lot more work. Flagged as Help wanted: we need a resource to manage this analysis and development.

When the registration redesign is complete, it will be possible to easily wipe out the WURCS enrichment data (separate from the WURCS core data, i.e. the sequence) and re-run the enrichment batch.

Currently we don't have this infrastructure in place.


MasaakiMatsubara commented 7 years ago

I have the same opinion as nobu.

There are two WURCS strings, but they represent the same structure. This problem is related to the expansion of repeating units that have a known repeat count.

In WURCS 2.0, repeating units with a known repeat count (and no range) must be expanded and the repeat information removed. The expansion method is implemented as the WURCS normalization system in WURCSFramework and is now available. However, the unexpanded (non-standard) WURCS is also stored. The two mass entries exist for the same reason: the mass of a WURCS containing a repeating unit cannot be calculated, even if the repeat count is known and has no range.

So I think this is a problem of the structure data storage in the registration workflow.
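To estimate how many entries are affected, a query along these lines could list every accession number that has more than one distinct WURCS sequence stored. It is only a sketch: it reuses the graphs and predicates from the queries earlier in this thread, uses standard SPARQL 1.1 aggregation, and has not been run against the endpoint.

```sparql
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# Accession numbers with more than one distinct WURCS sequence,
# e.g. an expanded (normalized) form stored next to an unexpanded one.
SELECT ?AccessionNumber (COUNT(DISTINCT ?Seq) AS ?wurcsCount)
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s glytoucan:has_primary_id ?AccessionNumber .
  ?s glycan:has_glycosequence ?gseq .
  ?gseq glycan:has_sequence ?Seq .
  ?gseq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
}
GROUP BY ?AccessionNumber
HAVING (COUNT(DISTINCT ?Seq) > 1)
ORDER BY ?AccessionNumber
```

The resulting list would also give a rough size for the enrichment data that needs to be regenerated once the registration workflow is redesigned.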

glytoucan / glytoucan.github.io

glycans with same accession number shown twice #50