Open kiyoko opened 8 years ago
wurcsrdf is using a different glycosequence URI.
This will be resolved in https://github.com/glytoucan/glytoucan.github.io/issues/55
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?Seq
  ?gseq
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s a glycan:saccharide .
  ?s glytoucan:has_primary_id "G06216SX" .
  ?s glycan:has_glycosequence ?gseq .
  ?gseq glycan:has_sequence ?Seq .
  ?gseq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs
}
A quick way to fix this would be to make sure the results are distinct and to remove the glycosequence URI from the SELECT clause.
If you comment out the ?gseq line above, the query returns one row. @shinmachi does this make sense?
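For reference, a minimal sketch of the query with that change applied. It assumes both glycosequence URIs carry an identical sequence string, which is what the one-row result suggests; if the strings differ, DISTINCT will still return two rows.

```sparql
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# ?gseq is still matched in the WHERE clause but is no longer projected,
# so rows that differ only in the glycosequence URI collapse under DISTINCT.
SELECT DISTINCT ?Seq
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s a glycan:saccharide ;
     glytoucan:has_primary_id "G06216SX" ;
     glycan:has_glycosequence ?gseq .
  ?gseq glycan:has_sequence ?Seq ;
        glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
}
```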
Reviewed the SELECT for the main_list stanza: it does not select the glycosequence URI.
The main_list stanza SPARQL query follows, with the value G06216SX added.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT ?AccessionNumber ?WURCSLabel ?GlycoCT ?Mass ?MassLabel ?Contributor ?ContributionTime ?MotifNames
WHERE {
  {
    SELECT DISTINCT ?AccessionNumber ?WURCSLabel ?GlycoCT ?Mass ?MassLabel ?Contributor ?ContributionTime ((GROUP_CONCAT (DISTINCT ?MotifName, ', ')) as ?MotifNames)
    FROM <http://rdf.glytoucan.org/core>
    FROM NAMED <http://rdf.glytoucan.org/mass>
    FROM <http://rdf.glytoucan.org/sequence/wurcs>
    FROM <http://rdf.glytoucan.org/sequence/glycoct>
    FROM <http://rdf.glytoucan.org/users>
    FROM <http://rdf.glytoucan.org/ms/carbbank>
    FROM <http://rdf.glytoucan.org/motif>
    WHERE {
      # repository RDF
      VALUES ?AccessionNumber {"G06216SX"}
      # Accession Number
      ?glycan glytoucan:has_primary_id ?AccessionNumber .
      # WURCS
      OPTIONAL {
        ?glycan glycan:has_glycosequence ?wcsSeq .
        ?wcsSeq rdfs:label ?WURCSLabel .
        ?wcsSeq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
      }
      # GlycoCT
      OPTIONAL {
        ?glycan glycan:has_glycosequence ?gctSeq .
        ?gctSeq glycan:has_sequence ?GlycoCT .
        ?gctSeq glycan:in_carbohydrate_format glycan:carbohydrate_format_glycoct .
      }
      # Mass
      # a repeat structure does not have a mass value
      OPTIONAL {
        GRAPH <http://rdf.glytoucan.org/mass> {
          ?glycan glytoucan:has_derivatized_mass ?dmass .
          ?dmass rdfs:label ?MassLabel .
          OPTIONAL {
            ?dmass glytoucan:has_mass ?Mass .
          }
        }
      }
      # Motif
      OPTIONAL {
        ?glycan glycan:has_motif ?motif .
        ?motif rdfs:label ?MotifName .
      }
      # Contributor
      OPTIONAL {
        ?glycan glycan:has_resource_entry ?res .
        ?res glytoucan:date_registered ?ContributionTime ;
             glytoucan:contributor ?c .
        ?c foaf:name ?Contributor .
      }
    }
    ORDER BY ?AccessionNumber
  }
}
OFFSET 20
LIMIT 20
Two types of WURCS and mass label patterns are causing the duplication.
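One way to see where the extra rows come from is to project the intermediate nodes that main_list hides. The sketch below reuses only the graphs and predicates from the main_list query; the choice to expose ?wcsSeq and ?dmass is mine, and I have not confirmed the output against the live endpoint.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# Show every WURCS glycosequence node and every derivatized-mass node
# attached to the accession, so each duplicated label can be traced to its URI.
SELECT DISTINCT ?wcsSeq ?WURCSLabel ?dmass ?MassLabel
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
FROM NAMED <http://rdf.glytoucan.org/mass>
WHERE {
  ?glycan glytoucan:has_primary_id "G06216SX" .
  OPTIONAL {
    ?glycan glycan:has_glycosequence ?wcsSeq .
    ?wcsSeq rdfs:label ?WURCSLabel ;
            glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
  }
  OPTIONAL {
    GRAPH <http://rdf.glytoucan.org/mass> {
      ?glycan glytoucan:has_derivatized_mass ?dmass .
      ?dmass rdfs:label ?MassLabel .
    }
  }
}
```

Because the two OPTIONAL blocks are independent, each WURCS label is paired with each mass label, so two labels of each kind would give up to four rows for a single accession.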
OK, this is another reason to prioritize https://github.com/glytoucan/glytoucan.github.io/issues/55
However, the bottleneck is that we have to redesign the registration workflow, which will require a lot more work. Flagged as Help wanted: we need a resource to manage this analysis and development.
Once the registration redesign is complete, it should be possible to easily wipe out the WURCS enrichment data (which is separate from the WURCS core data, i.e. the sequence) and re-run the enrichment batch.
Currently we don't have this infrastructure in place.
On Thu, Aug 18, 2016 at 5:36 PM, Daisuke shinmachi <notifications@github.com> wrote:
Two types of WURCS and mass label patterns are causing the duplication.
[image: 2016-08-18 17 24 01] https://cloud.githubusercontent.com/assets/4487545/17766971/daf9d1b2-6568-11e6-98f2-d1fcdd346207.png
[image: 2016-08-18 17 28 26] https://cloud.githubusercontent.com/assets/4487545/17767089/5e6ca2cc-6569-11e6-8f7c-4fc659e5a066.png
I have the same opinion as nobu.
There are two WURCS strings, but they represent the same structure. This problem is related to the expansion of repeating units that have a known repeat count.
In WURCS 2.0, repeating units with a known repeat count (and no range) must be expanded and the repeat information removed. The expansion is implemented as the WURCS normalization system in WURCSFramework and is now available. However, the unexpanded (non-standard) WURCS is also stored. The two mass entries have the same cause: the mass of a WURCS containing a repeating unit cannot be calculated, even if the repeat count is known and has no range.
So I think this is a problem with how structure data is stored in the registration workflow.
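To gauge how many entries are affected by this storage problem, something like the following might help. It is only a sketch: it assumes that any accession with more than one WURCS glycosequence node is a candidate, which may also catch cases unrelated to repeat expansion.

```sparql
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

# List accessions that have more than one WURCS glycosequence node,
# ordered so the most heavily duplicated entries come first.
SELECT ?AccessionNumber (COUNT(DISTINCT ?gseq) AS ?wurcsCount)
FROM <http://rdf.glytoucan.org/core>
FROM <http://rdf.glytoucan.org/sequence/wurcs>
WHERE {
  ?s glytoucan:has_primary_id ?AccessionNumber ;
     glycan:has_glycosequence ?gseq .
  ?gseq glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs .
}
GROUP BY ?AccessionNumber
HAVING (COUNT(DISTINCT ?gseq) > 1)
ORDER BY DESC(?wurcsCount)
```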
When I sort by largest mass, the top two structures are the same: G06216SX