openphacts / IdentityMappingService

The Identity Mapping Service to combine BridgeDB and the Validator

Archive and load updated RSC linksets #23

Open stain opened 9 years ago

stain commented 9 years ago

Available at http://ops.rsc.org/download/RDF-2015.10.09.zip or as separate resources from http://ops.rsc.org/download/20151009/void_2015-10-09.ttl by following void:dataDump.

TODO: Modify this build job https://github.com/openphacts/ops-rsc-wikipathways-dataset/ -- not sure if this should be one big ops-rsc-dataset, or probably better, one per linkset.

Now easier to download from http://ops.rsc.org/download/ without authentication needed.

stain commented 9 years ago

RDF syntax issue in 20151009.

Tested with riot --validate from Apache Jena 3.0.0

MESH/ISSUES_MESH20151009.ttl.gz

ERROR riot :: [line: 161, col: 47] Bad character in IRI (space): http://purl.bioontology.org/ontology/MSH/...[space]...

      <http://purl.bioontology.org/ontology/MSH/...  (((4-(1,4,5,6R-trans-tetrahydro-2- pyrimidinyl)phenyl)acetyl)amino)-5-thia-> cheminf:CHEMINF_000560 "Contains completely undefined stereo:
  enantiomers"@en .

MESH/LINKSET_EXACT_MESH20151009.ttl.gz

ERROR riot :: [line: 125, col: 95] Bad character in IRI (space): http://purl.bioontology.org/ontology/MSH/...[space]...

Line 125:

      <http://ops.rsc.org/OPS1965918> skos:exactMatch <http://purl.bioontology.org/ontology/MSH/... th 3-(aminocarbonyl)-1-beta-D-ribofuranosylpyridinium hydroxide inner saltN-> .

stain commented 9 years ago

URI mismatch in void:inDataset statements - date changed during data generation?

    stain@biggie:~/Downloads/rsc/20151009/HUMAN_METABOLOME_DATABASE$ riot * |grep inData | cut -d ">" -f 3

     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_charge_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_isotope_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_stereo_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_super_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_tautomer_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_ops_chemspider_exactMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_fragment_relatedMatch
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#openphacts-human_metabolome_database
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#openphacts-human_metabolome_database

while the VoID consistently says 20151009 or 2015-10-09:

    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#exactMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_ops_chemspider_exactMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#exactMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_charge_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_fragment_relatedMatch> <http://rdfs.org/ns/void#linkPredicate>
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_isotope_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_stereo_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_super_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_tautomer_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .

(Note that void_2015-10-09.ttl here is correct as it is not a .ttl.gz)
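A mismatch like the one above could be caught automatically before loading. The following is only a minimal sketch, assuming the void:inDataset objects have already been extracted as plain strings (for example from the riot output shown above); the function name `dataset_dates` is made up for illustration.

```python
import re
from collections import Counter

# Captures the directory date and the file date in URIs such as
# http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#...
DATE_DIR = re.compile(
    r"http://ops\.rsc\.org/download/(\d{8})/void_(\d{4}-\d{2}-\d{2})\.ttl#"
)

def dataset_dates(uris):
    """Count how often each (directory date, file date) pair occurs."""
    counts = Counter()
    for uri in uris:
        match = DATE_DIR.search(uri)
        if match:
            counts[match.groups()] += 1
    return counts

# Three of the void:inDataset values quoted above.
sample = [
    "<http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_ops_chemspider_exactMatch",
    "<http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch",
    "<http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#openphacts-human_metabolome_database",
]

counts = dataset_dates(sample)
if len(counts) > 1:
    # More than one date pair means the linksets disagree about their VoID.
    print("Inconsistent dataset dates:", dict(counts))
```

A single release should yield exactly one date pair; anything else suggests the date rolled over while the data was being generated.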

stain commented 9 years ago

The VoID has the wrong dataDump directory for HMDB and MESH: the subfolder names are missing.

    :openphacts-human_metabolome_database dcterms:description "The subset of OpenPhacts that contains Human Metabolome Database data."@en;
        dcterms:title "OpenPhacts Human Metabolome Database Subset"@en;
        void:dataDump <http://ops.rsc.org/download/20151009/ISSUES_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/PROPERTIES_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/SYNONYMS_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>;
    :openphacts-mesh dcterms:description "The subset of OpenPhacts that contains MeSH data."@en;
        dcterms:title "OpenPhacts MeSH Subset"@en;
        void:dataDump <http://ops.rsc.org/download/20151009/ISSUES_MESH20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/PROPERTIES_MESH20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/SYNONYMS_MESH20151009.ttl.gz>;

stain commented 9 years ago

The pav:previousVersion statements in the VoID misleadingly point to the same version:

:chebi_exactMatch pav:previousVersion :chebi_exactMatch .
:drugbank_exactMatch pav:previousVersion :drugbank_exactMatch .

Should these go to anchors within the previous VoID release under ftp://ops@ftp.rsc-us.org/OPS/ somewhere?
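Such self-references are easy to flag mechanically. This is a rough sketch only, assuming one statement per line in the prefixed form quoted above; a real check would use a proper Turtle parser, and the function name `self_referencing` is invented for this example.

```python
import re

# Matches lines of the form ":subject pav:previousVersion :object ."
PREVIOUS_VERSION = re.compile(
    r"^\s*(\S+)\s+pav:previousVersion\s+(\S+)\s*\.\s*$"
)

def self_referencing(lines):
    """Return subjects whose pav:previousVersion points back at themselves."""
    found = []
    for line in lines:
        match = PREVIOUS_VERSION.match(line)
        if match and match.group(1) == match.group(2):
            found.append(match.group(1))
    return found

# The two statements quoted above.
statements = [
    ":chebi_exactMatch pav:previousVersion :chebi_exactMatch .",
    ":drugbank_exactMatch pav:previousVersion :drugbank_exactMatch .",
]

for subject in self_referencing(statements):
    print("previousVersion points to itself:", subject)
```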

stain commented 9 years ago

Update for RDF-2015.11.04.zip from http://ops.rsc.org/download/RDF-2015.11.04.zip (2.2 GiB, 20 GB unzipped):

I made a Maven job to archive and patch (still building; download speed from http://ops.rsc.org/ is not ideal, seems to be about 5 MBit/s?). Once archived I can use http://repository.mygrid.org.uk/artifactory/ops/org/openphacts/data/ops-rsc-dataset/20151104-SNAPSHOT/ instead, so not a big issue.

MESH errors remain, but the rest of the linksets are all valid Turtle. I added patches to remove the offending lines; this means those URIs won't have matching links to MESH identifiers.
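For illustration, the effect of such a patch can be approximated by a line filter. This is not the actual patch from the Maven job, just a sketch under the assumption that each statement sits on a single line (as in the linkset files) and that no literal contains `<` or `>`; the example.org IRIs and the function names are made up.

```python
import re

# Any <...> token; in well-formed Turtle an IRI must not contain whitespace.
IRI = re.compile(r"<([^<>]*)>")

def has_bad_iri(line):
    """True if any <...> token on the line contains whitespace."""
    return any(re.search(r"\s", iri) for iri in IRI.findall(line))

def drop_bad_lines(lines):
    """Split lines into (kept, dropped) depending on has_bad_iri."""
    kept, dropped = [], []
    for line in lines:
        (dropped if has_bad_iri(line) else kept).append(line)
    return kept, dropped

# Hypothetical statements: the second has a space inside its object IRI.
lines = [
    "<http://example.org/OPS1> skos:exactMatch <http://example.org/mesh/C000001> .",
    "<http://example.org/OPS2> skos:exactMatch <http://example.org/mesh/bad name> .",
]

kept, dropped = drop_bad_lines(lines)
print(len(kept), "kept;", len(dropped), "dropped")
```

Statements that span several lines, such as the multi-line literal in the ISSUES file, would need a different approach, since removing a single line there would leave a dangling fragment.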

The void:dataDump links are now updated, but now all of them are 404, e.g.

Simply unpacking the zip file in its current download directory should fix this, which would make http://ops.rsc.org/download/20151104/ work.

I see the files are now .ttl instead of .ttl.gz, which increases the disk space required for unzipping tenfold, but I can repackage them in the archival job.
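The repackaging step could look roughly like the following. It is a sketch only, not the archival job itself: it assumes the zip has been unpacked into a local directory, and the `skip_prefix` parameter (used here to leave the uncompressed VoID file alone) is an assumption of this example.

```python
import gzip
import shutil
from pathlib import Path

def gzip_ttl_files(root, skip_prefix="void_"):
    """Compress every .ttl file under root to .ttl.gz and delete the original.

    Files whose name starts with skip_prefix are left untouched.
    Returns the list of .ttl.gz paths that were written.
    """
    compressed = []
    for ttl in sorted(Path(root).rglob("*.ttl")):
        if ttl.name.startswith(skip_prefix):
            continue
        target = ttl.with_name(ttl.name + ".gz")
        # Stream the copy so multi-gigabyte files are never held in memory.
        with open(ttl, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        ttl.unlink()
        compressed.append(target)
    return compressed

# Example call (the directory name is hypothetical):
# gzip_ttl_files("rsc/20151104")
```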