sbmlteam / libCombine

a C++ library for working with the COMBINE Archive format
BSD 2-Clause "Simplified" License

Metadata/Description not read from archives #13

Closed matthiaskoenig closed 7 years ago

matthiaskoenig commented 7 years ago

I think the metadata reading is broken. When trying to read the metadata for the CombineArchiveShowcase one gets empty metadata despite metadata existing for all files.

Output of the test script ./test.sh in the Python examples:

********************************************************************************
Print archive: testdata/CombineArchiveShowCase.omex
********************************************************************************
  metadata for '.':
     Created : 2000-01-01T00:00:00Z
     # Creators: 1
       Vasundra Toure
Num Entries: 20
 0: location: ./manifest.xml format: http://identifiers.org/combine.specifications/omex-manifest
  no metadata for './manifest.xml'
 1: location: ./README.md format: http://purl.org/NET/mediatypes/text/x-markdown
  no metadata for './README.md'
 2: location: ./model/BIOMD0000000144.xml format: http://identifiers.org/combine.specifications/sbml.level-2.version-1
  no metadata for './model/BIOMD0000000144.xml'
 3: location: ./model/calzone_2007.ai format: http://purl.org/NET/mediatypes/application/illustrator
  no metadata for './model/calzone_2007.ai'
 4: location: ./model/calzone_2007.png format: http://purl.org/NET/mediatypes/image/png
  no metadata for './model/calzone_2007.png'
 5: location: ./model/calzone_2007.svg format: http://purl.org/NET/mediatypes/image/svg+xml
  no metadata for './model/calzone_2007.svg'
 6: location: ./model/calzone_thieffry_tyson_novak_2007.cellml format: http://identifiers.org/combine.specifications/cellml
  no metadata for './model/calzone_thieffry_tyson_novak_2007.cellml'
 7: location: ./model/sbgn/Calzone2007.sbgn format: http://identifiers.org/combine.specifications/sbgn.pd
  no metadata for './model/sbgn/Calzone2007.sbgn'
 8: location: ./model/sbgn/Calzone2007.gml format: http://purl.org/NET/mediatypes/text/plain
  no metadata for './model/sbgn/Calzone2007.gml'
 9: location: ./model/sbgn/Calzone2007.graphml format: http://purl.org/NET/mediatypes/application/xml
  no metadata for './model/sbgn/Calzone2007.graphml'
 10: location: ./model/sbgn/Calzone2007.png format: http://purl.org/NET/mediatypes/image/png
  no metadata for './model/sbgn/Calzone2007.png'
 11: location: ./model/sbgn/Calzone2007.pdf format: http://purl.org/NET/mediatypes/application/pdf
...

The metadata.rdf contains all the information, but it is not accessible via libCombine:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2006/vcard/ns#">
    <rdf:Description rdf:about=".">
        <dcterms:description>archive created using masymos2CAT by cloning the repository http://models.cellml.org/workspace/calzone_thieffry_tyson_novak_2007 (Tue Apr 07 14:18:50 CEST 2015) -- see https://sems.uni-rostock.de</dcterms:description>
        <dcterms:creator>
            <rdf:Bag>
                <rdf:li rdf:parseType="Resource">
                    <vCard:n rdf:parseType="Resource">
                        <vCard:family-name>Scharm</vCard:family-name>
                        <vCard:given-name>Martin</vCard:given-name>
                    </vCard:n>
                    <vCard:email>martin.scharm@uni-rostock.de</vCard:email>
                    <vCard:org rdf:parseType="Resource">
                        <vCard:organization-name>University of Rostock</vCard:organization-name>
                    </vCard:org>
                </rdf:li>
            </rdf:Bag>
        </dcterms:creator>
        <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:10Z</dcterms:W3CDTF>
        </dcterms:created>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:10Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:28Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:32Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:34Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-05-27T16:09:36Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-03T10:30:45Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T11:43:31Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T13:10:05Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T13:16:25Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T13:31:54Z</dcterms:W3CDTF>
        </dcterms:created>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T16:43:21Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T16:47:16Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T18:12:06Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2015-06-11T15:55:16Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2016-06-03T21:15:39Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2016-06-05T11:55:27Z</dcterms:W3CDTF>
        </dcterms:modified>
        <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2016-10-13T09:40:00Z</dcterms:W3CDTF>
        </dcterms:modified>
    </rdf:Description>
    <rdf:Description rdf:about="./README.md">
        <dcterms:description>README describing the archive</dcterms:description>
        <dcterms:creator>
            <rdf:Bag>
                <rdf:li rdf:parseType="Resource">
                    <vCard:n rdf:parseType="Resource">
                        <vCard:family-name>Scharm</vCard:family-name>
                        <vCard:given-name>Martin</vCard:given-name>
                    </vCard:n>
                    <vCard:email>martin.scharm@uni-rostock.de</vCard:email>
                    <vCard:org rdf:parseType="Resource">
                        <vCard:organization-name>University of Rostock</vCard:organization-name>
                    </vCard:org>
                </rdf:li>
...

The example displays only a single creator for '.' (there are multiple creators), does not show the modified dates, and misses all metadata for the other entries.

Probably the hashmap that maps locations to their metadata is not populated correctly.

M
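
To make the suspected failure concrete, here is a minimal Python sketch of what a correctly populated location-to-metadata map could look like for this RDF. It is not libCombine's implementation: it uses only the standard library, the function name `index_metadata` and the `SAMPLE` string are hypothetical, and the sample is a heavily shortened version of the metadata.rdf quoted above.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "vCard": "http://www.w3.org/2006/vcard/ns#",
}
RDF_ABOUT = "{" + NS["rdf"] + "}about"


def index_metadata(rdf_text):
    """Map every rdf:about location to the metadata in its Description (hypothetical helper)."""
    index = {}
    root = ET.fromstring(rdf_text)
    for desc in root.findall("rdf:Description", NS):
        # One entry per location; repeated Descriptions for a location are merged.
        entry = index.setdefault(desc.get(RDF_ABOUT), {
            "description": None, "creators": [], "created": [], "modified": []})
        text = desc.findtext("dcterms:description", namespaces=NS)
        if text is not None:
            entry["description"] = text
        for li in desc.findall("dcterms:creator/rdf:Bag/rdf:li", NS):
            given = li.findtext("vCard:n/vCard:given-name", default="", namespaces=NS)
            family = li.findtext("vCard:n/vCard:family-name", default="", namespaces=NS)
            entry["creators"].append((given + " " + family).strip())
        # Both created and modified may occur several times, so keep lists.
        for key in ("created", "modified"):
            for node in desc.findall("dcterms:%s/dcterms:W3CDTF" % key, NS):
                entry[key].append(node.text)
    return index


# Shortened sample based on the metadata.rdf quoted in this issue.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:vCard="http://www.w3.org/2006/vcard/ns#">
  <rdf:Description rdf:about=".">
    <dcterms:creator><rdf:Bag><rdf:li rdf:parseType="Resource">
      <vCard:n rdf:parseType="Resource">
        <vCard:family-name>Scharm</vCard:family-name>
        <vCard:given-name>Martin</vCard:given-name>
      </vCard:n>
    </rdf:li></rdf:Bag></dcterms:creator>
    <dcterms:created rdf:parseType="Resource">
      <dcterms:W3CDTF>2015-05-27T16:09:10Z</dcterms:W3CDTF>
    </dcterms:created>
    <dcterms:modified rdf:parseType="Resource">
      <dcterms:W3CDTF>2015-05-27T16:09:28Z</dcterms:W3CDTF>
    </dcterms:modified>
    <dcterms:modified rdf:parseType="Resource">
      <dcterms:W3CDTF>2016-10-13T09:40:00Z</dcterms:W3CDTF>
    </dcterms:modified>
  </rdf:Description>
  <rdf:Description rdf:about="./README.md">
    <dcterms:description>README describing the archive</dcterms:description>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    for location, meta in index_metadata(SAMPLE).items():
        print(location, meta)
```

With a map like this, a lookup for './README.md' would return its description instead of "no metadata", and '.' would carry every modified date rather than none.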

fbergmann commented 7 years ago

This is fixed. However, some of the creators in the showcase were not valid (invalid email addresses, missing family names). I added support for them, but I'm not sure other tools do.

binfalse commented 7 years ago

@fbergmann could you please be a bit more specific about what you mean by 'not valid'? The creators' names are defined using vcard:n, which has a range of vcard:Name, which in turn can have at most one vcard:family-name and at most one vcard:given-name property. I'm not sure, but I don't think both MUST be present?

As for the mail addresses: I guess you're referring to nobody@models.cellml.org and Hanne@hanne-nielsens-macbook.local? Do you actually parse and verify mail addresses, and if so, how do you check whether they are valid? Granted, .local is not valid on the public internet, but it may be valid in private networks etc.

After all, that is what users actually provide; see, e.g., the Author information at the top of the page: http://models.cellml.org/workspace/calzone_thieffry_tyson_novak_2007/file/e6e9d2607d3fbcadd534e7f4ceeb5767ccc477cd/calzone_thieffry_tyson_novak_2007.cellml

In my opinion, that shouldn't be treated as invalid.

fbergmann commented 7 years ago

Hello Martin (@binfalse).

If you look at commit https://github.com/sbmlteam/libCombine/commit/2cd7dc6661c0ab44e78c78acddfe4bab04aaec60 you will see what I meant. Previously, I expected email addresses to be written as an RDF element hasEmail, from which I would read the email in the resource attribute. In the archive I instead found an email element with a text child. Apart from that, I previously discarded vcards when the family name and given name were not both specified.

The reason for my argument was that in the omex specification only one format was listed. I don't mind accepting both, but I'm not sure other tools would. (If I had my way we would have used the same vcard format as SBML, but that ship has sailed :) ).