sbmlteam / libCombine

a C++ library for working with the COMBINE Archive format
BSD 2-Clause "Simplified" License

Provide option to get a list of OMEX metadata files in a COMBINE archive #39

Closed jonrkarr closed 3 years ago

jonrkarr commented 3 years ago

Although metadata files appear in manifest XML files, libCOMBINE manifest objects don't show them. This would be helpful for working with the new libOMEXMeta library.

For example, the code below doesn't print the locations of the OMEX metadata files:

import libcombine

archive = libcombine.CombineArchive()
archive.initializeFromArchive('Caravagna-J-Theor-Biol-2010-tumor-suppressive-oscillations.omex')
for location in archive.getAllLocations():
    print(location)

manifest = archive.getManifest()
for content in manifest.getListOfContents():
    print(content.getLocation())

fbergmann commented 3 years ago

Right, the library was written with the limited metadata support outlined in the COMBINE Archive specification. Unfortunately, when omex-meta came out, they decided to use the same namespace, which is the reason for these issues. I'll think of an option to completely disable all metadata processing.

Could I get a test file that was generated by omexmeta?

jonrkarr commented 3 years ago

Thanks for the quick reply.

I agree that using the same format in COMBINE manifests could be confusing. I think it could be helpful for the new version to append a version number to the format.

Here's one example file I've discussed with the team for the new library.

My understanding is that the new standard is a graph of RDF triples. The new library can read and write this graph in multiple formats, including RDF-XML. I'm not sure how the particular format is intended to be captured by the CaContent.format attribute. On top of this, the new standard has guidelines for referencing objects (e.g., rdf:about="http://omex-library/OmexFile.omex" rather than rdf:about=".") and for specific predicates (e.g., dc:creator for author) and objects (e.g., ORCID id to identify a person).

The new library considers the files created by this library valid. But the new guidelines would recommend encoding the metadata differently, such as FOAF instead of VCard and only storing author ORCIDs and not their names.

Because the new format/library are intended to be used differently, I don't think it maps perfectly to libCOMBINE's OMEXDescription class. I think it's sufficient for people to use the new library to work with the new files.

jonrkarr commented 3 years ago

For now, I can work around this by using libCOMBINE to unpack the manifest.xml file, extracting the locations of the metadata files from this (e.g., with lxml), and then appending these locations to the internal data structure we're using to represent archives.

fbergmann commented 3 years ago

Actually, I think this has already been addressed in a past release. I added an optional flag, used when opening from an archive, to stop all metadata processing:

bool initializeFromArchive(archiveFile, skipOmex /*= false*/)

So if you pass True, all the metadata elements will remain untouched (and no metadata elements will be created). That way you can process the metadata manually.

jonrkarr commented 3 years ago

I'm looking for something simpler -- just the ability to read manifests, returning everything in them irrespective of the format. This is addressed by the examples referenced in #42.