ptsefton opened 6 years ago
Hi Peter, thanks for getting in touch! It would be great to align efforts. Not sure that any of us will be in Amsterdam for Research Object meeting, but would be happy to follow-up online at least.
Hey @ptsefton! I will be there to give a talk on packaging formats. I'd love to chat with you while there!
Wow, Calcyte is super similar to what we thought up. I'm taking a deep dive into this now!
@amoeba Calcyte documentation is really lacking; I'll try to get onto that ASAP, but meanwhile if you want some help, let me know. I worked with @cameronneylon on his dataset via Google Drive, helping with the spreadsheet files — that's an option if you'd like to explore.
Thanks, I'm keen to see how you structured your spreadsheets. With dataspice, we decided to start with three primary input methods to get the spreadsheets filled in:
```r
set_title("My title")
```
I think option 3 was the most promising because we still want to provide ample guidance to the user on how best to fill in the templates. At first glance, it looks like you might be solving some of this with how you designed your spreadsheets. I'll have to take a look!
@amoeba Just pushed an update to Calcyte to fix tests and a bug creating new spreadsheets.
The idea for the spreadsheets came from Mike Lake (@speleolinux) on a previous project; they're generated by the calcyte script and list all the files it finds. Spreadsheets do work but they're not a great UI, and don't really help much with all the relations we want to encourage users to put in so we can link people to the files they created, etc (also tried YAML and that got old very quickly). We have been looking at the best way to provide an interactive app for this and were thinking along the same lines; a web app installed locally, as well as a central one that can talk to data-repositories via their APIs. We want to make it so the user can look up contextual metadata, such as people and places and equipment, using auto-complete and drop downs. We have a project at UTS to build a service catalog/provisioning system for research apps and data storage - part of this will be an index of contextual entities, people, machines etc.
We will definitely take a look at your Shiny thing ASAP.
Re the structure of Calcyte spreadsheets: we used xlsx because it supports multiple worksheets, meaning you can have separate tables of people, places, equipment, etc.
(But calcyte is not the only tool, we have people working on scripts to export from a range of data-management systems such as Omero)
Thanks for the info. I think metadata authoring is both something a lot of groups have worked on and also something that's not quite "there yet". One of NCEAS' oldest and now nearly-deprecated tools, Morpho (PDF), does (I think) a really nice job of helping author metadata about datasets, export it locally, upload it to a repository, and can even eventually produce a PDF and HTML representation of the metadata, similar to dataspice and DataCrate. NCEAS is now working on a web-based package editor that speaks EML and OAI-ORE Resource Maps as a modern replacement for Morpho. These types of tools target users whose datasets are not necessarily part of a data management system.
I've had a go at transforming Dataspice metadata into datacrate metadata to see what happened. I wrote a quick script to transform the metadata to the DataCrate standard (flattened JSON, with an inline @context).
The result is here: https://data.research.uts.edu.au/examples/v1.0/dataspice-crate/CATALOG.html
I have included the files, but I didn't give them a user-friendly name and description.
In a DataCrate we'd usually try to get a DOI for the dataset, eg by pre-issuing one in Zenodo, and to add URIs for the people.
TODO: Add a map display to Calcyte-generated HTML.
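The nesting-to-flat transformation described above can be sketched in miniature: nested Schema.org nodes are hoisted into a flat `@graph`, leaving `{"@id": ...}` references behind. This is a hypothetical, simplified illustration of the idea only — the real conversion would use a proper JSON-LD flatten algorithm and handle arrays, blank nodes, and the `@context` correctly:

```python
import json

def flatten_graph(node, graph):
    """Recursively hoist nested objects that carry an @id into a flat
    graph, leaving an {"@id": ...} reference in their place."""
    for key, value in list(node.items()):
        if isinstance(value, dict) and "@id" in value:
            flatten_graph(value, graph)
            node[key] = {"@id": value["@id"]}
    if "@id" in node and node["@id"] not in graph:
        graph[node["@id"]] = node

# A toy nested dataspice-style record (hypothetical data)
nested = {
    "@id": "./",
    "@type": "Dataset",
    "creator": {"@id": "#peter", "@type": "Person", "name": "Peter"},
}

flat = {}
flatten_graph(nested, flat)
crate = {"@context": "http://schema.org/", "@graph": list(flat.values())}
print(json.dumps(crate, indent=2))
```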
Very nice @ptsefton! Since dataspice metadata is just Schema.org-only JSON-LD, I imagine the translation was simple? I notice your JSON-LD structure is quite different from ours. I am new to JSON-LD, but it looks like you're using a named graph and lots of blank nodes in your serialization. IIRC you can convert (frame?) that format into a more compact one. Is there a reason the DataCrate CATALOG.json is formatted that way?
@ptsefton yes, it looks to me like your format is effectively just the result of applying the flatten() algorithm to the dataspice example? (see playground example)
Like @amoeba says, I think going back to the original format can be done by just applying a frame. Technically, I think most tools should be agnostic to the format, i.e. a tool designed to consume dataspice would probably always apply its desired frame or compaction to get the data into a canonical format anyway, which means it should be compatible with DataCrate data (or anyone else using the schema.org/Dataset description). (Likewise, Google's Dataset Search engine should be equally happy parsing both formats.)
Also, it looks like you use https:// URLs; note that the canonical URLs for Schema.org are specified in http://. (You may need to use the canonical form for things like Google's Dataset Search to be able to recognize and parse this.)
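To illustrate the point about consumers being format-agnostic: re-nesting a flat graph is a small operation once you index by @id. Here's a hypothetical sketch of that idea in plain Python — a real tool would apply a JSON-LD frame via a proper processor instead:

```python
def embed(node, lookup, seen=None):
    """Replace bare {"@id": ...} references with the referenced node,
    guarding against cycles via the `seen` set."""
    seen = seen or set()
    result = {}
    for key, value in node.items():
        if (isinstance(value, dict) and set(value) == {"@id"}
                and value["@id"] in lookup and value["@id"] not in seen):
            result[key] = embed(lookup[value["@id"]], lookup,
                                seen | {value["@id"]})
        else:
            result[key] = value
    return result

# Hypothetical flattened graph, shaped like a DataCrate CATALOG.json
graph = [
    {"@id": "./", "@type": "Dataset", "creator": {"@id": "#carl"}},
    {"@id": "#carl", "@type": "Person", "name": "Carl"},
]
lookup = {item["@id"]: item for item in graph}
nested = embed(lookup["./"], lookup)
```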
@cboettig Thanks Carl - I'll fix the URLs to be canonical in the spec.
@amoeba We specified the flattened format because it is MUCH easier for programmers to work with than a nested structure: they can build a simple index of the graph by @id and path, then traverse it without having to mess with JSON-LD processing as such. Your example is very simple, but when you have hundreds of files, with multiple creators, locations, etc., then framing gets very hard (I started out designing DataCrate with framed, nested data, but it didn't scale and was too hard to process).
Re the blank nodes: the goal is to get rid of a lot of them, though the ones for properties are unavoidable. For example, I think it's best if people have @ids, preferably ORCIDs (one of the people in your sample seems to have one; the other doesn't, but I didn't mess with your data and add them). Likewise for equipment, software, etc. — it's best if these have their own URLs. This is all covered in the spec. (I also need to make the HTML not display blank-node IDs, as they're not helpful.)
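To make the blank-node point concrete, here's a hedged sketch of the two shapes as Python dicts (the names and the ORCID below are purely illustrative — the ORCID is the well-known example identifier, not a real person in either dataset):

```python
# Blank node: the person can only be referenced within this one document
blank_person = {"@id": "_:b0", "@type": "Person", "name": "Jane Doe"}

# Globally identified: an ORCID @id makes the person linkable across
# datasets, repositories, and search engines
orcid_person = {
    "@id": "https://orcid.org/0000-0002-1825-0097",
    "@type": "Person",
    "name": "Jane Doe",
}
```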
@ptsefton that's a great point about scaling with nested vs flat format, and a compelling reason for the flat format.
I'd love to learn a bit more about how you're processing / traversing the data. E.g. do you query the JSON text directly (good ol' cat/sed/awk pipes)? Use jq or another JSON query language? Import into Virtuoso or another RDF database and do SPARQL? Treat the flat graph as a table object and query it with standard SQL? Or query the JSON files directly with SQL via Apache Drill or similar? (I've played around with a variety of these but haven't really settled on a strategy myself.)
@cboettig Processing / traversing has so far mostly been limited to generating the HTML pages for DataCrate and parsing DataCrates for upload. Here's a little hack I did for a conference: a quick and dirty Python script to visualise provenance relationships in our sample file using PlantUML, which I happen to know. You can see in that linked script how simple it is to build an index so things can be looked up by @id. AFAIK there are no Python or JavaScript JSON-LD libraries that make this sort of processing easy.
```python
# dc is the parsed CATALOG.json; index every item in the flattened
# graph by its @id for constant-time lookups while traversing
id_lookup = {}
for item in dc["@graph"]:
    id_lookup[item["@id"]] = item
The result:
```
@startuml
"Peter Sefton" as Peter_Sefton
[CreateAction:\n Took dog picture] -up-> Peter_Sefton : agent
[CreateAction:\n Took dog picture] -down-> [ImageObject:\npics/2017-06-11 12.56.14.jpg] : result
[CreateAction:\n Took dog picture] -> [Place:\nCatalina Park] : object
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nEPL1 Camera] : instrument
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nLumix G 20/F1.7 lens] : instrument
[CreateAction:\n Converted dog picture to sepia] -up-> Peter_Sefton : agent
[CreateAction:\n Converted dog picture to sepia] -down-> [ImageObject:\npics/sepia_fence.jpg] : result
[CreateAction:\n Converted dog picture to sepia] -> [ImageObject:\npics/2017-06-11 12.56.14.jpg] : object
[CreateAction:\n Converted dog picture to sepia] --down--> [SoftwareApplication:\nImageMagick] : instrument
@enduml
```
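The gist of driving PlantUML from the flat graph can be sketched like this. This is a simplified, hypothetical reconstruction, not the actual conference script; it uses the Schema.org CreateAction properties seen above and the @id index idea, but drops the per-property arrow directions for brevity:

```python
def provenance_uml(graph):
    """Emit PlantUML component-diagram lines for CreateAction provenance
    found in a flattened JSON-LD @graph."""
    lookup = {item["@id"]: item for item in graph}
    lines = ["@startuml"]
    for item in graph:
        if item.get("@type") != "CreateAction":
            continue
        action = f"[CreateAction:\\n {item.get('name', item['@id'])}]"
        for prop in ("agent", "result", "object", "instrument"):
            ref = item.get(prop)
            if isinstance(ref, dict) and ref.get("@id") in lookup:
                target = lookup[ref["@id"]]
                label = f"[{target.get('@type')}:\\n{target.get('name')}]"
                lines.append(f"{action} --> {label} : {prop}")
    lines.append("@enduml")
    return "\n".join(lines)

# Toy graph (hypothetical data, echoing the sample above)
graph = [
    {"@id": "#act1", "@type": "CreateAction", "name": "Took dog picture",
     "agent": {"@id": "#peter"}, "result": {"@id": "pics/dog.jpg"}},
    {"@id": "#peter", "@type": "Person", "name": "Peter Sefton"},
    {"@id": "pics/dog.jpg", "@type": "ImageObject", "name": "pics/dog.jpg"},
]
print(provenance_uml(graph))
```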
Hi,
I just became aware of this project, which looks very promising.
We have been working on a similar effort to package research data with Schema.org JSON, also with an HTML file: the DataCrate spec. It looks like the JSON-LD we're producing with various tools (all of which are still in alpha) is quite similar to what you have here.
Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? I will be.