qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0

Archival dataset ingest #87

Closed b5 closed 6 years ago

b5 commented 7 years ago

Over at @datatogether we're writing a number of archival-quality tools. I think we should modify the dataset spec to include all the archival details that the WARC spec assumes should be covered.

Once we've implemented #86, we should look at making those URL ingests carry archival backing information.

b5 commented 6 years ago

OK, a few changes and a general plan to make this work. I think we should flip this around and make it possible to add dataset.json references to DT archives. This gives the DT archive the added benefit of POD / DCAT / qri / JSON-LD ready metadata, and makes it possible to register a DT archive with qri. As a side benefit, qri can be used to edit the metadata of an archive, and the cdxj index will become a SQL-queryable database, making it possible to

The benefit to qri is that this archive can be used as a starting point for extracting further datasets. For example, a user could extract data from an Excel file within the archive, referencing the archive as the initial dataset from which the information was extracted. This provides archival provenance as the foundation to build a dataset upon. Users will be able to return to pages in the archive to see the context in which the original data was presented, as a reference for writing metadata or understanding changes over time.

This ups the need for JSON selector syntax support within qri's dataset_sql. It would allow users to query data stored within the cdxj JSON field, which would be hella dope.
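To make the idea concrete, here's a rough sketch of what querying the cdxj JSON field from SQL could look like. This is not qri's dataset_sql and not its selector syntax: it uses Python's built-in `sqlite3` with a hypothetical `json_select(doc, path)` function and a made-up dotted-path syntax, and the sample records, table name, and column names (`cdxj`, `surt`, `ts`, `meta`) are all invented for illustration. It assumes each cdxj line has the shape `<SURT key> <timestamp> <JSON block>`.

```python
import json
import sqlite3

# Made-up sample records, assuming the line shape "<SURT key> <timestamp> <JSON block>".
CDXJ_LINES = [
    'com,example)/ 20180101000000 {"url": "http://example.com/", "mime": "text/html", "status": "200"}',
    'com,example)/data.xlsx 20180101000005 {"url": "http://example.com/data.xlsx", "mime": "application/vnd.ms-excel", "status": "200"}',
    'com,example)/missing 20180101000009 {"url": "http://example.com/missing", "mime": "text/html", "status": "404"}',
]


def parse_cdxj_line(line):
    """Split one cdxj line into (surt, timestamp, json_text)."""
    surt, ts, meta = line.split(" ", 2)
    json.loads(meta)  # fail early if the JSON block is malformed
    return surt, ts, meta


def json_select(doc, path):
    """Hypothetical selector: walk a dotted path ("a.b.c") through a JSON object.

    Returns None when the path does not exist. Nested objects and arrays are
    returned re-serialized as JSON text so SQLite can store them.
    """
    value = json.loads(doc)
    for part in path.split("."):
        if isinstance(value, dict) and part in value:
            value = value[part]
        else:
            return None
    if value is None or isinstance(value, (str, int, float)):
        return value
    return json.dumps(value)


def build_index(lines):
    """Load cdxj lines into an in-memory SQLite table and register the selector."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("json_select", 2, json_select)
    conn.execute("CREATE TABLE cdxj (surt TEXT, ts TEXT, meta TEXT)")
    conn.executemany(
        "INSERT INTO cdxj VALUES (?, ?, ?)",
        [parse_cdxj_line(line) for line in lines],
    )
    return conn


conn = build_index(CDXJ_LINES)
# Select the MIME type of every capture that returned a 200.
rows = conn.execute(
    "SELECT surt, json_select(meta, 'mime') FROM cdxj "
    "WHERE json_select(meta, 'status') = '200' ORDER BY surt"
).fetchall()
for row in rows:
    print(row)
```

Registering the selector as a plain SQL function keeps the query readable and sidesteps any particular engine's JSON operators. Whatever syntax dataset_sql ends up with would presumably look different.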

Steps to complete an initial implementation: