Provide Data Package on individual dataset pages

rgaiacs commented 2 years ago

User story

As a website user
I want to download a Data Package that provides me with a list of all files in the dataset
So that I do not need to scrape the dataset page or click to download every single item that I want.
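
To illustrate the goal from the consumer's side, here is a minimal Python sketch of what a user could do with such a file instead of scraping the page. It assumes the descriptor has a top-level `resources` list whose entries carry `name` and `path`, as in the Data Package spec. The function name `list_files`, the sample descriptor, and the example.org URLs are hypothetical.

```python
import json


def list_files(descriptor_text):
    """Return (name, path) pairs for every resource in a Data Package descriptor."""
    descriptor = json.loads(descriptor_text)
    return [(r["name"], r["path"]) for r in descriptor.get("resources", [])]


# Hypothetical descriptor; a real one would be downloaded from the dataset page.
SAMPLE = """
{
  "name": "example-dataset",
  "resources": [
    {"name": "readme.txt", "path": "https://example.org/pub/readme.txt"},
    {"name": "reads.fastq.gz", "path": "https://example.org/pub/reads.fastq.gz"}
  ]
}
"""

for name, path in list_files(SAMPLE):
    print(name, path)
```

The resulting list can be fed to any download tool, so the user never has to parse the HTML of the dataset page.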

Acceptance criteria

Given a dataset is created in GigaDB
When a website user visits the dataset page
Then a link to the Data Package JSON file is listed next to the FTP link

Additional Info

https://raniere-phd.gitlab.io/frictionless-data-handbook/

Product Backlog Item Ready Checklist

[ ] Business value is clearly articulated
[ ] Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
[ ] Dependencies are identified and no external dependencies would block this item from being completed
[ ] At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
[ ] This item is estimated and small enough to comfortably be completed in one sprint
[ ] Acceptance criteria are clear and testable
[ ] Performance criteria, if any, are defined and testable
[ ] The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

[ ] Item(s) in increment pass all Acceptance Criteria
[ ] Code is refactored to best practices and coding standards
[ ] Documentation is updated as needed
[ ] Data security has not been compromised (with particular reference to the personal information we hold in GigaDB)
[ ] No deviation from the team technology stack and software architecture has been introduced
[ ] The product is in a releasable state (i.e. the increment has not broken anything)

This is part of Epic #1118

only1chunts commented 2 years ago

@rgaiacs - am I correct to think that the Frictionless Data Package would only contain information about the tabular data files?

Some points we may need to consider prior to adding this button include:

validation of tabular data files during submission
what to do with non-tabular data files
potential to include a tabular data file of the sample information (taken from the database)
generation of the json data package file itself
consider whether a button is the right option or whether it should just be added to the file list as another file with a description, file type, format attributes etc...

rgaiacs commented 2 years ago

am I correct to think that the Frictionless Data Package would only contain information about the tabular data files?

You are incorrect. From https://specs.frictionlessdata.io/:

A [Frictionless] Data Package is a simple container format used to describe and package a collection of data (a dataset).

A Data Package can contain any kind of data. At the same time, Data Packages can be specialized and enriched for specific types of data so there are, for example, Tabular Data Packages for tabular data, Geo Data Packages for geo data etc.

The Frictionless Data Package team has focused most of its work on Frictionless Tabular Data Packages, which are limited to tabular data.

what to do with non-tabular data files

They are listed. You don't need to do anything. It is up to the user.

potential to include a tabular data file of the sample information (taken from the database)

This is possible.

only1chunts commented 2 years ago

Actually, this looks like a machine-readable and well-formatted version of our "readme" files, i.e. it has the dataset metadata (resource details) held in the "data package" header region of the JSON file, followed by each file with all its metadata in the "data resource" section, one block per file (see below). The only thing that I don't see a natural place for is the links to the externally hosted associated data, e.g. BioProject accessions, manuscript links, ProteomeXchange, EGA, etc., but presumably we can add our own attributes to accommodate those things.

{
  "name": "our DOI number",
  "datapackage_version": "1.0-beta",
  "title": "gigadb dataset title",
  "description": "...",
  "version": "1.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
// I'm not sure what this section holds in relation to GigaDB?
  "sources": [{ 
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
// This will hold the author list, it can include ORCIDs
  "contributors":[{
    "title": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    // like contributors
  }],
// GigaScience Database can be put in here?
  "publishers": [{
    // like contributors
  }],
// Could any links to other resources go in here? e.g. BioProjects, Manuscripts etc...
  "dependencies": {
    "data-package-name": ">=1.0"
  },
// Each of the files gets listed individually here as a resource
  "resources": [
    {
      "name": "file_name.csv",
      "path": "http://ftp.cngb.cn/pub/..../file_name.csv",
      "title": "", // we don't use title in GigaDB
      "description": "add the file description here",
      "format": "csv", // we call this file format
      "mediatype": "text/csv", // we call this data type
      "encoding": "utf-8",
      "bytes": 1, // file size
      "hash": "", // we use md5sum values
      "schema": "", // this section can be used to define columns in tabular data
      "sources": "",
      "licenses": "" // this will be cc0 unless there is a specific attribute assigned to a file
    },
    {
      // repeat as required
    }
  ],
  // extend your datapackage.json with attributes that are not
  // part of the data package spec
  // we add a views attribute to display Recline Dataset Graph Views
  // in our Data Package Viewer
  "views" : [
    {
      ... see below ...
    }
  ],
  // you can add your own attributes to a datapackage.json, too
  "my-own-attribute": "data-packages-are-awesome",
}
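
On the "generation of the json data package file itself" point, a rough Python sketch of how the descriptor above might be assembled from database records follows. This is only an illustration under assumptions: the input field names (`identifier`, `url`, `size`, `md5`, and so on), the helper names, and the example values are all hypothetical and do not reflect the actual GigaDB schema. The output keys follow the attributes shown in the JSON above.

```python
import json
import re


def resource_name(filename):
    """Lower-case the file name and replace anything outside [a-z0-9._-] with '-'."""
    return re.sub(r"[^a-z0-9._-]", "-", filename.lower())


def build_data_package(dataset, files):
    """Build a Data Package descriptor (a plain dict) from database records."""
    resources = []
    for record in files:
        # Take the last path segment of the URL as the file name.
        filename = record["url"].rsplit("/", 1)[-1]
        resources.append({
            "name": resource_name(filename),
            "path": record["url"],
            "description": record["description"],
            "format": record["format"],
            "bytes": record["size"],
            "hash": record["md5"],
        })
    return {
        "name": dataset["identifier"],
        "title": dataset["title"],
        "description": dataset["description"],
        "resources": resources,
    }


# Hypothetical records standing in for rows from the database (placeholder values).
dataset = {
    "identifier": "example-dataset",
    "title": "Example dataset title",
    "description": "Example dataset description",
}
files = [
    {
        "url": "https://example.org/pub/Sample_Table.csv",
        "description": "Sample information",
        "format": "csv",
        "size": 1024,
        "md5": "00000000000000000000000000000000",
    },
    {
        "url": "https://example.org/pub/reads.fastq.gz",
        "description": "Raw sequencing reads",
        "format": "fastq",
        "size": 2048,
        "md5": "11111111111111111111111111111111",
    },
]

print(json.dumps(build_data_package(dataset, files), indent=2))
```

Non-tabular files such as the FASTQ example go through the same code path as the CSV, since a resource entry only lists the file and does not need a schema.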

rgaiacs commented 2 years ago

the only thing that I dont see a natural place for are the links to the externally hosted associated data, e.g. BioProject accessions, manuscript links, proteomexchange, EGA, etc... but presumably we can add our own attributes to accommodate those things.

Yes, you can add extra attributes.
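
As a small illustration of that, the Python sketch below adds a custom attribute to an existing descriptor. The attribute name `external_links`, its structure, the function name, and the accession value are hypothetical placeholders; nothing here is defined by the Data Package spec or decided for GigaDB.

```python
import copy
import json


def with_external_links(descriptor, links):
    """Return a copy of the descriptor with an extra, non-spec attribute added."""
    extended = copy.deepcopy(descriptor)
    # "external_links" is a made-up attribute name used only for illustration.
    extended["external_links"] = list(links)
    return extended


base = {"name": "example-dataset", "resources": []}
links = [
    {"type": "BioProject", "accession": "PRJNA000000"},  # placeholder accession
    {"type": "manuscript", "url": "https://example.org/manuscript"},
]

print(json.dumps(with_external_links(base, links), indent=2))
```

Tools that only understand the standard attributes can ignore the extra key, while GigaDB-aware tools can read it.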
