gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0

Provide Data Package on individual dataset pages #1095

Open rgaiacs opened 2 years ago

rgaiacs commented 2 years ago

User story

As a website user
I want to download a Data Package that provides me with a list of all files in the dataset
So that I do not need to scrape the dataset page or click to download every single item that I want.

Acceptance criteria

Given a dataset is created in GigaDB
When a website user visits the dataset page
Then a link to the Data Package JSON file is listed next to the FTP link
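
To illustrate the user story, here is a minimal Python sketch of how a user might list every file URL from a Data Package descriptor instead of scraping the page. The descriptor content and the `example.org` URLs are made up for illustration; GigaDB's real descriptor does not exist yet and may look different.

```python
import json


def list_resource_paths(descriptor_text):
    """Return the path of every resource listed in a Data Package descriptor."""
    descriptor = json.loads(descriptor_text)
    paths = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        # The Data Resource spec allows "path" to be a string or a list of strings.
        if isinstance(path, str):
            paths.append(path)
        elif isinstance(path, list):
            paths.extend(path)
    return paths


# Hypothetical descriptor, standing in for a downloaded datapackage.json.
example = json.dumps({
    "name": "example-dataset",
    "resources": [
        {"name": "reads", "path": "https://example.org/data/reads.fastq.gz"},
        {"name": "samples", "path": "https://example.org/data/samples.csv"},
    ],
})

for url in list_resource_paths(example):
    print(url)
```

The resulting list could then be handed to any download tool, which is the point of the user story: one small file replaces clicking every item.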

Additional Info

Product Backlog Item Ready Checklist

Product Backlog Item Done Checklist

This is part of Epic #1118

only1chunts commented 2 years ago

@rgaiacs - am I correct to think that the Frictionless Data Package would only contain information about the tabular data files?

Some points we may need to consider prior to adding this button include:

rgaiacs commented 2 years ago

am I correct to think that the Frictionless Data Package would only contain information about the tabular data files?

You are incorrect. From https://specs.frictionlessdata.io/:

A [Frictionless] Data Package is a simple container format used to describe and package a collection of data (a dataset).

A Data Package can contain any kind of data. At the same time, Data Packages can be specialized and enriched for specific types of data so there are, for example, Tabular Data Packages for tabular data, Geo Data Packages for geo data etc.

The Frictionless Data Package team has focused most of its work on Frictionless Tabular Data Packages, which are limited to tabular data.

what to do with non-tabular data files

They are listed. You don't need to do anything. It is up to the user.
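
As a rough sketch of what "up to the user" could mean in practice, a user might split the listed resources by their `format` value. The function name, the set of formats treated as tabular, and the sample resources are assumptions made for illustration only.

```python
def split_by_format(resources, tabular_formats=("csv", "tsv")):
    """Split resources into (tabular, other) based on their "format" value."""
    tabular, other = [], []
    for resource in resources:
        fmt = (resource.get("format") or "").lower()
        if fmt in tabular_formats:
            tabular.append(resource)
        else:
            other.append(resource)
    return tabular, other


# Hypothetical resources, as they might appear in a descriptor.
resources = [
    {"name": "samples", "format": "csv"},
    {"name": "assembly", "format": "fasta"},
    {"name": "readme", "format": "txt"},
]

tabular, other = split_by_format(resources)
print([r["name"] for r in tabular])
print([r["name"] for r in other])
```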

potential to include a tabular data file of the sample information (taken from the database)

This is possible.
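
One way this might look, sketched in Python: the sample information becomes a tabular resource with inline `data` and a Table Schema. The column names (`sample_id`, `species`, `read_count`) and the rows are invented for illustration and are not GigaDB's actual sample attributes.

```python
import json


def build_sample_resource(rows, field_types):
    """Build a tabular resource that carries the sample rows inline."""
    return {
        "name": "samples",
        "profile": "tabular-data-resource",
        "data": rows,
        "schema": {
            "fields": [
                {"name": name, "type": type_name}
                for name, type_name in field_types.items()
            ]
        },
    }


# Hypothetical rows, standing in for values read from the database.
sample_rows = [
    {"sample_id": "S1", "species": "Homo sapiens", "read_count": 1200},
    {"sample_id": "S2", "species": "Mus musculus", "read_count": 950},
]
sample_resource = build_sample_resource(
    sample_rows,
    {"sample_id": "string", "species": "string", "read_count": "integer"},
)
print(json.dumps(sample_resource, indent=2))
```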

only1chunts commented 2 years ago

Actually, this looks like a machine-readable, well-formatted version of our "readme" files, i.e. it has the dataset metadata (resource details) held in the "data package" header region of the JSON file, followed by each file with all its metadata in the "data resource" section, one block per file (see below). The only thing that I don't see a natural place for is the links to externally hosted associated data, e.g. BioProject accessions, manuscript links, ProteomeXchange, EGA, etc., but presumably we can add our own attributes to accommodate those.

{
  "name": "our DOI number",
  "datapackage_version": "1.0-beta",
  "title": "gigadb dataset title",
  "description": "...",
  "version": "1.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
// I'm not sure what this section holds in relation to GigaDB?
  "sources": [{ 
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
// This will hold the author list, it can include ORCIDs
  "contributors":[{
    "title": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    // like contributors
  }],
// GigaScience Database can be put in here?
  "publishers": [{
    // like contributors
  }],
// Could any links to other resources go in here? e.g. BioProjects, manuscripts etc...
  "dependencies": {
    "data-package-name": ">=1.0"
  },
// Each of the files gets listed individually here as a resource
  "resources": [
    {
      "name": "file_name.csv",
      "path": "http://ftp.cngb.cn/pub/..../file_name.csv",
      "title": "", // we don't use title in GigaDB
      "description": "add the file description here",
      "format": "csv", // we call this file format
      "mediatype": "text/csv", // we call this data type
      "encoding": "utf-8",
      "bytes": 1, // file size
      "hash": "", // we use md5sum values
      "schema": "", // this section can be used to define columns in tabular data
      "sources": "",
      "licenses": "" // this will be CC0 unless there is a specific attribute assigned to a file
    },
    {
      // repeat as required
    }
  ],
  // extend your datapackage.json with attributes that are not
  // part of the data package spec
  // we add a views attribute to display Recline Dataset Graph Views
  // in our Data Package Viewer
  "views" : [
    {
      ... see below ...
    }
  ],
  // you can add your own attributes to a datapackage.json, too
  "my-own-attribute": "data-packages-are-awesome"
}
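
A minimal Python sketch of how one resource entry like the ones above might be filled in from a file, assuming the file is available locally when the descriptor is generated. The function name `describe_file` and the `example.org` URL are hypothetical; only the field names (`name`, `path`, `description`, `format`, `bytes`, `hash`) come from the example above.

```python
import hashlib
import os
import tempfile


def describe_file(local_path, public_url, description=""):
    """Build one resource entry with the file size and MD5 checksum."""
    md5 = hashlib.md5()
    with open(local_path, "rb") as handle:
        # Read in 1 MiB chunks so that large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)
    file_name = os.path.basename(local_path)
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    return {
        "name": file_name.lower(),
        "path": public_url,
        "description": description,
        "format": extension,
        "bytes": os.path.getsize(local_path),
        "hash": md5.hexdigest(),
    }


# Demonstration with a throwaway file; a real run would walk the dataset's files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "file_name.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    entry = describe_file(
        demo_path,
        "https://example.org/pub/file_name.csv",  # hypothetical URL
        description="add the file description here",
    )

print(entry)
```

Reusing the md5sum values already stored in the database, rather than recomputing them, would likely be preferable where they exist; the hashing here only keeps the sketch self-contained.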

rgaiacs commented 2 years ago

The only thing that I don't see a natural place for is the links to externally hosted associated data, e.g. BioProject accessions, manuscript links, ProteomeXchange, EGA, etc., but presumably we can add our own attributes to accommodate those.

Yes, you can add extra attributes.
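
For example, the external links could be carried in a custom top-level attribute. The sketch below is only one possible shape: the attribute name `gigadb_external_links`, its `type`/`identifier` keys, and the accession value are placeholders invented for illustration, not part of the Data Package spec or of GigaDB.

```python
import json


def add_external_links(descriptor, links):
    """Return a copy of the descriptor with a custom external-links attribute."""
    extended = dict(descriptor)
    # "gigadb_external_links" is a hypothetical attribute name, not a spec field.
    extended["gigadb_external_links"] = [
        {"type": link_type, "identifier": identifier}
        for link_type, identifier in links
    ]
    return extended


base = {"name": "example-dataset", "resources": []}
extended = add_external_links(
    base,
    [("BioProject", "PRJNA000000")],  # placeholder accession
)
print(json.dumps(extended, indent=2))
```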