opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Implement new data downloads page #1461

Closed andrewhercules closed 3 years ago

andrewhercules commented 3 years ago

As part of #1411, we will implement a new data downloads page that allows users to download a larger list of files. The implementation is based on each data file being included in a JSONlines file that will be produced by the ETL pipelines.

Can we please implement a new data downloads page based on the following specification (v2.4)?

User visits /downloads/data page

Data Download Page Spec - version 2

At the top of the page, please display the following text:

The Open Targets Platform is committed to open data and open access research and all of our data is publicly available for download and can be used for academic or commercial purposes. Please see our Licence documentation for more information.

Current data version: 21.04

Access archived datasets via FTP

Please link "Licence documentation" to https://platform-docs.opentargets.org/licence and please link "FTP" to http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/.

The data version is available by calling the GraphQL meta endpoint and requesting dataVersion.year and dataVersion.month (e.g. sample query returning 21.02).
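As a rough sketch of the step above, the snippet below shows one way the version label could be assembled from the meta response. The query text and the `data.meta.dataVersion` response shape are assumptions based on the field names mentioned in this issue, and `format_data_version` is a hypothetical helper, not an existing function in the codebase.

```python
# Query text is an assumption built from the field names in the spec above.
META_QUERY = """
query {
  meta {
    dataVersion {
      year
      month
    }
  }
}
"""


def format_data_version(meta_response: dict) -> str:
    """Return a 'YY.MM' label from a GraphQL meta response (hypothetical helper)."""
    data_version = meta_response["data"]["meta"]["dataVersion"]
    # Zero-pad so the label is stable whether the API returns strings or integers.
    year = str(data_version["year"]).zfill(2)
    month = str(data_version["month"]).zfill(2)
    return f"{year}.{month}"


# Sample response shaped like a standard GraphQL envelope (illustrative data only).
sample_response = {"data": {"meta": {"dataVersion": {"year": "21", "month": "04"}}}}
print(format_data_version(sample_response))
```

Zero-padding is a defensive choice here, since the spec does not say whether `year` and `month` arrive as strings or numbers.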

The list of datasets is available in a JSONlines file based on the output of the ETL pipeline - list-of-datasets.json. The dataset labels, descriptions, show/hide status, and order are available in a JSON file - dataset-mapping-file.json. Both files share the same id value, so the id of an entry in list-of-datasets.json identifies the corresponding entry in dataset-mapping-file.json.

json-files-for-data-downloads-page.zip

Please integrate both JSON files and show them in a data table (include search and show more rows functionality).

Within the dataset-mapping-file.json file, please use the include_in_fe boolean to determine if the file should be shown in the data table and use the order to determine the order the files should be shown.

| Table column | Value | Example |
| --- | --- | --- |
| Dataset | nice_name from dataset-mapping-file.json | Associations - direct (overall score only) |
| Description | description from dataset-mapping-file.json | Overall scores for direct target-disease associations |
| Formats | resource.format from list-of-datasets.json | json |
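A minimal sketch of how the two files could be joined to produce the table rows described above. It assumes that each line of list-of-datasets.json carries a top-level `id` and a nested `resource` object, and that dataset-mapping-file.json is a JSON array of objects; neither structure is confirmed in this issue. `build_table_rows` and all sample values other than the first dataset's label and description are hypothetical.

```python
from __future__ import annotations

import json


def build_table_rows(datasets_jsonl: str, mapping_json: str) -> list[dict]:
    """Join the two files on id and return rows for the data table."""
    # Collect the formats available for each dataset id (one JSON object per line).
    formats_by_id: dict[str, list[str]] = {}
    for line in datasets_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        formats = formats_by_id.setdefault(record["id"], [])
        fmt = record["resource"]["format"]
        if fmt not in formats:
            formats.append(fmt)

    # Keep entries flagged with include_in_fe that have files, sorted by order.
    mapping = json.loads(mapping_json)
    visible = [
        entry for entry in mapping
        if entry["include_in_fe"] and entry["id"] in formats_by_id
    ]
    visible.sort(key=lambda entry: entry["order"])

    return [
        {
            "Dataset": entry["nice_name"],
            "Description": entry["description"],
            "Formats": formats_by_id[entry["id"]],
        }
        for entry in visible
    ]


# Illustrative sample data; the real files are in the attached zip.
sample_datasets = "\n".join([
    '{"id": "diseases", "resource": {"format": "json", "path": "diseases"}}',
    '{"id": "diseases", "resource": {"format": "parquet", "path": "diseases"}}',
    '{"id": "associationByOverallDirect", "resource": {"format": "json", "path": "associationByOverallDirect"}}',
    '{"id": "internalOnly", "resource": {"format": "json", "path": "internalOnly"}}',
])
sample_mapping = json.dumps([
    {"id": "diseases", "nice_name": "Diseases", "description": "Sample description",
     "include_in_fe": True, "order": 2},
    {"id": "associationByOverallDirect",
     "nice_name": "Associations - direct (overall score only)",
     "description": "Overall scores for direct target-disease associations",
     "include_in_fe": True, "order": 1},
    {"id": "internalOnly", "nice_name": "Internal only", "description": "Hidden sample entry",
     "include_in_fe": False, "order": 0},
])

rows = build_table_rows(sample_datasets, sample_mapping)
for row in rows:
    print(row)
```

Grouping formats by id before joining means each dataset appears once in the table, with one chip per format.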

User clicks on any of the chips in the "Format(s)" column

Data Download Page Spec - version 2 (2)

Please use the drawer component to open up a view with tabs for each format. Within each tab, please provide the relevant URLs for FTP and Google Cloud Access and the Font-Awesome copy icon that copies the URL to the user's clipboard.

FTP

  - baseUrl: ftp.ebi.ac.uk/pub/databases/opentargets/platform/
  - dataVersion.year: use GraphQL API meta endpoint to retrieve dataVersion.year
  - dataVersion.month: use GraphQL API meta endpoint to retrieve dataVersion.month
  - filePath: use resource.path value

The format of the URL should be baseUrl + dataVersion.year + . + dataVersion.month + /output/ETL/ + filePath

For example, ftp.ebi.ac.uk/pub/databases/opentargets/platform/21.04/output/ETL/associationByOverallDirect

Google Cloud

  - baseUrl: gs://open-targets-data-releases/
  - dataVersion.year: use GraphQL API meta endpoint to retrieve dataVersion.year
  - dataVersion.month: use GraphQL API meta endpoint to retrieve dataVersion.month
  - filePath: use resource.path value

The format of the URL should be gsutil ls + baseUrl + dataVersion.year + . + dataVersion.month + /output/ETL/ + filePath

For example, gsutil ls gs://open-targets-data-releases/21.04/output/ETL/associationByOverallDirect
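The two URL formats above could be assembled roughly as follows. The base URLs and the `/output/ETL/` segment come from this spec; the function names are hypothetical, and the sketch assumes `year` and `month` are already zero-padded strings.

```python
FTP_BASE_URL = "ftp.ebi.ac.uk/pub/databases/opentargets/platform/"
GCS_BASE_URL = "gs://open-targets-data-releases/"


def release_path(year: str, month: str, file_path: str) -> str:
    """Shared suffix: dataVersion.year + . + dataVersion.month + /output/ETL/ + filePath."""
    return f"{year}.{month}/output/ETL/{file_path}"


def ftp_url(year: str, month: str, file_path: str) -> str:
    """FTP location of a dataset (hypothetical helper)."""
    return FTP_BASE_URL + release_path(year, month, file_path)


def gcs_command(year: str, month: str, file_path: str) -> str:
    """gsutil ls command for the same dataset on Google Cloud (hypothetical helper)."""
    return "gsutil ls " + GCS_BASE_URL + release_path(year, month, file_path)


print(ftp_url("21", "04", "associationByOverallDirect"))
print(gcs_command("21", "04", "associationByOverallDirect"))
```

Both printed strings match the examples given in the spec for the 21.04 release.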

andrewhercules commented 3 years ago

@mirandaio we have updated the list of datasets and the dataset mapping file - the structure remains the same but we have updated some of the fields (e.g. descriptions). Please use the files below:

json-files-for-data-downloads-page-version-2.zip

d0choa commented 3 years ago

Some decisions from the meeting today:

  1. Contact BE to check metadata containing the files
  2. Display format(s) JSON and Parquet as 2 chips
  3. When opening the drawer: no tabs would be needed, as the drawer would be specific to JSON or Parquet
  4. The content of the widget can contain 3 sections:
     1. FTP link - This will contain a link in the form http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/21.02/output/...
     2. wget command - This will contain the command wget -r -np -nH --cut-dirs 7 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/21.02/output/ETL_parquet/diseases
     3. Google Cloud Platform (paywalled) - This will contain the command gsutil -m cp -r gs://open-targets-data-releases/21.04/output/etl-parquet/diseases .

andrewhercules commented 3 years ago

Also, please update the URL of the downloads page to /downloads

andrewhercules commented 3 years ago

Looks great @mirandaio! 👍 Can we please make a few changes and then open a PR?

For sample scripts to download and parse datasets using Python or R, please visit our Data Downloads documentation

| Option | baseURL |
| --- | --- |
| rsync | rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/ |
| wget | wget --recursive --no-parent --no-host-directories --cut-dirs 7 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/ |
| FTP | ftp.ebi.ac.uk/pub/databases/opentargets/platform/ |
| Google Cloud (paywalled) | gsutil -m cp -r gs://open-targets-data-releases/ |

The datasetFormat value will be either json or parquet depending on which chip is selected.

For rsync, wget, and Google Cloud, please add a space and full stop at the end of the URL .

For example, the rsync command to access 21.04 disease data in Parquet would be rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/etl/output/parquet/diseases .
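One possible way to generate the four access strings for the drawer is sketched below. The base commands are taken from the table above, and the path layout (`version/etl/output/datasetFormat/path`) is copied from the rsync example in this comment; other comments in this thread show different layouts, so treat `PATH_TEMPLATE` as an assumption to confirm. `build_access_options` is a hypothetical helper.

```python
# Base strings from the Option / baseURL table.
BASE_URLS = {
    "rsync": "rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/",
    "wget": "wget --recursive --no-parent --no-host-directories --cut-dirs 7 "
            "ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/",
    "FTP": "ftp.ebi.ac.uk/pub/databases/opentargets/platform/",
    "Google Cloud (paywalled)": "gsutil -m cp -r gs://open-targets-data-releases/",
}

# Options that get a trailing space and full stop, per the request above.
NEEDS_TRAILING_DOT = {"rsync", "wget", "Google Cloud (paywalled)"}

# Assumption: layout copied from the rsync example; adjust if the release layout differs.
PATH_TEMPLATE = "{version}/etl/output/{dataset_format}/{path}"


def build_access_options(version: str, dataset_format: str, path: str) -> dict:
    """Return one string per access option for the selected format chip."""
    if dataset_format not in ("json", "parquet"):
        raise ValueError(f"unexpected datasetFormat: {dataset_format}")
    suffix = PATH_TEMPLATE.format(
        version=version, dataset_format=dataset_format, path=path
    )
    options = {}
    for name, base in BASE_URLS.items():
        value = base + suffix
        if name in NEEDS_TRAILING_DOT:
            value += " ."
        options[name] = value
    return options


commands = build_access_options("21.04", "parquet", "diseases")
for name, value in commands.items():
    print(f"{name}: {value}")
```

Keeping the base strings in a single mapping makes it straightforward to change the display order of the options later without touching the string-building logic.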

d0choa commented 3 years ago

This looks great. I would reorder the last table as follows:

  1. FTP
  2. rsync
  3. wget
  4. Google Cloud (paywalled)

I think this order better represents the number of users that might be interested in each method.