hammerlab / pygdc

Python API for Genomic Data Commons
Apache License 2.0

Allow direct download of files (VCFs) #2

Open arahuja opened 8 years ago

arahuja commented 8 years ago

Should be able to use what @jburos has here: https://github.com/jburos/tcga-blca/blob/master/query_tcga/query_tcga.py#L246-L302

jburos commented 8 years ago

Thanks for making this an issue! It makes this easier to discuss. I made the repo public so that the links would work better.

In hopes of fostering discussion, I wanted to note some design decisions I've been thinking about.

Enable easy switching between gdc-client and the API

  1. The implementation in tcga-blca has the concept of a DATA_DIR: a root directory into which files are downloaded. It currently mirrors the directory structure imposed by gdc-client.
    • Ideally, this structure should be the same irrespective of whether the user downloaded via the API or gdc-client.
    • Also, this directory structure may or may not be conducive to Cohorts integration.
    • Presumably, gdc-client could change this directory structure.
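To make the DATA_DIR idea concrete, here is a minimal sketch of keeping the on-disk layout in one place, so that both download paths (API and gdc-client) resolve to the same location and a future layout change only touches one function. It assumes the layout is `<DATA_DIR>/<file_id>/<file_name>`; the helper names `local_path` and `is_downloaded` are hypothetical and not part of pygdc or tcga-blca.

```python
from pathlib import Path


def local_path(data_dir, file_id, file_name):
    """Return where a file should live under data_dir.

    Assumption: files are stored as <data_dir>/<file_id>/<file_name>.
    If the layout changes, this is the only function that needs updating.
    """
    return Path(data_dir) / file_id / file_name


def is_downloaded(data_dir, file_id, file_name):
    """True if the file already exists at its expected location."""
    return local_path(data_dir, file_id, file_name).is_file()
```

With something like this, code that consumes the files never needs to know which tool fetched them.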

Authorization

  1. We need a process for linking to / referencing the authorization token file.
  2. It would be nice if nothing broke when the auth file is not present.
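One way the "nothing breaks without a token" behavior could look is sketched below: a missing or empty token file simply yields no auth header, so open-access requests still go through. The function names `load_token` and `auth_headers` are hypothetical, and the `X-Auth-Token` header name reflects my understanding of what the GDC API expects, so it is worth checking against the GDC documentation.

```python
from pathlib import Path


def load_token(token_path=None):
    """Return the token string, or None if no usable token file exists."""
    if token_path is None:
        return None
    path = Path(token_path).expanduser()
    if not path.is_file():
        return None
    token = path.read_text().strip()
    return token or None  # treat an empty file like a missing one


def auth_headers(token_path=None):
    """Build request headers; empty when no token is available.

    Assumption: the API reads the token from an 'X-Auth-Token' header.
    """
    token = load_token(token_path)
    return {"X-Auth-Token": token} if token else {}
```

The returned dict can be passed as-is to whatever HTTP client does the request; controlled-access files would then fail at the server rather than in local code.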

The desired user interface for downloading files

What should the command for downloading files look like? Should this be get_vcfs, analogous to get_cases, with the download happening behind the scenes? Or should the download step be handled more explicitly? We should probably also reconcile the current approaches to filtering in order to support this.
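To compare the two options side by side, here is a rough sketch in which an explicit `download_files` step does the work and `get_vcfs` is a thin convenience wrapper on top of it. Everything here is hypothetical: the signatures, the record keys (`file_id`, `file_name`, `data_format`), and the `fetch` callable, which stands in for either the API or gdc-client so the sketch runs without network access.

```python
from pathlib import Path


def download_files(file_records, data_dir, fetch):
    """Explicit step: download each record, skipping files already on disk.

    file_records: iterable of dicts with 'file_id' and 'file_name' keys.
    fetch: callable(file_id) -> bytes; a stand-in for the real downloader.
    Returns the list of local paths, in the order of file_records.
    """
    paths = []
    for rec in file_records:
        dest = Path(data_dir) / rec["file_id"] / rec["file_name"]
        if not dest.is_file():
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(fetch(rec["file_id"]))
        paths.append(dest)
    return paths


def get_vcfs(file_records, data_dir, fetch):
    """Implicit variant: keep only VCF records, then download them.

    Assumption: each record carries a 'data_format' key equal to 'VCF'.
    """
    vcfs = [r for r in file_records if r.get("data_format") == "VCF"]
    return download_files(vcfs, data_dir, fetch)
```

Layering it this way would let the explicit call remain available for users who want control over when downloads happen, while `get_vcfs` stays a one-liner for the common case.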

Extensibility

Should the download of VCFs differ from that of other sample-data files (e.g., raw sequencing data)?