hammerlab / variant-calling-benchmarks

Automated and curated variant calling benchmarks for Guacamole
Apache License 2.0

Download AML31 data from "Optimizing Cancer Genome Sequencing and Analysis" #5

Open timodonnell opened 8 years ago

timodonnell commented 8 years ago

http://www.cell.com/cell-systems/abstract/S2405-4712(15)00113-1

@arahuja looked into this a few months ago and said it still wasn't in dbGaP

arahuja commented 8 years ago

Reached out again - there are new data submissions to http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=phs000159 but still no clear IDs tracking it back to the paper data. Will update if we hear back.

arahuja commented 7 years ago

Available!

https://twitter.com/malachigriffith/status/761600141170180097

timodonnell commented 7 years ago

Very cool! Are you able to download?

arahuja commented 7 years ago

Have to re-learn how this NCBI download system works, but hopefully!

hammer commented 7 years ago

Here's a link to the data on dbGaP: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000159.v8.p4

Going to use Arun's notes to see if I can get this data today.

Ah, http://aml31.genome.wustl.edu is a site the authors put together to describe this data. On this page they link to an R package that they used for benchmarking.

After some digging it appears that limiting files to those with a submitted subject id of 452198 will get us most of what we want (45/52 files in the spreadsheet, I think). Here's what's missing:

Some of these missing files appear to be available as supplementary information for the paper.

hammer commented 7 years ago

Update: I've got the first of these files downloading! I'm going to grab some dinner then come back to this project to see if I can download the rest of these files overnight.

hammer commented 7 years ago

Here's a spreadsheet version of the run table with new columns to track the progress getting these files onto HDFS under /datasets/aml31

hammer commented 7 years ago

My sequence of commands:

  1. ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/prefetch --max-size 1146790000 SRR2470200
  2. ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/sam-dump SRR2470200.sra | /hpc/users/willir31/bin/samtools view -bS - > SRR2470200.bam
  3. hadoop fs -copyFromLocal SRR2470200.bam /datasets/aml31

hammer commented 7 years ago

Wrote a little bash script that seems to be working, so I'm going to put it under nohup and hopefully we'll have all these files on the cluster soon.

#!/bin/bash

runs="SRR2170057
SRR2177298
SRR2177289"

for run in $runs
do
  ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/prefetch --max-size 1146790000 "$run"
  ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/sam-dump "$run".sra | /hpc/users/willir31/bin/samtools view -bS - > "$run".bam
  hadoop fs -copyFromLocal "$run".bam /datasets/aml31
  rm -rf "$run"*
done
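
One weakness of a loop like this is that the local files are deleted even when an earlier step failed, and a failure of sam-dump in the middle of the pipe goes unnoticed because the pipe's exit status is that of samtools. A minimal sketch of a stricter variant follows, assuming bash. The function name `process_run` and the `PREFETCH`, `SAM_DUMP`, `SAMTOOLS`, `HADOOP` and `HDFS_DIR` variables are hypothetical names introduced here, and the defaults assume the tools are on `PATH` rather than at the paths used above.

```shell
#!/bin/bash
# Sketch only: tool locations are assumptions; override them via the environment.
PREFETCH=${PREFETCH:-prefetch}
SAM_DUMP=${SAM_DUMP:-sam-dump}
SAMTOOLS=${SAMTOOLS:-samtools}
HADOOP=${HADOOP:-hadoop}
HDFS_DIR=${HDFS_DIR:-/datasets/aml31}

# Download one run, convert it to BAM, and copy the BAM to HDFS.
# Local files are removed only if every step succeeded; otherwise the
# function returns non-zero and leaves them in place for inspection.
process_run() {
  local run=$1
  (
    set -o pipefail   # a failing sam-dump now fails the whole pipe
    "$PREFETCH" --max-size 1146790000 "$run" &&
    "$SAM_DUMP" "$run".sra | "$SAMTOOLS" view -bS - > "$run".bam &&
    "$HADOOP" fs -copyFromLocal "$run".bam "$HDFS_DIR" &&
    rm -rf "$run"*
  )
}
```

Inside the original loop, the three commands and the `rm` would then become a single call such as `process_run "$run" || echo "failed: $run" >&2`, so that failed runs are reported and their partial files are kept.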

hammer commented 7 years ago

Noting some of our download problems at https://github.com/ncbi/sra-tools/issues/23#issuecomment-243601994 and getting some quick responses

hammer commented 7 years ago

Update: 22 files on HDFS, 7 of which look busted in some way. 23 files left to go. I'm downloading these files starting from the smallest and working up to the largest, so the remaining half of the files will likely take several days.
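
Ordering the downloads from smallest to largest can be sketched as follows. This assumes a tab-separated run table with no header, the run accession in column 1 and a numeric size in column 2. That layout and the function name `runs_by_size` are assumptions for illustration, not the actual spreadsheet format.

```shell
# Print run accessions ordered from smallest to largest.
# Assumed input on stdin: TSV, no header, columns: accession <TAB> size.
runs_by_size() {
  sort -t "$(printf '\t')" -k2,2n | cut -f1
}
```

The output is one accession per line, which can be fed to a download loop with something like `runs=$(runs_by_size < run_table.tsv)`, where `run_table.tsv` is a hypothetical file name.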

We appear to be the first lab to try to download these files, so we're finding many bugs on SRA's side. Kinda cool that we'll be the first to work with this data outside of WashU though.

hammer commented 7 years ago

I've finally re-started these downloads as the guy from SRA says he's repaired all of the files.

hammer commented 7 years ago

@arahuja okay all of the BAMs that I pulled down from SRA are on HDFS in /datasets/aml31. Unfortunately many of them appear to be truncated for whatever reason. Nicolas Robine from NYGC claims he just pulled down the FASTQ files, but I didn't see them, and dbGaP appears to be down now, so I'll look for them tomorrow.
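
A quick way to spot truncated BAMs is to check for the end-of-file marker. The SAM/BAM specification defines a fixed 28-byte empty BGZF block that a complete BAM file ends with, so a file whose last 28 bytes differ was most likely cut short. The sketch below tests only for that marker; the function name `has_bgzf_eof` is made up for this example, and the presence of the marker does not prove the rest of the file is intact. Newer samtools releases offer `samtools quickcheck` for a similar check.

```shell
# The 28-byte empty BGZF block that terminates a complete BAM file
# (from the SAM/BAM specification), written as hex.
BGZF_EOF=1f8b08040000000000ff0600424302001b0003000000000000000000

# Succeeds if the file given as $1 ends with the BGZF EOF block.
has_bgzf_eof() {
  local tail_hex
  # Dump the last 28 bytes as hex and strip od's spaces and newlines.
  tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
  [ "$tail_hex" = "$BGZF_EOF" ]
}
```

For example, `for f in *.bam; do has_bgzf_eof "$f" || echo "truncated? $f"; done` would list the local BAMs that lack the marker.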

hammer commented 7 years ago

Ugh, I think I should have been using fastq-dump instead of sam-dump on these files...

hammer commented 7 years ago

Okay rerunning over all files to get .sra and .fastq files in addition to the (frequently truncated) BAMs.
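
The FASTQ conversion step might look roughly like the sketch below. It assumes the `.sra` file has already been fetched into the working directory and that paired reads should go to separate gzipped files. The wrapper name `sra_to_fastq` and the `FASTQ_DUMP` variable are hypothetical, and the exact options used for the rerun are not stated in this thread.

```shell
# Sketch only: tool location is an assumption; override via the environment.
FASTQ_DUMP=${FASTQ_DUMP:-fastq-dump}

# Convert a local .sra file to gzipped FASTQ, one output file per read mate.
# $1 is the run accession, without the .sra extension.
sra_to_fastq() {
  "$FASTQ_DUMP" --split-files --gzip "$1".sra
}
```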