Open timodonnell opened 8 years ago
Reached out again - there are new data submissions to http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=phs000159 but still no clear IDs tracking it back to the paper data. Will update if we hear back.
Very cool! Are you able to download?
Have to re-learn how this NCBI download system works, but hopefully!
Here's a link to the data on dbGaP: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000159.v8.p4
Going to use Arun's notes to see if I can get this data today.
Ah, http://aml31.genome.wustl.edu is a site the authors put together to describe this data. On this page they link to an R package that they used for benchmarking.
After some digging it appears that limiting files to those with a submitted subject id of 452198 will get us most of what we want (45/52 files in the spreadsheet, I think). Here's what's missing:
Some of these missing files appear to be available as supplementary information for the paper.
Update: I've got the first of these files downloading! I'm going to grab some dinner then come back to this project to see if I can download the rest of these files overnight.
Here's a spreadsheet version of the run table with new columns to track the progress getting these files onto HDFS under /datasets/aml31
My sequence of commands:
~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/prefetch --max-size 1146790000 SRR2470200
~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/sam-dump SRR2470200.sra | /hpc/users/willir31/bin/samtools view -bS - > SRR2470200.bam
hadoop fs -copyFromLocal SRR2470200.bam /datasets/aml31
Wrote a little bash script that seems to be working, so I'm going to put it under nohup and hopefully we'll have all these files on the cluster soon.
#!/bin/bash
runs="SRR2170057
SRR2177298
SRR2177289"
for run in $runs
do
    ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/prefetch --max-size 1146790000 "$run"
    ~ahujaa01/sratoolkit.2.5.4-1-centos_linux64/bin/sam-dump "$run".sra | /hpc/users/willir31/bin/samtools view -bS - > "$run".bam
    hadoop fs -copyFromLocal "$run".bam /datasets/aml31
    rm -rf "$run"*
done
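One weakness of the loop above is that it deletes the local files even when a download or conversion step failed. A more defensive variant, as a rough sketch only: the function name `fetch_run` and the `PREFETCH`, `SAM_DUMP`, `SAMTOOLS`, `HADOOP` and `HDFS_DIR` variables are made up for illustration, and the tools are assumed to be on `PATH` rather than at the absolute paths used in this thread.

```bash
#!/bin/bash
# Without pipefail, a sam-dump failure is hidden behind samtools' exit status.
set -o pipefail

# Hypothetical knobs; point these at the real binaries as needed.
PREFETCH=${PREFETCH:-prefetch}
SAM_DUMP=${SAM_DUMP:-sam-dump}
SAMTOOLS=${SAMTOOLS:-samtools}
HADOOP=${HADOOP:-hadoop}
HDFS_DIR=${HDFS_DIR:-/datasets/aml31}

# Download one run, convert it to BAM, and copy the BAM to HDFS.
# Returns non-zero at the first failing step and leaves local files in place,
# so a failed run can be inspected or retried.
fetch_run() {
    local run=$1
    "$PREFETCH" --max-size 1146790000 "$run" || return 1
    "$SAM_DUMP" "$run.sra" | "$SAMTOOLS" view -bS - > "$run.bam" || return 1
    "$HADOOP" fs -copyFromLocal "$run.bam" "$HDFS_DIR" || return 1
    # Only clean up once the copy to HDFS has succeeded.
    rm -rf "$run"*
}
```

Inside the existing loop this would be called as `fetch_run "$run" || echo "$run failed" >&2`, so one bad run does not stop the rest.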
Noting some of our download problems at https://github.com/ncbi/sra-tools/issues/23#issuecomment-243601994 and getting some quick responses
Update: 22 files on HDFS, 7 of which look busted in some way. 23 files left to go. I'm downloading these files starting from the smallest to the largest, so the time taken for the remaining half of the files will likely be several days.
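For reference, the smallest-to-largest ordering can be produced from the run table with a one-liner. This is only a sketch: it assumes the table has been exported as a header-less two-column CSV of `accession,size_in_mb`, which is a hypothetical layout and not necessarily what the spreadsheet looks like, and `runs_smallest_first` is an illustrative name.

```bash
#!/bin/bash
# Print run accessions ordered from smallest to largest.
# Assumes (hypothetically) a header-less CSV: accession,size_in_mb
runs_smallest_first() {
    sort -t, -k2,2n "$1" | cut -d, -f1
}
```

The output can be fed straight into the download loop, e.g. `runs=$(runs_smallest_first run_table.csv)`, where `run_table.csv` is a placeholder file name.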
We appear to be the first lab to try to download these files, so we're finding many bugs on SRA's side. Kinda cool that we'll be the first to work with this data outside of WashU though.
I've finally re-started these downloads as the guy from SRA says he's repaired all of the files.
@arahuja okay all of the BAMs that I pulled down from SRA are on HDFS in /datasets/aml31. Unfortunately many of them appear to be truncated for whatever reason. Nicolas Robine from NYGC claims he just pulled down the FASTQ files, but I didn't see them, and dbGaP appears to be down now, so I'll look for them tomorrow.
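One cheap way to spot truncated BAMs before copying them to HDFS is to check for the 28-byte empty BGZF block that BAM writers put at the end of a complete file. The sketch below does that with standard tools only; `bam_has_eof_marker` is an illustrative name, and it has to run on a local copy of the file. A missing marker strongly suggests truncation, but a present marker does not prove the contents are intact.

```bash
#!/bin/bash
# Hex of the 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF_HEX="1f8b0804""00000000""00ff""0600""4243""0200""1b00""0300""00000000""00000000"

# Succeeds if the file ends with the BGZF EOF marker, fails otherwise.
bam_has_eof_marker() {
    local tail_hex
    # Take the last 28 bytes and render them as one lowercase hex string.
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

For example: `for f in *.bam; do bam_has_eof_marker "$f" || echo "$f looks truncated"; done`. Newer samtools releases also ship a `samtools quickcheck` command that performs a similar check, if the installed version has it.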
Ugh think I should have been using fastq-dump instead of sam-dump on these files...
Okay rerunning over all files to get .sra and .fastq files in addition to the (frequently truncated) BAMs.
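A rough sketch of the FASTQ step for one run, keeping the `.sra` file next to the output. Assumptions: `fastq-dump` is on `PATH`, the `.sra` file is in the current directory, the reads are paired (hence `--split-files`), and gzipped output is wanted (`--gzip`). The `FASTQ_DUMP` variable and the `sra_to_fastq` name are made up for illustration.

```bash
#!/bin/bash
# Hypothetical knob; point this at the real fastq-dump binary as needed.
FASTQ_DUMP=${FASTQ_DUMP:-fastq-dump}

# Convert a local .sra file to gzipped FASTQ without deleting the .sra.
sra_to_fastq() {
    local run=$1
    # Refuse to run on a missing or empty download.
    [ -s "$run.sra" ] || { echo "missing or empty $run.sra" >&2; return 1; }
    "$FASTQ_DUMP" --split-files --gzip "$run.sra"
}
```

Drop `--split-files` if the runs turn out to be single-end.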
http://www.cell.com/cell-systems/abstract/S2405-4712(15)00113-1
@arahuja looked into this a few months ago and said it still wasn't in dbGaP