rbloom5 / ImmuneRep


Data retrieval #1

Closed rzeller closed 9 years ago

rzeller commented 9 years ago

Figure out how to download the data and convert it into a Python-readable format.

rzeller commented 9 years ago

I've figured out how to download the data, convert it from .sra to .fasta, and read it into Biopython using SeqIO. Instructions are now on the data retrieval page on the wiki.

rzeller commented 9 years ago

I'm not sure which datasets to download. Bloom, if you figure that part out, I can write a bash script that downloads all the data and converts it into .fasta format.

rbloom5 commented 9 years ago

Cool, can you convert it to FASTQ instead? That format includes the quality of each base read, which will be useful in my step of the pipeline. I think SeqIO should support it: http://biopython.org/wiki/SeqIO
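
To illustrate what the per-base quality information looks like, here is a minimal standard-library sketch that decodes the quality line of a FASTQ record. It assumes plain four-line records and Phred+33 encoding; the function name `read_fastq` and the example read are made up for illustration. With Biopython, the same scores should be available from `SeqIO.parse(handle, "fastq")` via `record.letter_annotations["phred_quality"]`.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip()
        # Assumption: Phred+33, so each score is the character's ASCII code minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Made-up record, not taken from the SRA datasets in this thread.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

Each base gets one score, so low-quality positions can be filtered or trimmed before later pipeline steps.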

rzeller commented 9 years ago

Yeah, no problem. I just changed the instructions to dump and load .fastq files.

Let me know which datasets we should be working with.

rbloom5 commented 9 years ago

Here are some MS patients: http://www.ncbi.nlm.nih.gov/sra/SRX551536[accn]

http://www.ncbi.nlm.nih.gov/sra/SRX553103[accn] http://www.ncbi.nlm.nih.gov/sra/SRX553102[accn] http://www.ncbi.nlm.nih.gov/sra/SRX553101[accn]

http://www.ncbi.nlm.nih.gov/sra/SRX553100[accn] http://www.ncbi.nlm.nih.gov/sra/SRX553099[accn] http://www.ncbi.nlm.nih.gov/sra/SRX553098[accn]

http://www.ncbi.nlm.nih.gov/sra/SRX552923[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552922[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552921[accn]

http://www.ncbi.nlm.nih.gov/sra/SRX552916[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552915[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552914[accn]

http://www.ncbi.nlm.nih.gov/sra/SRX552900[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552899[accn] http://www.ncbi.nlm.nih.gov/sra/SRX552898[accn]

Looking for controls to compare to now

On Thu, Dec 4, 2014 at 3:45 PM, Robby Zeller notifications@github.com wrote:

Yeah, no problem. I just changed the instructions to dump and load .fastq files.

Let me know which datasets we should be working with.

rbloom5 commented 9 years ago

Also, it looks like most of the data from NCBI is mirrored here: http://www.ebi.ac.uk/ena. They have it in different (more useful) formats. You can just search by accession number.

rzeller commented 9 years ago

I've added getdata.py to the data-retrieval branch. It will download all the .sra files for the links above and convert them into the .fastq format. For information on how to use it, see the data retrieval page on the wiki. I'm going to close this issue. Let me know if getdata.py works for you guys.
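
For readers without access to the branch, a download-and-convert loop of this kind might look roughly like the sketch below. This is not the actual getdata.py. It assumes the SRA Toolkit's `fastq-dump` is on the PATH and that it is given run accessions (SRR numbers); the function names and the `runner` parameter are hypothetical.

```python
import subprocess


def fastq_dump_command(run_id, out_dir="fastq"):
    """Build the fastq-dump command that writes run_id as FASTQ into out_dir."""
    return ["fastq-dump", "--outdir", out_dir, run_id]


def convert_all(run_ids, out_dir="fastq", runner=subprocess.run):
    """Run fastq-dump for every run accession and return the ones that finished.

    check=True makes subprocess.run raise on a non-zero exit status, so the loop
    stops at the first accession that fails.
    """
    done = []
    for run_id in run_ids:
        runner(fastq_dump_command(run_id, out_dir), check=True)
        done.append(run_id)
    return done
```

Calling `convert_all(["SRR1383326"])` would then fetch and convert that single run; `runner` is only a parameter so the loop can be exercised without network access.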

rzeller commented 9 years ago

In getdata.py, SRR1383446 was supposed to be SRR1383326. I've updated it in the data-retrieval branch. If you're running the old version of getdata.py, it will throw an error. You can always cut off the beginning of the list to start from where it broke.
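
Cutting off the beginning of the list can be done with a slice. The sketch below assumes the accessions are held in a plain Python list; the helper name and the IDs are placeholders, not the real contents of getdata.py.

```python
def remaining_runs(run_ids, failed_id):
    """Return the sublist starting at the run that failed, so earlier runs are not repeated."""
    return run_ids[run_ids.index(failed_id):]


# Placeholder IDs for illustration only.
all_runs = ["RUN_A", "RUN_B", "RUN_C", "RUN_D"]
todo = remaining_runs(all_runs, "RUN_C")
```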