juliema / aTRAM

BSD 3-Clause "New" or "Revised" License

format_sra creates files and hangs #154

Closed snacktavish closed 9 years ago

snacktavish commented 9 years ago

Running format_sra.pl -input SRR610374.fastq -out my_atram_db, creates a bunch of empty .bucket and .sorted files and then just hangs. Seems to get further if I run it on an sra file, (format_sra.pl -input SRR610374.sra -out my_atram_db), and creates .bucket files that are not empty, but also hangs, and all the .sorted are empty.

In both cases last line of log is: Dividing fasta/fastq file into buckets for sorting.

Advice? Thanks, this is a great idea and I look forward to using it!

juliema commented 9 years ago

Hi Emily,

Quick question: do you have paired-end reads? If so, did you concatenate them together into the SRR610374.fastq file? If not, that is the problem: the program is looking for both mates of each pair.

let me know if that fixes things, if not we will dig further.

thanks!

Julie

On Mon, Mar 30, 2015 at 7:17 AM, Emily Jane McTavish < notifications@github.com> wrote:

Running format_sra.pl -input SRR610374.fastq -out my_atram_db, creates a bunch of empty .bucket and .sorted files and then just hangs. Seems to get further if I run it on an sra file, (format_sra.pl -input SRR610374.sra -out my_atram_db), and creates .bucket files that are not empty, but also hangs, and all the .sorted are empty.

In both cases last line of log is: Dividing fasta/fastq file into buckets for sorting.

Advice? Thanks, this is a great idea and I look forward to using it!

— Reply to this email directly or view it on GitHub https://github.com/juliema/aTRAM/issues/154.

Julie Allen Postdoctoral Researcher Illinois Natural History Survey University of Illinois www.juliamallen.com

snacktavish commented 9 years ago

Yes, that file should have both paired ends. It is just a fastq-dump of an sra from a paired end experiment.

juliema commented 9 years ago

ok will you go into the test folder and run:

perl test_all.pl -debug

this will go through all the steps of aTRAM to make sure it is working properly. Let me know what output you get.

J

snacktavish commented 9 years ago

"All tests successfully passed."

snacktavish commented 9 years ago

Hmmm - maybe it is an issue with this sra file. I haven't figured out what is wrong with it, but format_sra.pl is running fine on a different one.

snacktavish commented 9 years ago

This is the one I couldn't get to run - SRR610374 http://www.ncbi.nlm.nih.gov/sra/SRX202248%5Baccn%5D

juliema commented 9 years ago

hmm yeah that could be. Glad it is working with another file!

will you just do

head SRR610374.fastq

and

tail SRR610374.fastq

and see what it looks like?

even print it here.

Julie

snacktavish commented 9 years ago

$ head SRR610374.fastq
@SRR610374.1 708:2:1:0:185 length=152
NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGACGAAGCTACGATTTCGCAAATCAATTATCTCTTAATATTTAGTTGGTTCCAGTAAACATAATGTTGGGGCACTTTG
+SRR610374.1 708:2:1:0:185 length=152
DKMMXXOXWWUYWWXYVXNVXVMU[[VOSWXNVYP[Y[YRGSWWYXW[SVW[WQUYV[YTYWYYYUU[YZYUY[[VI_bV\bXIV^Xbaab[F]Z_abSOZababaaaaZPabaaYbb_ba^_baaa]_XbSaLPbaa_a_aJJa^a
@SRR610374.2 708:2:1:0:1322 length=152
NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAACAGAAACATTTATTGGTTCAAACTGAAGCAGCTATAAAATATGCTCGTCGAACGGTAATTATCGCTTGGATTCTTC
+SRR610374.2 708:2:1:0:1322 length=152
DNPVUUX[VUWXUW[TVUNYUZZYZTVXXZY[ZXYXXWXQTTUTVYYVPWSXX[QOS[[Y[X[[[YYY[[YXUTBBa\ba^aaZa[_Xa^[^aaaYaaYaaaYaTJV^aa\a^Y\a_WNXY[TVGV[N]aY^]`XJOS_V[YW
@SRR610374.3 708:2:1:0:1333 length=152
NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCCGTTTCGCAGGCAGTAAAAGAGTCACAAGTTGAAACTGGGACTGGTTTAACTGCTTTAATATCGTCAACTAAGCGAT

$ tail SRR610374.fastq
+SRR610374.22668562 708:2:120:1788:1734 length=152
aaaa]^D\a_aaaa[^a]`^^^`][[[XZ_aBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBaaaaaaa\aVBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668563 708:2:120:1788:1774 length=152
GCTACATNAGCAAATAGCGAGAGCTAGAAAAGATCATGCCTACANTACNGNCCNNNNNNNNNNNNATACTGGCAAAAAGACTTATTTAGTCCTNNNTTNNNNNNNNNTNCATTANGNNNNNNCTTACCCGTCTCATCAGTTTTAACTTTATT
+SRR610374.22668563 708:2:120:1788:1774 length=152
b]aaaXDW]aaa][[a`a^S^_]ZW_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBa\U\ba`abaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668564 708:2:120:1788:164 length=152
AGTCATGNAGTTAACTAAAGCTTCGGGTAAAAAACCCATCTTACNAAANTNATNNNNNNNNNNNNCGCCGTCCCGTACATTATTCGTGGAGAANNNCANNNNNNNNNCNCTGCCNANNNNNNCTTGCTCTATGAAGCATTAGGCGCAGNAGT
+SRR610374.22668564 708:2:120:1788:164 length=152
[^a_X_WDWZabU[V[]aa]aa]Oa\WZaXY_`^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBaaUbb[abbaaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

juliema commented 9 years ago

Ok I can see a few things.

The SRR610374.sra file needs to be converted to either a fasta or fastq file, which can be done using the SRA Toolkit.

http://www.ncbi.nlm.nih.gov/books/NBK158900/

With the file SRR610374.fastq - for some reason the names do not indicate which are the mate pairs. usually these files have something like this in the name

@SRR610374.22668564 708:2:120:1788:164 length=152\1

and

@SRR610374.22668564 708:2:120:1788:164 length=152\2

indicating that they are mate pairs.

can you show me the link to where you got this file?

we are getting there!

Julie
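
A quick way to see which headers carry a mate marker is sketched below. It is only an illustration, not part of aTRAM: the function name count_mate_suffixes is made up here, it assumes plain four-line FASTQ records (no wrapped sequence lines), and it assumes the marker is a trailing /1 or /2. Adjust the two patterns if your headers use a different separator.

```shell
#!/bin/sh
# Hypothetical helper (not part of aTRAM): count FASTQ headers by mate suffix.
# Assumes 4-line records, so every header is line 1, 5, 9, ... of the input.
count_mate_suffixes() {
    awk 'NR % 4 == 1 {
             if ($0 ~ /\/1$/)      first++      # header ends in /1
             else if ($0 ~ /\/2$/) second++     # header ends in /2
             else                  unmarked++   # no mate marker found
         }
         END {
             printf "mate1=%d mate2=%d unmarked=%d\n", first, second, unmarked
         }' "$@"
}

# Usage (file name taken from this thread):
#   count_mate_suffixes SRR610374.fastq
```

If every header lands in the unmarked count, the file gives no per-read hint about which reads are mates.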

snacktavish commented 9 years ago

It was from the ncbi link above (http://www.ncbi.nlm.nih.gov/sra/SRX202248%5Baccn%5D), and I used fastq-dump from the sra toolkit to make the fastq file. Thanks for your help,

juliema commented 9 years ago

Hi Emily,

Ok so I downloaded the experiment SRR610374 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser&run=SRR610374

looking at the file I think I can see where the paired end reads are. If we just print out the first few lines, it looks like they are interleaved, so the first read is the first pair and the second read is its mate and so on and so forth? Does that sound right? If so I will write you a little script to format it so aTRAM can read it. I think we will add a step to our software to check for paired-end reads in this format as well.

Julie

@SRR610374.1.1 708:2:1:0:185 length=76

NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGA

@SRR610374.1.2 708:2:1:0:185 length=76

CGAAGCTACGATTTCGCAAATCAATTATCTCTTAATATTTAGTTGGTTCCAGTAAACATAATGTTGGGGCACTTTG

@SRR610374.2.1 708:2:1:0:1322 length=76

NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAA

@SRR610374.2.2 708:2:1:0:1322 length=76

CAGAAACATTTATTGGTTCAAACTGAAGCAGCTATAAAATATGCTCGTCGAACGGTAATTATCGCTTGGATTCTTC

@SRR610374.3.1 708:2:1:0:1333 length=76

NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCC

@SRR610374.3.2 708:2:1:0:1333 length=76

GTTTCGCAGGCAGTAAAAGAGTCACAAGTTGAAACTGGGACTGGTTTAACTGCTTTAATATCGTCAACTAAGCGAT

@SRR610374.4.1 708:2:1:0:1607 length=76

NTTTCCATTGTAGGAGTTCTTACTGTCTTGATGGTGGCAACTTNAGTTCTATCTTTAGATTTTTTTCGTTTGGCTG

@SRR610374.4.2 708:2:1:0:1607 length=76

CCACAATGGAATCATTAATCCCTACTCCGCTCAAGCCTACTGTACCGATAATTGGATTAAAGCCAAAAGGATAGCC

@SRR610374.5.1 708:2:1:0:932 length=76

NGCAAAAAATTGAGTAAATATTTTAACCTCAACATTATCTAAATCTCCATTAATAGTATTTATTTATGAATAGTAA

@SRR610374.5.2 708:2:1:0:932 length=76

TGTACCTTTAACATCAAATCGTTAAGAAAATTGTGATTGGAAGTAACCAAACTCTGCGCTCTATGTCAGCTACTTG

@SRR610374.6.1 708:2:1:0:1135 length=76

NGGCGTGAACAGTAGTATAGGGGACTGGGAGAGAATGACAAATTTAACTTCTAATATTTAACTTTTGACTTATTTC

@SRR610374.6.2 708:2:1:0:1135 length=76

TTCCTGTCGCCAATCCAACAATTACCTAACATGATTTTTTTACTTATCTTTTTCATTAATTAGGGATTAGAGATTG

@SRR610374.7.1 708:2:1:0:1911 length=76

NGACCAAGGAATTAACTCTGCCATAATTACCCATCGGTTTTCTGAGGACAACTTACTTTCAAAGGGAAATTAAAAA

@SRR610374.7.2 708:2:1:0:1911 length=76

CGCAGAAAAAAGTTAAGTAAAAATGGCGAAAAACCCCTAAAATTGGTAAAAGACCCTTGCACTTATCTGAGCAAGA

@SRR610374.8.1 708:2:1:0:766 length=76

NGACTAATTAGATCTAGATCGGGAGCAGTGTCTCCTGGTAATACAAATAAAGCCGCACCATTCGACTCAATAGTTT

@SRR610374.8.2 708:2:1:0:766 length=76

ACTTTTGGGCAATAATTTTCCTGAGATTAGAGTAGTCAACTTATTGAATCTACAACAACAATTCCGTATAGATACA

@SRR610374.9.1 708:2:1:0:1143 length=76

NGAGGATTGCAATACCATTGATTAAGAAATATTCTATAATCCTGTGATTCTAATTTTTCTGCCAGTTTTAAGTCTT

@SRR610374.9.2 708:2:1:0:1143 length=76

ATTTATTTTCCTCATTATTTTAAAAAGGTTATCTTAGAATCTACTTCTCCAGGATTAGCAACTAAAACAGAGCGCG

@SRR610374.10.1 708:2:1:0:1002 length=76

NTTTTGGTATGTAATTAAGCAGCAAGATGGAACTTGCCAAATAGCCGATTTTGATACTCATCAGCCAAAAACATCA

@SRR610374.10.2 708:2:1:0:1002 length=76

GAGGTTTACACTTTCCTGCTCGAATTAAACCTATTTTCTTGGCTATTGCTTCTTGTTCAGTTTCGTCAGCTCCCCC

juliema commented 9 years ago

Hi Emily,

I went ahead and wrote a script to edit the file accordingly. It is in:

https://github.com/juliema/phylogenomic_pipeline/blob/master/formatfastfile.pl

if you download that and run it it will edit your file and print it to a new output file:

usage:

perl formatfastfile.pl inputfile outputfilename

hope this solves your problem. I am putting this as a bug to fix in aTRAM for other types of fasta formatted files. Thanks for your help!

Julie

snacktavish commented 9 years ago

If I run that script on the fastq-dump from the SRA, it (incorrectly) pulls out the first digit of the read number and treats it as the pairing information. It appears that the mate pairs are merged into single reads in the fastq-dump, hence the read length difference between 152 and 76. I can export them as separate reads using fastq-dump --split-files SRR610374.sra, and then append the read information and concatenate them, if that is what is necessary.
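
One way to spot merged mates like this is to tally the read lengths in the file. The helper below is a sketch with a made-up name (read_length_counts), not something shipped with aTRAM or the SRA Toolkit, and it assumes four-line FASTQ records. For this run, a file dominated by length 152 would point to merged mates, while length 76 is what the split files show.

```shell
#!/bin/sh
# Hypothetical helper: print "length count" pairs for the sequence lines
# of a FASTQ file, sorted by length. Assumes 4-line records, so the
# sequence is always the 2nd line of each record.
read_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$@" | sort -n
}

# Usage (file name taken from this thread):
#   read_length_counts SRR610374.fastq
```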

juliema commented 9 years ago

yeah I think that will be necessary. aTRAM takes the mate pairs and stores them differently so it needs to know which are which.

let me know if that works.

Julie

snacktavish commented 9 years ago

Would it be possible for format_sra.pl to accept the paired ends separately? e.g. SRR123_1.fq and SRR123_2.fq

juliema commented 9 years ago

yes definitely, it is just not set up for that right now. I can add a new issue to add that feature in. This might speed things up for people eh?

Right now if you just cat those two files together you should be good to go.

snacktavish commented 9 years ago

Sounds good. I just need to add a \1 (or \2) to the end of each header line, for it to properly recognize the mates, right? Will do and let you know how it goes. Thanks!

juliema commented 9 years ago

yep exactly. thanks!

snacktavish commented 9 years ago

Tried it, same results: still creates files and hangs.

Log file:
150331 11:57:27: Running format_sra.pl -input SRR610374.fastq -out test2_atram_db, v1.01+
150331 11:57:27: SRR610374.fastq is 11559.30 MB; we will make 23 shards.
150331 11:57:27: Dividing fasta/fastq file into buckets for sorting.

To replicate:

a=SRR610374.sra;
stub=${a%.*}; 
fastq-dump --split-files $a;
sed -e 's/length=\([0-9]*\)$/length=\1\\1/' ${stub}_1.fastq > ${stub}c_1.fastq;
sed -e 's/length=\([0-9]*\)$/length=\1\\2/' ${stub}_2.fastq > ${stub}c_2.fastq;
cat ${stub}c_1.fastq ${stub}c_2.fastq > ${stub}.fastq;
./format_sra.pl -input ${stub}.fastq -out test2_atram_db;

juliema commented 9 years ago

ok let me try and see what I get. Do you mind just showing the head of the SRR610374.fastq file?

juliema commented 9 years ago

Hey Emily,

I am still running your script but I found one thing.

the sed lines were not printing out the /1 and /2 at the end.

I am running these lines and then going to check format_sra.pl

change them to this:

sed -E 's/length=([0-9]*)$/length=\1\/1/' ${stub}_1.fastq >${stub}c_1.fastq;
sed -E 's/length=([0-9]*)$/length=\1\/2/' ${stub}_2.fastq >${stub}c_2.fastq;
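
Since sed behaves differently across platforms, a position-based awk version may be easier to share. This is a sketch under stated assumptions, not aTRAM code: the name add_mate_suffix is invented, records are assumed to be exactly four lines, and the suffix is assumed to be /1 or /2. It edits the header by line position rather than by matching length=, so it never touches a sequence or quality line.

```shell
#!/bin/sh
# Hypothetical helper: append /1 or /2 to FASTQ header lines.
#   $1             mate number (1 or 2)
#   remaining args FASTQ files; reads stdin when none are given
# Assumes 4-line records. The "+" line gets the suffix only when it
# repeats the read name, as the fastq-dump output in this thread does.
add_mate_suffix() {
    mate=$1
    shift
    awk -v suffix="/$mate" '
        NR % 4 == 1              { print $0 suffix; next }  # @header line
        NR % 4 == 3 && $0 != "+" { print $0 suffix; next }  # +header line
                                 { print }                  # sequence, quality
    ' "$@"
}

# Usage (file names follow the ones used in this thread):
#   add_mate_suffix 1 SRR610374_1.fastq > SRR610374c_1.fastq
#   add_mate_suffix 2 SRR610374_2.fastq > SRR610374c_2.fastq
```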

juliema commented 9 years ago

Ok I have format_sra.pl running on that file. I have not hit any snags yet and it seems like it is going through fine.

Let me know if this solves your issue so I can close this issue on GitHub.

Thanks for using our program!

Julie

snacktavish commented 9 years ago

Ah, the sed difference is Linux vs. Mac; it was succeeding on my machine. But format_sra.pl is still not running, sadly.

The beginning of my file is:

$ head SRR610374.fastq
@SRR610374.1 708:2:1:0:185 length=76\1
NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGA
+SRR610374.1 708:2:1:0:185 length=76\1
DKMMXXOXWWUYWWXYVXNVXVMU[[VOSWXNVYP[Y[YRGSWWYXW[SVW[WQUYV[YTYWYYYUU[YZYUY[[V
@SRR610374.2 708:2:1:0:1322 length=76\1
NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAA
+SRR610374.2 708:2:1:0:1322 length=76\1
DNPVUUX[VUWXUW[TVUNYUZZYZTVXXZY[ZXYXXWXQTTUTVYYVPWSXX[QOS[[Y[X[[[YYY[[YXUTBB
@SRR610374.3 708:2:1:0:1333 length=76\1
NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCC

and the end:

$ tail SRR610374.fastq
+SRR610374.22668562 708:2:120:1788:1734 length=76\2
aaaaaaa\aVBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668563 708:2:120:1788:1774 length=76\2
AAGACTTATTTAGTCCTNNNTTNNNNNNNNNTNCATTANGNNNNNNCTTACCCGTCTCATCAGTTTTAACTTTATT
+SRR610374.22668563 708:2:120:1788:1774 length=76\2
a\U\ba`abaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668564 708:2:120:1788:164 length=76\2
ACATTATTCGTGGAGAANNNCANNNNNNNNNCNCTGCCNANNNNNNCTTGCTCTATGAAGCATTAGGCGCAGNAGT
+SRR610374.22668564 708:2:120:1788:164 length=76\2
aaUbb[abbaaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

juliema commented 9 years ago

yeah mine just threw an error as well. What is going on with this file I wonder.

let me dig a bit more.

juliema commented 9 years ago

Hi Emily,

Ok so I found one workaround for this dataset. I am still not sure what is going on: it seems to stop adding sequences to the buckets and ends up with empty files, which suggests there is something strange with the file. I have been checking format_sra.pl on a bunch of datasets and it seems to work fine.

However, for now I got this file to work by first turning the .sra file into a fasta file, adding the /1 and /2 onto it, and then running format_sra.pl using only one shard. This will give you the aTRAM database as one single big database; the blast will take a bit longer but it will work.

a=SRR610374.sra;
stub=${a%.*};
./fastq-dump --fasta --split-files $a;
sed -E 's/length=([0-9]*)$/length=\1\/1/' ${stub}_1.fasta >${stub}c_1.fasta;
sed -E 's/length=([0-9]*)$/length=\1\/2/' ${stub}_2.fasta >${stub}c_2.fasta;
cat ${stub}c_1.fasta ${stub}c_2.fasta > ${stub}.fasta;
format_sra.pl -input ${stub}.fasta -numshards 1 -output FASTA.test
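
Because the symptom throughout this thread is empty .bucket and .sorted files, a small check like the one below can confirm whether a run is affected. It is a sketch only: list_empty_outputs is a made-up name, and it assumes the intermediate files end in .bucket or .sorted and sit in the directory you pass in.

```shell
#!/bin/sh
# Hypothetical helper: list zero-byte *.bucket and *.sorted files.
#   $1  directory to search (defaults to the current directory)
# The file name patterns are an assumption based on how the files are
# described in this thread.
list_empty_outputs() {
    find "${1:-.}" -type f \( -name '*.bucket' -o -name '*.sorted' \) -size 0 | sort
}

# Usage:
#   list_empty_outputs .
```

No output means every matching file has at least some content.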

snacktavish commented 9 years ago

Hmm, I tried that, but still no luck.

$ ./format_sra.pl -input ${stub}.fasta -number 1 -output FASTA.test

Building a new DB, current time: 04/09/2015 11:56:13
New DB name:   /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.db
New DB title:  /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.1.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.1.fasta is empty
3 at /home/ejmctavish/projects/Exelixis/aTRAM/lib/System.pm line 76.

daisieh commented 9 years ago

I think we should add the ability to use separate end files...I'll get on that. The code already exists in other places, so it shouldn't take long.

daisieh commented 9 years ago

I think this will be resolved with PR #157. I'm running the file right now to be sure, but it seems to work for a subset.

daisieh commented 9 years ago

It works now!

@snacktavish, when you do the fastq-dump, make sure you use the --split-files option so that you get separate paired reads, and then use the new version in the #157 PR as:

format_sra.pl -1 SRR610374_1.fastq -2 SRR610374_2.fastq -out my_atram_db
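
Before handing the two split files to format_sra.pl, it can be worth confirming that they list the same reads in the same order. The function below is a sketch, not part of aTRAM or the SRA Toolkit: check_pairs is an invented name, it assumes four-line FASTQ records, and it assumes the first whitespace-separated token of each header (for example @SRR610374.1) is identical for both mates, as in the output shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: verify that two FASTQ files hold the same read IDs
# in the same order.
#   $1, $2  the _1 and _2 files from fastq-dump --split-files
# Prints a one-line verdict and returns non-zero on any mismatch.
check_pairs() {
    awk -v mates="$2" '
        {
            # Read the matching line from the second file in lockstep.
            if ((getline other < mates) <= 0) {
                print "mismatch: second file is shorter"; bad = 1; exit
            }
            if (NR % 4 == 1) {              # header lines only
                split(other, b, " ")
                if ($1 != b[1]) {
                    print "mismatch at line " NR ": " $1 " vs " b[1]
                    bad = 1; exit
                }
            }
        }
        END {
            if (!bad && (getline other < mates) > 0) {
                print "mismatch: second file is longer"; bad = 1
            }
            if (!bad) printf "paired: %d read pairs\n", NR / 4
            exit (bad ? 1 : 0)
        }' "$1"
}

# Usage (file names taken from this thread):
#   check_pairs SRR610374_1.fastq SRR610374_2.fastq
```

Both files are streamed line by line, so the check does not need to hold a large run in memory.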

snacktavish commented 9 years ago

Success! (at least as far as the database construction). Thanks!

daisieh commented 9 years ago

Yay!

juliema commented 9 years ago

great!!

On Fri, Apr 24, 2015 at 1:16 PM, Daisie Huang notifications@github.com wrote:

Yay!

— Reply to this email directly or view it on GitHub https://github.com/juliema/aTRAM/issues/154#issuecomment-96021089.

Julie Allen Postdoctoral Researcher Illinois Natural History Survey University of Illinois www.juliamallen.com