Closed snacktavish closed 9 years ago
Hi Emily,
Quick question: do you have paired-end reads? If so, did you concatenate them into the SRR610374.fastq file? If not, that is the problem: the program is looking for both mates of each pair.
Let me know if that fixes things; if not, we will dig further.
Thanks!
Julie
On Mon, Mar 30, 2015 at 7:17 AM, Emily Jane McTavish < notifications@github.com> wrote:
Running format_sra.pl -input SRR610374.fastq -out my_atram_db, creates a bunch of empty .bucket and .sorted files and then just hangs. Seems to get further if I run it on an sra file, (format_sra.pl -input SRR610374.sra -out my_atram_db), and creates .bucket files that are not empty, but also hangs, and all the .sorted are empty.
In both cases last line of log is: Dividing fasta/fastq file into buckets for sorting.
Advice? Thanks, this is a great idea and I look forward to using it!
— Reply to this email directly or view it on GitHub https://github.com/juliema/aTRAM/issues/154.
Julie Allen Postdoctoral Researcher Illinois Natural History Survey University of Illinois www.juliamallen.com
Yes, that file should have both paired ends. It is just a fastq-dump of an sra from a paired end experiment.
OK, could you go into the test folder and run:
perl test_all.pl -debug
This will go through all the steps of aTRAM to make sure it is working properly. Let me know what output you get.
J
"All tests successfully passed."
Hmmm - maybe it is an issue with this sra file. I haven't figured out what is wrong with it, but format_sra.pl is running fine on a different one.
This is the one I couldn't get to run - SRR610374 http://www.ncbi.nlm.nih.gov/sra/SRX202248%5Baccn%5D
Hmm, yeah, that could be. Glad it is working with another file!
Could you just run
head SRR610374.fastq
and
tail SRR610374.fastq
and see what it looks like? You can even paste the output here.
Julie
$head SRR610374.fastq
@SRR610374.1 708:2:1:0:185 length=152
NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGACGAAGCTACGATTTCGCAAATCAATTATCTCTTAATATTTAGTTGGTTCCAGTAAACATAATGTTGGGGCACTTTG
+SRR610374.1 708:2:1:0:185 length=152
DKMMXXOXWWUYWWXYVXNVXVMU[[VOSWXNVYP[Y[YRGSWWYXW[SVW[WQUYV[YTYWYYYUU[YZYUY[[VI_bV\bXIV^Xbaab[F]Z_abSOZababaaaaZPabaaYbb_ba^_b
aaa]_XbSaLPbaa_a_aJJa^
a
@SRR610374.2 708:2:1:0:1322 length=152
NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAACAGAAACATTTATTGGTTCAAACTGAAGCAGCTATAAAATATGCTCGTCGAACGGTAATTATCGCTTGGATTCTTC
+SRR610374.2 708:2:1:0:1322 length=152
DNPVUUX[VUWXUW[TVUNYUZZYZTVXXZY[ZXYXXWXQTTUTVYYVPWSXX[QOS[[Y[X[[[YYY[[YXUTBBa\ba^aaZa[
_Xa^[^aaaYa
aYaaaYaTJV^aa\a
^Y\a_WNXY[TVGV[N]aY^]`XJOS_V[YW
@SRR610374.3 708:2:1:0:1333 length=152
NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCCGTTTCGCAGGCAGTAAAAGAGTCACAAGTTGAAACTGGGACTGGTTTAACTGCTTTAATATCGTCAACTAAGCGAT
$tail SRR610374.fastq
+SRR610374.22668562 708:2:120:1788:1734 length=152
aaaa]^D\
a_aaa
a[^a]`^^^`]
[[[XZ_aBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBaaaaaaa\aVBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668563 708:2:120:1788:1774 length=152
GCTACATNAGCAAATAGCGAGAGCTAGAAAAGATCATGCCTACANTACNGNCCNNNNNNNNNNNNATACTGGCAAAAAGACTTATTTAGTCCTNNNTTNNNNNNNNNTNCATTANGNNNNNNCTTACCCGTCTCATCAGTTTTAACTTTATT
+SRR610374.22668563 708:2:120:1788:1774 length=152
b]aaaXDW]
aaa][[a`a^S
^_
]ZW_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
a\U\ba`abaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668564 708:2:120:1788:164 length=152
AGTCATGNAGTTAACTAAAGCTTCGGGTAAAAAACCCATCTTACNAAANTNATNNNNNNNNNNNNCGCCGTCCCGTACATTATTCGTGGAGAANNNCANNNNNNNNNCNCTGCCNANNNNNNCTTGCTCTATGAAGCATTAGGCGCAGNAGT
+SRR610374.22668564 708:2:120:1788:164 length=152
[^a_X_WDWZabU[V[]aa]aa]Oa\WZa
XY_`^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBaaUbb[abbaaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Ok I can see a few things.
The SRR610374.sra file needs to be converted to either a FASTA or FASTQ file first; this can be done with the SRA Toolkit:
http://www.ncbi.nlm.nih.gov/books/NBK158900/
In SRR610374.fastq, for some reason the read names do not indicate which reads are mates. Usually these files have something like this at the end of the name:
@SRR610374.22668564 708:2:120:1788:164 length=152\1
and
@SRR610374.22668564 708:2:120:1788:164 length=152\2
indicating that they are mate pairs.
Can you show me the link to where you got this file?
We are getting there!
Julie
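A quick way to check whether a FASTQ file carries mate markers at all is to count header lines by their suffix. The sketch below is not part of aTRAM; the function name `count_mate_suffixes` is made up for illustration. It assumes strict four-line FASTQ records and the conventional forward-slash suffixes (/1 and /2), which is also the form used later in this thread.

```shell
# Hypothetical helper (not part of aTRAM): count FASTQ header lines by mate suffix.
# Assumes every record is exactly four lines, so headers are lines 1, 5, 9, ...
count_mate_suffixes() {
    awk 'NR % 4 == 1 {
             if ($0 ~ /\/1$/)      first++
             else if ($0 ~ /\/2$/) second++
             else                  unmarked++
         }
         END { printf "mate1=%d mate2=%d unmarked=%d\n", first, second, unmarked }' "$1"
}

# Example call:
#   count_mate_suffixes SRR610374.fastq
```

If every header lands in the unmarked count, the file gives a downstream tool no way to tell the mates apart from the names alone.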
It was from the ncbi link above (http://www.ncbi.nlm.nih.gov/sra/SRX202248%5Baccn%5D), and I used fastq-dump from the sra toolkit to make the fastq file. Thanks for your help,
Hi Emily,
OK, so I downloaded the experiment SRR610374: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser&run=SRR610374
Looking at the file, I think I can see where the paired-end reads are. If we just print out the first few lines, it looks like they are interleaved, so the first read is the first mate of a pair, the second read is its mate, and so on. Does that sound right? If so, I will write you a little script to format it so aTRAM can read it. I think we will add a step to our software to check for paired-end reads in this format as well.
Julie
@SRR610374.1.1 708:2:1:0:185 length=76
NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGA
@SRR610374.1.2 708:2:1:0:185 length=76
CGAAGCTACGATTTCGCAAATCAATTATCTCTTAATATTTAGTTGGTTCCAGTAAACATAATGTTGGGGCACTTTG
@SRR610374.2.1 708:2:1:0:1322 length=76
NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAA
@SRR610374.2.2 708:2:1:0:1322 length=76
CAGAAACATTTATTGGTTCAAACTGAAGCAGCTATAAAATATGCTCGTCGAACGGTAATTATCGCTTGGATTCTTC
@SRR610374.3.1 708:2:1:0:1333 length=76
NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCC
@SRR610374.3.2 708:2:1:0:1333 length=76
GTTTCGCAGGCAGTAAAAGAGTCACAAGTTGAAACTGGGACTGGTTTAACTGCTTTAATATCGTCAACTAAGCGAT
@SRR610374.4.1 708:2:1:0:1607 length=76
NTTTCCATTGTAGGAGTTCTTACTGTCTTGATGGTGGCAACTTNAGTTCTATCTTTAGATTTTTTTCGTTTGGCTG
@SRR610374.4.2 708:2:1:0:1607 length=76
CCACAATGGAATCATTAATCCCTACTCCGCTCAAGCCTACTGTACCGATAATTGGATTAAAGCCAAAAGGATAGCC
@SRR610374.5.1 708:2:1:0:932 length=76
NGCAAAAAATTGAGTAAATATTTTAACCTCAACATTATCTAAATCTCCATTAATAGTATTTATTTATGAATAGTAA
@SRR610374.5.2 708:2:1:0:932 length=76
TGTACCTTTAACATCAAATCGTTAAGAAAATTGTGATTGGAAGTAACCAAACTCTGCGCTCTATGTCAGCTACTTG
@SRR610374.6.1 708:2:1:0:1135 length=76
NGGCGTGAACAGTAGTATAGGGGACTGGGAGAGAATGACAAATTTAACTTCTAATATTTAACTTTTGACTTATTTC
@SRR610374.6.2 708:2:1:0:1135 length=76
TTCCTGTCGCCAATCCAACAATTACCTAACATGATTTTTTTACTTATCTTTTTCATTAATTAGGGATTAGAGATTG
@SRR610374.7.1 708:2:1:0:1911 length=76
NGACCAAGGAATTAACTCTGCCATAATTACCCATCGGTTTTCTGAGGACAACTTACTTTCAAAGGGAAATTAAAAA
@SRR610374.7.2 708:2:1:0:1911 length=76
CGCAGAAAAAAGTTAAGTAAAAATGGCGAAAAACCCCTAAAATTGGTAAAAGACCCTTGCACTTATCTGAGCAAGA
@SRR610374.8.1 708:2:1:0:766 length=76
NGACTAATTAGATCTAGATCGGGAGCAGTGTCTCCTGGTAATACAAATAAAGCCGCACCATTCGACTCAATAGTTT
@SRR610374.8.2 708:2:1:0:766 length=76
ACTTTTGGGCAATAATTTTCCTGAGATTAGAGTAGTCAACTTATTGAATCTACAACAACAATTCCGTATAGATACA
@SRR610374.9.1 708:2:1:0:1143 length=76
NGAGGATTGCAATACCATTGATTAAGAAATATTCTATAATCCTGTGATTCTAATTTTTCTGCCAGTTTTAAGTCTT
@SRR610374.9.2 708:2:1:0:1143 length=76
ATTTATTTTCCTCATTATTTTAAAAAGGTTATCTTAGAATCTACTTCTCCAGGATTAGCAACTAAAACAGAGCGCG
@SRR610374.10.1 708:2:1:0:1002 length=76
NTTTTGGTATGTAATTAAGCAGCAAGATGGAACTTGCCAAATAGCCGATTTTGATACTCATCAGCCAAAAACATCA
@SRR610374.10.2 708:2:1:0:1002 length=76
GAGGTTTACACTTTCCTGCTCGAATTAAACCTATTTTCTTGGCTATTGCTTCTTGTTCAGTTTCGTCAGCTCCCCC
Hi Emily,
I went ahead and wrote a script to edit the file accordingly. It is at:
https://github.com/juliema/phylogenomic_pipeline/blob/master/formatfastfile.pl
If you download that and run it, it will edit your file and print the result to a new output file.
Usage:
perl formatfastfile.pl inputfile outputfilename
Hope this solves your problem. I am filing this as a bug to fix in aTRAM for other types of FASTA-formatted files. Thanks for your help!
Julie
If I run that script on the fastq-dump output from the SRA, it (incorrectly) pulls out the first digit of the read number and treats it as the pairing information. It appears that the mate pairs are merged into single reads in the fastq-dump output, hence the read length difference between 152 and 76. I can export them as separate reads using fastq-dump --split-files SRR610374.sra, and then append the read information and concatenate them, if that is what is necessary.
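The merged-versus-split difference described here can be seen directly from the read lengths. The following is a minimal sketch, assuming four-line FASTQ records with unwrapped sequences; `read_length_histogram` is an illustrative name, not an existing tool.

```shell
# Hypothetical helper: print "length count" for every sequence length in a FASTQ file.
# The sequence is line 2 of each four-line record.
read_length_histogram() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len, count[len] }' "$1" | sort -n
}

# Example call:
#   read_length_histogram SRR610374.fastq
```

For this dataset, a dump with the mates merged would be expected to report reads of length 152, while the files produced with --split-files would report 76.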
Yeah, I think that will be necessary. aTRAM takes the mate pairs and stores them differently, so it needs to know which are which.
Let me know if that works.
Julie
Would it be possible for format_sra.pl to accept the paired ends separately? e.g. SRR123_1.fq and SRR123_2.fq
Yes, definitely; it is just not set up for that right now. I can open a new issue to add that feature. This might speed things up for people, eh?
Right now, if you just cat those two files together you should be good to go.
Sounds good. I just need to add a \1 (or \2) to the end of each header line, for it to properly recognize the mates, right? Will do and let you know how it goes. Thanks!
yep exactly. thanks!
Tried it, same results: still creates files and hangs.
Log file:
150331 11:57:27: Running format_sra.pl -input SRR610374.fastq -out test2_atram_db, v1.01+
150331 11:57:27: SRR610374.fastq is 11559.30 MB; we will make 23 shards.
150331 11:57:27: Dividing fasta/fastq file into buckets for sorting.
To replicate:
a=SRR610374.sra;
stub=${a%.*};
fastq-dump --split-files $a;
sed -e 's/length=\([0-9]*\)$/length=\1\\1/' ${stub}_1.fastq > ${stub}c_1.fastq;
sed -e 's/length=\([0-9]*\)$/length=\1\\2/' ${stub}_2.fastq > ${stub}c_2.fastq;
cat ${stub}c_1.fastq ${stub}c_2.fastq > ${stub}.fastq;
./format_sra.pl -input ${stub}.fastq -out test2_atram_db;
ok let me try and see what I get. Do you mind just showing the head of the SRR610374.fastq file?
Hey Emily,
I am still running your script but I found one thing.
The sed lines were not printing out the /1 and /2 at the end.
I am running these lines and will then check format_sra.pl.
Change them to this:
sed -E 's/length=([0-9]*)$/length=\1\/1/' ${stub}_1.fastq > ${stub}c_1.fastq;
sed -E 's/length=([0-9]*)$/length=\1\/2/' ${stub}_2.fastq > ${stub}c_2.fastq;
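An alternative that sidesteps regex escaping and capture groups altogether is to pick header lines by position rather than by pattern. This is only a sketch under the assumption of strict four-line FASTQ records; `add_mate_suffix` is a made-up name. Selecting by line number also avoids accidentally touching quality lines, which may legitimately begin with @ or +.

```shell
# Hypothetical helper: append a mate suffix (e.g. /1 or /2) to FASTQ header lines.
# Line 1 of each record always gets the suffix; line 3 gets it only when it
# repeats the read name (a bare "+" separator is left alone).
add_mate_suffix() {
    suffix=$1
    infile=$2
    awk -v suffix="$suffix" '
        NR % 4 == 1                   { print $0 suffix; next }
        NR % 4 == 3 && length($0) > 1 { print $0 suffix; next }
                                      { print }
    ' "$infile"
}

# Example calls, reusing the file names from this thread:
#   add_mate_suffix /1 SRR610374_1.fastq > SRR610374c_1.fastq
#   add_mate_suffix /2 SRR610374_2.fastq > SRR610374c_2.fastq
```

Checking the result with head on the output file before running format_sra.pl confirms which suffix actually ended up in the headers.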
OK, I have format_sra.pl running on that file. I have not hit any snags yet, and it seems to be going through fine.
Let me know if this solves your problem so I can close this issue on GitHub.
Thanks for using our program!
Julie
Ah, the sed difference is Linux vs. Mac; it was succeeding on my machine. But format_sra.pl is still not running, sadly.
The beginning of my file is:
$ head SRR610374.fastq
@SRR610374.1 708:2:1:0:185 length=76\1
NAAAGTATTGACATCCTTAACCAAGCTACAGAGGTGATCAATTTTTGGACAGAAAGAGGCAAAACTACTCCTATGA
+SRR610374.1 708:2:1:0:185 length=76\1
DKMMXXOXWWUYWWXYVXNVXVMU[[VOSWXNVYP[Y[YRGSWWYXW[SVW[WQUYV[YTYWYYYUU[YZYUY[[V
@SRR610374.2 708:2:1:0:1322 length=76\1
NTATCCCTCACGATGCATAGCTTTTGCTGTTTTGTCAATCTGAAAGTTCCGTTTAATTTGATTTGTATTTCTAAAA
+SRR610374.2 708:2:1:0:1322 length=76\1
DNPVUUX[VUWXUW[TVUNYUZZYZTVXXZY[ZXYXXWXQTTUTVYYVPWSXX[QOS[[Y[X[[[YYY[[YXUTBB
@SRR610374.3 708:2:1:0:1333 length=76\1
NTCGTATTTGATCGTCCCTTAGCTTCGGTAACACAACGCTGGCAGTTAGCTTGTCAGGGGGGAATTTCTCACCTCC
and the end:
tail SRR610374.fastq
+SRR610374.22668562 708:2:120:1788:1734 length=76\2
aaaaaaa\aVBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668563 708:2:120:1788:1774 length=76\2
AAGACTTATTTAGTCCTNNNTTNNNNNNNNNTNCATTANGNNNNNNCTTACCCGTCTCATCAGTTTTAACTTTATT
+SRR610374.22668563 708:2:120:1788:1774 length=76\2
a\U\ba`abaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR610374.22668564 708:2:120:1788:164 length=76\2
ACATTATTCGTGGAGAANNNCANNNNNNNNNCNCTGCCNANNNNNNCTTGCTCTATGAAGCATTAGGCGCAGNAGT
+SRR610374.22668564 708:2:120:1788:164 length=76\2
aaUbb[abbaaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Yeah, mine just threw an error as well. I wonder what is going on with this file.
Let me dig a bit more.
Hi Emily,
OK, so I found one workaround for this dataset. I am still not sure what is going on: it seems to stop adding sequences to the buckets and ends up with empty files, which suggests there is something strange with the file. I have been checking format_sra.pl on a bunch of datasets and it seems to work fine.
However, for now I got this file to work by first turning the .sra file into a FASTA file, adding the /1 and /2 onto the headers, and then running format_sra.pl using only one shard. This will give you the aTRAM database as one single big database; the BLAST step will take a bit longer, but it will work.
a=SRR610374.sra;
stub=${a%.*};
./fastq-dump --fasta --split-files $a;
sed -E 's/length=([0-9]*)$/length=\1\/1/' ${stub}_1.fasta > ${stub}c_1.fasta;
sed -E 's/length=([0-9]*)$/length=\1\/2/' ${stub}_2.fasta > ${stub}c_2.fasta;
cat ${stub}c_1.fasta ${stub}c_2.fasta > ${stub}.fasta;
format_sra.pl -input ${stub}.fasta -numshards 1 -output FASTA.test
Hmm, I tried that, but still no luck.
$ ./format_sra.pl -input ${stub}.fasta -number 1 -output FASTA.test
Building a new DB, current time: 04/09/2015 11:56:13
New DB name: /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.db
New DB title: /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.1.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File /home/ejmctavish/projects/Exelixis/aTRAM/FASTA.test.0.1.fasta is empty
3 at /home/ejmctavish/projects/Exelixis/aTRAM/lib/System.pm line 76.
I think we should add the ability to use separate paired-end files... I'll get on that. The code already exists in other places, so it shouldn't take long.
I think this will be resolved with PR #157. I'm running the file right now to be sure, but it seems to work for a subset.
It works now!
@snacktavish, when you do the fastq-dump, make sure you use the --split-files option so that you get separate paired reads, and then use the new version in the #157 PR as:
format_sra.pl -1 SRR610374_1.fastq -2 SRR610374_2.fastq -out my_atram_db
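Before handing two split files to a tool that expects mates in matching order, it can be worth confirming that the files really line up. The sketch below is an illustration only, not something aTRAM provides; `check_pairs_in_sync` is a hypothetical name. It assumes four-line FASTQ records and that both files use the same read ID as the first whitespace-delimited field of each header, as the headers shown earlier in this thread appear to.

```shell
# Hypothetical helper: exit 0 only if both FASTQ files list the same read IDs
# in the same order (which also implies the same number of records).
check_pairs_in_sync() {
    ids1=$(mktemp) || return 1
    ids2=$(mktemp) || return 1
    # The read ID is the first field of line 1 in each four-line record.
    awk 'NR % 4 == 1 { print $1 }' "$1" > "$ids1"
    awk 'NR % 4 == 1 { print $1 }' "$2" > "$ids2"
    if cmp -s "$ids1" "$ids2"; then rc=0; else rc=1; fi
    rm -f "$ids1" "$ids2"
    return "$rc"
}

# Example call:
#   check_pairs_in_sync SRR610374_1.fastq SRR610374_2.fastq && echo "pairs in sync"
```

A mismatch here would point at a truncated download or a filtering step that dropped reads from only one file.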
Success! (at least as far as the database construction). Thanks!
Yay!
great!!
On Fri, Apr 24, 2015 at 1:16 PM, Daisie Huang notifications@github.com wrote:
Yay!
— Reply to this email directly or view it on GitHub https://github.com/juliema/aTRAM/issues/154#issuecomment-96021089.