Open atgerritsen opened 7 years ago
Did you try adding a space after the comma? What program/option are you using?
tried a space, but then it would quit after the first file.
I'm running the q window trimmer and then streaming the output to ATtrim and overlap. I'm using a <(zcat file file) workaround for now that works quite well.
Yeah it seems to not work: $stats -1 r1_1.fastq, r1_2.fastq -2 r2_1.fastq, r2_2.fastq -O -L stats_both.txt > /dev/null
Unhandled Exception: File r1_1.fastq, was not found.
$ ll total 24 drwxrwxr-x 2 shunter shunter 4096 Oct 11 10:10 ./ drwxrwxr-x 3 shunter shunter 4096 Oct 11 10:06 ../ -rw-rw-r-- 1 shunter shunter 3375 Oct 11 10:06 r1_1.fastq -rw-rw-r-- 1 shunter shunter 3374 Oct 11 10:07 r1_2.fastq -rw-rw-r-- 1 shunter shunter 3375 Oct 11 10:07 r2_1.fastq -rw-rw-r-- 1 shunter shunter 3374 Oct 11 10:08 r2_2.fastq
But, as Alida said, this does: $ stats -1 <(cat r1_1.fastq r1_2.fastq) -2 <(cat r2_1.fastq r2_2.fastq) -O -L stats_both.txt > /dev/null
or for gzip files:
$ stats -1 <(zcat r1_1.fastq.gz r1_2.fastq.gz) -2 <(zcat r2_1.fastq.gz r2_2.fastq.gz) -O -L stats_both.txt > /dev/null
Ahh yeah boost program options allows you to specify the argument multiple times:
stats -1 r1_1.fastq -1 r1_2.fastq ...
Does it process them in sequential order (so r1_1.fastq will stay synced with r2_1.fastq)?
yes it should
So i recommend we just fix the program documentation to reflect the way to specify multiple files
Yeah, that seems fine to me, and confirmed to work:
$ stats -1 r1_1.fastq.gz -1 r1_2.fastq.gz -2 r2_1.fastq.gz -2 r2_2.fastq.gz -L stats_both1.txt -p test
$ ll
total 48
drwxrwxr-x 2 shunter shunter 4096 Oct 11 10:30 ./
drwxrwxr-x 3 shunter shunter 4096 Oct 11 10:06 ../
-rw-rw-r-- 1 shunter shunter 1320 Oct 11 10:06 r1_1.fastq.gz
-rw-rw-r-- 1 shunter shunter 1353 Oct 11 10:07 r1_2.fastq.gz
-rw-rw-r-- 1 shunter shunter 1457 Oct 11 10:07 r2_1.fastq.gz
-rw-rw-r-- 1 shunter shunter 1495 Oct 11 10:08 r2_2.fastq.gz
-rw-rw-r-- 1 shunter shunter 304 Oct 11 10:30 stats_both1.txt
-rw-rw-r-- 1 shunter shunter 304 Oct 11 10:14 stats_both.txt
-rw-rw-r-- 1 shunter shunter 6749 Oct 11 10:30 test_R1.fastq
-rw-rw-r-- 1 shunter shunter 6749 Oct 11 10:30 test_R2.fastq
$ grep "@M" test_R1.fastq
@M01380:60:000000000-B4Y4F:1:1102:19647:1810 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17758:1815 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15487:1852 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15568:1885 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17839:1973 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:13584:1990 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:23725:2033 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:8799:2072 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:10264:2072 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:20532:2101 1:N:0:GACTGTGA+CGGAAGAA
$ grep "@M" test_R2.fastq
@M01380:60:000000000-B4Y4F:1:1102:19647:1810 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17758:1815 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15487:1852 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15568:1885 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17839:1973 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:13584:1990 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:23725:2033 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:8799:2072 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:10264:2072 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:20532:2101 2:N:0:GACTGTGA+CGGAAGAA
A whole 3 extra characters every time? :-(
Feel free to submit a pr, :P
On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:
A whole 3 extra characters every time? :-(
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .
Joe Angell cell: (720) 260-2190
something like this looks reasonable.
On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:
Feel free to submit a pr, :P
On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:
A whole 3 extra characters every time? :-(
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .
--
Joe Angell cell: (720) 260-2190
Joe Angell cell: (720) 260-2190
There is a reason why boost program options works the way it does, you cannot have a comma in the filename if you split on comma. You could then add escaping, but I'm not sure the additional complexity is worth saving 3 characters, :).
On Wed, Oct 11, 2017 at 10:43 AM, Joe Angell joe.d.angell@gmail.com wrote:
https://stackoverflow.com/questions/3065109/can-boost- program-options-separate-comma-separated-argument-values
something like this looks reasonable.
On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:
Feel free to submit a pr, :P
On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:
A whole 3 extra characters every time? :-(
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .
--
Joe Angell cell: (720) 260-2190
--
Joe Angell cell: (720) 260-2190
Joe Angell cell: (720) 260-2190
In that case there should be a check when reads are read in that they are concordat id(r1) == id(r2) if there isn't one already
Matt
On Wed, Oct 11, 2017 at 8:47 PM Joe Angell notifications@github.com wrote:
There is a reason why boost program options works the way it does, you cannot have a comma in the filename if you split on comma. You could then add escaping, but I'm not sure the additional complexity is worth saving 3 characters, :).
On Wed, Oct 11, 2017 at 10:43 AM, Joe Angell joe.d.angell@gmail.com wrote:
https://stackoverflow.com/questions/3065109/can-boost- program-options-separate-comma-separated-argument-values
something like this looks reasonable.
On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:
Feel free to submit a pr, :P
On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen <notifications@github.com
wrote:
A whole 3 extra characters every time? :-(
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread < https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY
.
--
Joe Angell cell: (720) 260-2190
--
Joe Angell cell: (720) 260-2190
--
Joe Angell cell: (720) 260-2190
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335892488, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5lfnDm5KTU_SoygcuBSQK2ZHgipfks5srP8ygaJpZM4P1nAY .
-- Sent from Gmail Mobile on my IPAD Mini 4
We just need an id() function that correctly handles all possible read ID formats.
After some discussion, we think it would be better to test that all pairs of input files have the same number of records.
Well checking the the read ids always match does that too, don't need any special handler, just straight up ==
On Wed, Oct 11, 2017 at 9:25 PM Sam Hunter notifications@github.com wrote:
After some discussion, we think it would be better to test that all pairs of input files have the same number of records.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335903606, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5r5xF7zrxpF4jCGe07jAHwAu_qqyks5srQgugaJpZM4P1nAY .
-- Sent from Gmail Mobile on my IPAD Mini 4
@some_goofy_format_R1
CCCAATAC
+
@@C@@E@E
@some_goofy_format_R2
CCCAATAC
+
@@C@@E@E
if( some_goofy_format_R1 != some_goofy_format_R2):
fail_miserably
Easysome goofy format is unsupported
When have you ever seen that? I don't think I ever have. I mod read I'd all the time, taking care they remain concordat
Matt
On Wed, Oct 11, 2017 at 9:31 PM Sam Hunter notifications@github.com wrote:
@some_goofy_format_R1 CCCAATAC + @@C@@E@E
@some_goofy_format_R2 CCCAATAC + @@C@@E@E
if( some_goofy_format_R1 != some_goofy_format_R2): fail_miserably
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335905249, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5i71JspSMkADc6RCTnIHiRjImlnCks5srQmBgaJpZM4P1nAY .
-- Sent from Gmail Mobile on my IPAD Mini 4
this seams relevant https://stackoverflow.com/questions/8175723/vector-arguments-in-boost-program-options
looks like the term ->multitoken() is needed to each add command
Yeah that looks like an easy fix.
On Sun, Oct 15, 2017 at 7:29 AM, Matt Settles notifications@github.com wrote:
this seams relevant https://stackoverflow.com/questions/8175723/vector- arguments-in-boost-program-options
looks like the term ->multitoken() is needed to each add command
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-336715259, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyUkGdDGe-Tb0fW9KjEgEpWD-ul0Fks5sshbMgaJpZM4P1nAY .
Joe Angell cell: (720) 260-2190
worked on in commit e15069d though its still not complete. So the changes due indeed allow and return multiple files in the file input parameters so -1 file1 file2 file3 -2 file1 file2 file2 -U file1 file2 and the app loops over the multiple files, but for some reason that I can't figure out only processes the first and doesn't process anything after.
I've figured out that that in InputReader the function has_next() returns 0 (should be 1) forever after the 1st set of files, why it doesn't process any more reads. But for the life of me I can't figure out how to reset so need some help from @dstreett, the below code in stats.cpp lines 91-98 are what I've been using to debug. In commented out line below ifp.has_next() is 1 for the first file and 0 for the second and ends.
for(size_t i = 0; i < read1_files.size(); ++i) {
bi::stream<bi::file_descriptor_source> is1{check_open_r(read1_files[i]), bi::close_handle};
bi::stream<bi::file_descriptor_source> is2{check_open_r(read2_files[i]), bi::close_handle};
InputReader<PairedEndRead, PairedEndReadFastqImpl> ifp(is1, is2);
// std::cout << "read file: " << read1_files[i] << ":" << ifp.has_next() << "\n";
helper_stats(ifp, pe, se, counters);
}
Any fix will also likely needs to be applied to SE, Tab, interleaved, etc.
This now seems to be only an issue on Apple Macs
Issues seems to have something to do with popen. @msettles was able to open multiple non-gzipped files, however, the application would only process the first gzipped file. I made changes in the code to cause both gzip and non gzipped files go to through popen (issue_100_osx_space_test) and both cases now only process the first.
//causes only the first file to be processed in list for both gzipped and non-gzipped
if (p.extension() == ".gz") {
f = popen(("gunzip -c " + filename).c_str(), "r");
} else {
f = popen(("cat " + filename).c_str(), "r");
}
Interesting bug, I will continue to track down.
single file inputs are fine, but if multiple files are input with commas for reads 1 and 2 it will not run:
Unhandled Exception: File 00-RawData/41_S15_L001_R1_001.fastq.gz,00-RawData/41_S15_L002_R1_001.fastq.gz,00-RawData/41_S15_L003_R1_001.fastq.gz was not found