s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0
49 stars 9 forks source link

multiple file input comma sep not recognized #100

Open atgerritsen opened 7 years ago

atgerritsen commented 7 years ago

single file inputs are fine, but if multiple files are input with commas for reads 1 and 2 it will not run:

Unhandled Exception: File 00-RawData/41_S15_L001_R1_001.fastq.gz,00-RawData/41_S15_L002_R1_001.fastq.gz,00-RawData/41_S15_L003_R1_001.fastq.gz was not found

joe-angell commented 7 years ago

Did you try adding a space after the comma? What program/option are you using?

atgerritsen commented 7 years ago

tried a space, but then it would quit after the first file.

I'm running the q window trimmer and then streaming the output to ATtrim and overlap. I'm using a <(zcat file file) workaround for now that works quite well.

samhunter commented 7 years ago

Yeah it seems to not work: $stats -1 r1_1.fastq, r1_2.fastq -2 r2_1.fastq, r2_2.fastq -O -L stats_both.txt > /dev/null

    Unhandled Exception: File r1_1.fastq, was not found.

$ ll total 24 drwxrwxr-x 2 shunter shunter 4096 Oct 11 10:10 ./ drwxrwxr-x 3 shunter shunter 4096 Oct 11 10:06 ../ -rw-rw-r-- 1 shunter shunter 3375 Oct 11 10:06 r1_1.fastq -rw-rw-r-- 1 shunter shunter 3374 Oct 11 10:07 r1_2.fastq -rw-rw-r-- 1 shunter shunter 3375 Oct 11 10:07 r2_1.fastq -rw-rw-r-- 1 shunter shunter 3374 Oct 11 10:08 r2_2.fastq

samhunter commented 7 years ago

But, as Alida said, this does: $ stats -1 <(cat r1_1.fastq r1_2.fastq) -2 <(cat r2_1.fastq r2_2.fastq) -O -L stats_both.txt > /dev/null

or for gzip files:

$ stats -1 <(zcat r1_1.fastq.gz r1_2.fastq.gz) -2 <(zcat r2_1.fastq.gz r2_2.fastq.gz) -O -L stats_both.txt > /dev/null

joe-angell commented 7 years ago

Ahh yeah boost program options allows you to specify the argument multiple times:

stats -1 r1_1.fastq -1 r1_2.fastq ...
samhunter commented 7 years ago

Does it process them in sequential order (so r1_1.fastq will stay synced with r2_1.fastq)?

joe-angell commented 7 years ago

yes it should

joe-angell commented 7 years ago

So i recommend we just fix the program documentation to reflect the way to specify multiple files

samhunter commented 7 years ago

Yeah, that seems fine to me, and confirmed to work:

$ stats -1 r1_1.fastq.gz -1 r1_2.fastq.gz -2 r2_1.fastq.gz -2 r2_2.fastq.gz -L stats_both1.txt -p test

$ ll
total 48
drwxrwxr-x 2 shunter shunter 4096 Oct 11 10:30 ./
drwxrwxr-x 3 shunter shunter 4096 Oct 11 10:06 ../
-rw-rw-r-- 1 shunter shunter 1320 Oct 11 10:06 r1_1.fastq.gz
-rw-rw-r-- 1 shunter shunter 1353 Oct 11 10:07 r1_2.fastq.gz
-rw-rw-r-- 1 shunter shunter 1457 Oct 11 10:07 r2_1.fastq.gz
-rw-rw-r-- 1 shunter shunter 1495 Oct 11 10:08 r2_2.fastq.gz
-rw-rw-r-- 1 shunter shunter  304 Oct 11 10:30 stats_both1.txt
-rw-rw-r-- 1 shunter shunter  304 Oct 11 10:14 stats_both.txt
-rw-rw-r-- 1 shunter shunter 6749 Oct 11 10:30 test_R1.fastq
-rw-rw-r-- 1 shunter shunter 6749 Oct 11 10:30 test_R2.fastq

$ grep "@M" test_R1.fastq 
@M01380:60:000000000-B4Y4F:1:1102:19647:1810 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17758:1815 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15487:1852 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15568:1885 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17839:1973 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:13584:1990 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:23725:2033 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:8799:2072 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:10264:2072 1:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:20532:2101 1:N:0:GACTGTGA+CGGAAGAA

$ grep "@M" test_R2.fastq 
@M01380:60:000000000-B4Y4F:1:1102:19647:1810 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17758:1815 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15487:1852 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:15568:1885 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:17839:1973 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:13584:1990 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:23725:2033 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:8799:2072 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:10264:2072 2:N:0:GACTGTGA+CGGAAGAA
@M01380:60:000000000-B4Y4F:1:1102:20532:2101 2:N:0:GACTGTGA+CGGAAGAA
atgerritsen commented 7 years ago

A whole 3 extra characters every time? :-(

joe-angell commented 7 years ago

Feel free to submit a pr, :P

On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:

A whole 3 extra characters every time? :-(

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .

--

Joe Angell cell: (720) 260-2190

joe-angell commented 7 years ago

https://stackoverflow.com/questions/3065109/can-boost-program-options-separate-comma-separated-argument-values

something like this looks reasonable.

On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:

Feel free to submit a pr, :P

On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:

A whole 3 extra characters every time? :-(

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .

--

Joe Angell cell: (720) 260-2190

--

Joe Angell cell: (720) 260-2190

joe-angell commented 7 years ago

There is a reason why boost program options works the way it does, you cannot have a comma in the filename if you split on comma. You could then add escaping, but I'm not sure the additional complexity is worth saving 3 characters, :).

On Wed, Oct 11, 2017 at 10:43 AM, Joe Angell joe.d.angell@gmail.com wrote:

https://stackoverflow.com/questions/3065109/can-boost- program-options-separate-comma-separated-argument-values

something like this looks reasonable.

On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:

Feel free to submit a pr, :P

On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen notifications@github.com wrote:

A whole 3 extra characters every time? :-(

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY .

--

Joe Angell cell: (720) 260-2190

--

Joe Angell cell: (720) 260-2190

--

Joe Angell cell: (720) 260-2190

msettles commented 7 years ago

In that case there should be a check when reads are read in that they are concordat id(r1) == id(r2) if there isn't one already

Matt

On Wed, Oct 11, 2017 at 8:47 PM Joe Angell notifications@github.com wrote:

There is a reason why boost program options works the way it does, you cannot have a comma in the filename if you split on comma. You could then add escaping, but I'm not sure the additional complexity is worth saving 3 characters, :).

On Wed, Oct 11, 2017 at 10:43 AM, Joe Angell joe.d.angell@gmail.com wrote:

https://stackoverflow.com/questions/3065109/can-boost- program-options-separate-comma-separated-argument-values

something like this looks reasonable.

On Wed, Oct 11, 2017 at 10:41 AM, Joe Angell joe.d.angell@gmail.com wrote:

Feel free to submit a pr, :P

On Wed, Oct 11, 2017 at 10:36 AM, atgerritsen <notifications@github.com

wrote:

A whole 3 extra characters every time? :-(

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335889094, or mute the thread < https://github.com/notifications/unsubscribe-auth/AHveyafhxCGZ3qxa8mx2Edkq9VVO6KP6ks5srPyvgaJpZM4P1nAY

.

--

Joe Angell cell: (720) 260-2190

--

Joe Angell cell: (720) 260-2190

--

Joe Angell cell: (720) 260-2190

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335892488, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5lfnDm5KTU_SoygcuBSQK2ZHgipfks5srP8ygaJpZM4P1nAY .

-- Sent from Gmail Mobile on my IPAD Mini 4

samhunter commented 7 years ago

We just need an id() function that correctly handles all possible read ID formats.

samhunter commented 7 years ago

After some discussion, we think it would be better to test that all pairs of input files have the same number of records.

msettles commented 7 years ago

Well checking the the read ids always match does that too, don't need any special handler, just straight up ==

On Wed, Oct 11, 2017 at 9:25 PM Sam Hunter notifications@github.com wrote:

After some discussion, we think it would be better to test that all pairs of input files have the same number of records.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335903606, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5r5xF7zrxpF4jCGe07jAHwAu_qqyks5srQgugaJpZM4P1nAY .

-- Sent from Gmail Mobile on my IPAD Mini 4

samhunter commented 7 years ago
@some_goofy_format_R1
CCCAATAC
+
@@C@@E@E

@some_goofy_format_R2
CCCAATAC
+
@@C@@E@E
if( some_goofy_format_R1 != some_goofy_format_R2):
  fail_miserably
msettles commented 7 years ago

Easysome goofy format is unsupported

When have you ever seen that? I don't think I ever have. I mod read I'd all the time, taking care they remain concordat

Matt

On Wed, Oct 11, 2017 at 9:31 PM Sam Hunter notifications@github.com wrote:

@some_goofy_format_R1 CCCAATAC + @@C@@E@E

@some_goofy_format_R2 CCCAATAC + @@C@@E@E

if( some_goofy_format_R1 != some_goofy_format_R2): fail_miserably

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-335905249, or mute the thread https://github.com/notifications/unsubscribe-auth/AAno5i71JspSMkADc6RCTnIHiRjImlnCks5srQmBgaJpZM4P1nAY .

-- Sent from Gmail Mobile on my IPAD Mini 4

msettles commented 7 years ago

this seams relevant https://stackoverflow.com/questions/8175723/vector-arguments-in-boost-program-options

looks like the term ->multitoken() is needed to each add command

joe-angell commented 7 years ago

Yeah that looks like an easy fix.

On Sun, Oct 15, 2017 at 7:29 AM, Matt Settles notifications@github.com wrote:

this seams relevant https://stackoverflow.com/questions/8175723/vector- arguments-in-boost-program-options

looks like the term ->multitoken() is needed to each add command

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/100#issuecomment-336715259, or mute the thread https://github.com/notifications/unsubscribe-auth/AHveyUkGdDGe-Tb0fW9KjEgEpWD-ul0Fks5sshbMgaJpZM4P1nAY .

--

Joe Angell cell: (720) 260-2190

msettles commented 6 years ago

worked on in commit e15069d though its still not complete. So the changes due indeed allow and return multiple files in the file input parameters so -1 file1 file2 file3 -2 file1 file2 file2 -U file1 file2 and the app loops over the multiple files, but for some reason that I can't figure out only processes the first and doesn't process anything after.

I've figured out that that in InputReader the function has_next() returns 0 (should be 1) forever after the 1st set of files, why it doesn't process any more reads. But for the life of me I can't figure out how to reset so need some help from @dstreett, the below code in stats.cpp lines 91-98 are what I've been using to debug. In commented out line below ifp.has_next() is 1 for the first file and 0 for the second and ends.

            for(size_t i = 0; i < read1_files.size(); ++i) {
                bi::stream<bi::file_descriptor_source> is1{check_open_r(read1_files[i]), bi::close_handle};
                bi::stream<bi::file_descriptor_source> is2{check_open_r(read2_files[i]), bi::close_handle};

                InputReader<PairedEndRead, PairedEndReadFastqImpl> ifp(is1, is2);
                // std::cout << "read file: " << read1_files[i] << ":" << ifp.has_next() << "\n";
                helper_stats(ifp, pe, se, counters);
            }

Any fix will also likely needs to be applied to SE, Tab, interleaved, etc.

msettles commented 6 years ago

This now seems to be only an issue on Apple Macs

dstreett commented 6 years ago

Issues seems to have something to do with popen. @msettles was able to open multiple non-gzipped files, however, the application would only process the first gzipped file. I made changes in the code to cause both gzip and non gzipped files go to through popen (issue_100_osx_space_test) and both cases now only process the first.

    //causes only the first file to be processed in list for both gzipped and non-gzipped
    if (p.extension() == ".gz") {
        f = popen(("gunzip -c " + filename).c_str(), "r");
    } else {
        f = popen(("cat " + filename).c_str(), "r");
    }

Interesting bug, I will continue to track down.