statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
87 stars 29 forks source link

Generating interleaved FASTQ for bam2FastQ? #10

Closed alexlancaster closed 9 years ago

alexlancaster commented 11 years ago

Any chance of adding the option to bam2FastQ to generate an interleaved FASTQ file for paired-end reads? Currently it defaults to 2 FASTQ files. Some tools (e.g. bwa) will handle this form of data, e.g. the "-p" option for the "bwa mem" command (http://bio-bwa.sourceforge.net/bwa.shtml). It also makes it a lot easier to pipe the output from bam2FastQ directly into bwa without writing intermediate files. If it's not a feature that's currently planned, I'd be willing to look into what would be required to create a patch. Thanks!

mktrost commented 11 years ago

It was not something I was planning, but if it would be useful, it shouldn't be too difficult to add it.

Just a few questions (probably the hardest part of implementing it):

Would both ends as well as any unpaired reads (if any) all get written to 1 file in this case? Based on my quick read of the bwa page, I think this is a no. Would a user want the unpaired reads at all? Should there be an option to disable outputing unpaired reads?

Filename extension question: Typically the naming convention is: first in pair is outBase_1.fastq 2nd in pair is outBase_2.fastq unpaired in pair is outBase.fastq What would you want the naming convention to be for the interleaved file (if not written to stdout)?

I think that it makes sense to always write the first in pair first. The logic already supports this.

Would you like a flag, like --interleave? I'm open to flag name suggestions. Or would you rather just specify --firstOut & --secondOut to be the same filename? Either way should be easy enough to implement (and maybe both should be allowed).

I think for implementation, it would be checking the flag/filenames and if any are the same, then only call ifopen once and set the pointers, then only close once.

Thanks for the recommendation. Below I have some roughed out code that I am currently thinking might work. That way if I go to implement it later I remember what I am thinking now or if you want to implement it, you can see what I'm thinking.

Mary Kate

If you just do --firstOut - --secondOut - and add the following code, it might work

Instead of the current ifopen logic, use:

myUnpairedFile = ifopen(unpairedOut, "w");
if(myFirstFile == myUnpairedFile)
{
    myFirstFile = myUnpairedFile;
}
else
{
    myFirstFile = ifopen(firstOut, "w");
}
if(mySecondFile == myUnpairedFile)
{
    mySecondFile = myUnpairedFile;
}
else if(mySecondFile == myFirstFile)
{
    mySecondFile = myFirstFile;
}
else
{
    mySecondFile = ifopen(secondOut, "w");
}

Or if it never makes sense to interleave unpaired, then just check the first & second files against each other.

Then I'd make a closeFiles() method:

if((mySecondFile != myFirstFile) && (mySecondFile != myUnpairedFile) && (mySecondFile != NULL)) { ifclose(mySecondFile); mySecondFile = NULL; } if((myFirstFile != myUnpairedFile) && (mySecondFile != NULL)) { ifclose(myFirstFile); myFirstFile = NULL; } if(myUnpairedFile != NULL) { ifclose(myUnpairedFile); myUnpairedFile = NULL; }

Then call closeFiles() in the destructor and at the end of execute rather than manually closing each one.

On Fri, Sep 20, 2013 at 3:15 PM, Alex Lancaster notifications@github.comwrote:

Any chance of adding the option to bam2FastQ to generate an interleaved FASTQ file for paired-end reads? Currently it defaults to 2 FASTQ files. Some tools (e.g. bwa) will handle this form of data, e.g. the "-p" option for the "bwa mem" command (http://bio-bwa.sourceforge.net/bwa.shtml). It also makes it a lot easier to pipe the output from bam2FastQ directly into bwa without writing intermediate files. If it's not a feature that's currently planned, I'd be willing to look into what would be required to create a patch. Thanks!

— Reply to this email directly or view it on GitHubhttps://github.com/statgen/bamUtil/issues/10 .

alexlancaster commented 10 years ago

I think it would have to put unpaired reads in perhaps a separate file by default, maybe add an option to append them? For naming convention maybe outBase_interleaved.fastq by default? Yes, definitely the first pair first.

I am also testing "htscmd bam2fq" which is part of https://github.com/samtools/htslib which also generates interleaved FASTQ, but in testing it seems it does not necessarily interleave them using the (2i), (2i+1) convention as outlined in the "bwa mem" docs. In other words, if I use "bam2FastQ", I get the following:

$ grep '^\@F' 111582.test_1.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/1
@FCC0837ACXX:5:1202:10619:55648#/1
$ grep '^\@F' 111582.test_2.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/2
@FCC0837ACXX:5:1202:10619:55648#/2

whereas using the output from "htscmd bam2fq -a" on the same original bam I get this:

$ grep '^\@F' 111582.test.sorted.interleaved.fastq |head -4
@FCC0837ACXX:5:1307:4114:17474#/2
@FCD0JP0ACXX:6:1101:11361:95577#/1
@FCC0837ACXX:5:1101:8706:79404#/2
@FCD0JP0ACXX:6:1104:14809:174913#/2

so ideally Bam2FastQ would interleave them correctly.

mktrost commented 10 years ago

Ok,

I have updated bam2Fastq. Can you do a "git pull" to pull in the latest code?

I added a --merge option. (--interleave didn't work due to a naming conflict with the --in option). It produces a file with _interleaved.fastq extension.

You can change the name of the interleaved file by using --firstOut.

bin/bam bam2FastQ --in input.sam --merge Should create: input.fastq (unpaired) & input_interleaved.fastq (1st & 2nd in pair)

So if you want interleaved written to stdout:

bin/bam bam2FastQ --in input.sam --firstOut - --merge

It will write input.fastq containing the unpaired files.

Let me know if this works for you, if you have any questions, or suggestions to improve this new feature.

Mary Kate

On Thu, Sep 26, 2013 at 4:46 PM, Alex Lancaster notifications@github.comwrote:

I think it would have to put unpaired reads in perhaps a separate file by default, maybe add an option to append them? For naming convention maybe outBase_interleaved.fastq by default? Yes, definitely the first pair first.

I am also testing "htscmd bam2fq" which is part of https://github.com/samtools/htslib which also generates interleaved FASTQ, but in testing it seems it does not necessarily interleave them using the (2i), (2i+1) convention as outlined in the "bwa mem" docs. In other words, if I use "bam2FastQ", I get the following:

$ grep '^\@F' 111582.test_1.fastq |head -2 @FCC0837ACXX:5:2307:12026:198869#/1 @FCC0837ACXX:5:1202:10619:55648#/1 $ grep '^\@F' 111582.test_2.fastq |head -2 @FCC0837ACXX:5:2307:12026:198869#/2 @FCC0837ACXX:5:1202:10619:55648#/2

whereas using the output from "htscmd bam2fq -a" on the same original bam I get this:

$ grep '^\@F' 111582.test.sorted.interleaved.fastq |head -4 @FCC0837ACXX:5:1307:4114:17474#/2 @FCD0JP0ACXX:6:1101:11361:95577#/1 @FCC0837ACXX:5:1101:8706:79404#/2 @FCD0JP0ACXX:6:1104:14809:174913#/2

so ideally Bam2FastQ would interleave them correctly.

— Reply to this email directly or view it on GitHubhttps://github.com/statgen/bamUtil/issues/10#issuecomment-25202307 .

alexlancaster commented 10 years ago

Great!!

Sorry for only getting back to this now, so the new option is --merge? I think it seems to be working for us now. Let me run some more tests then I'll close.

mktrost commented 10 years ago

Correct, it is called "--merge" (I couldn't use "--interleave" because it caused confusion with the "--in" parameter.) Yes, please let me know how your testing goes. Also, let me know if you need/notice anything else.

On Fri, Mar 14, 2014 at 4:21 PM, Alex Lancaster notifications@github.comwrote:

Great!!

Sorry for only getting back to this now, so the new option is --merge? I think it seems to be working for us now. Let me run some more tests then I'll close.

Reply to this email directly or view it on GitHubhttps://github.com/statgen/bamUtil/issues/10#issuecomment-37691165 .