lh3 / bfc

High-performance error correction for Illumina resequencing data
MIT License

bfc strips pairing info #8

Closed: macmanes closed this issue 9 years ago

macmanes commented 9 years ago

Heng - working with r181 here.

I have a set of reads with paired-end info encoded in the header comment, as per the most recent FASTQ format:

@HWI-D00310:79:C730GANXX:3:1101:1153:1917 1:Y:0:ATTACTCG
NATTTTGTGGCCACAAAAGAGTATGAACATTTAAAATAGTGAGAGTGGATAACCTTTATAGAGGACCCAATAACACACTGGTCTTCTATGGTCCTCTCCTGGTGTCTGTAATGTGTTGTCTGTGC
+
#=30<E1=1>C1E01@00@C11>E1EGGF111111111?1<:1:E1:011<1<1>11>111=:1100/>0=F>1=1:C111:<:<1111111:000<00>F:;0:0>00;B00000;;C0<0000
@HWI-D00310:79:C730GANXX:3:1101:1153:1917 2:Y:0:ATTACTCG
TCTATTTGGTTCTGGTTGATACCTGTTGGAGTGGTTGAGGTAGTGTTGCATGGTATAAGGGTTAAAGGAATGGTTCCAGGTTTTCAGATTGATGAAGATTTTCATATTGTAGTGCTTTATGCGGC
+
3<3<0111100?11111/1@111=11<10E11101=/=111=>=111011?<1=11>1111110>11111:<11:11111=11010:0=00000=E0000?00000000000:008<00808...

Running bfc correction (bfc -s 800m -k 55 -t 16 inter.fq) strips off the pairing info, which is problematic:

@HWI-D00310:79:C730GANXX:3:1101:1153:1917   ec:Z:3
NATTTTGTGGCCACAAAAGAGTATGAACATTTAAAATAGTGAGAGTGGATAACCTTTATAGAGGACCCAATAACACACTGGTCTTCTATGGTCCTCTCCTGGTGTCTGTAATGTGTTGTCTGTGC
+
#=30<E1=1>C1E01@00@C11>E1EGGF111111111?1<:1:E1:011<1<1>11>111=:1100/>0=F>1=1:C111:<:<1111111:000<00>F:;0:0>00;B00000;;C0<0000
@HWI-D00310:79:C730GANXX:3:1101:1153:1917   ec:Z:3
TCTATTTGGTTCTGGTTGATACCTGTTGGAGTGGTTGAGGTAGTGTTGCATGGTATAAGGGTTAAAGGAATGGTTCCAGGTTTTCAGATTGATGAAGATTTTCATATTGTAGTGCTTTATGCGGC
+
3<3<0111100?11111/1@111=11<10E11101=/=111=>=111011?<1=11>1111110>11111:<11:11111=11010:0=00000=E0000?00000000000:008<00808...

The workflow I am using is:

interleave-reads.py (from @dib-lab/khmer)
bfc
split-paired-reads.py
...

Obviously, when the pairing info is removed, splitting the reads back into their left and right files does not work.

lh3 commented 9 years ago

But identical read names indicate that pairing information is kept?

macmanes commented 9 years ago

Right, but the standard tools for de-interleaving, for instance those developed by @ctb, do not work unless the /1 and /2 or 1: and 2: 'tags' are there. Is there something in lh3/seqtk that would work for de-interleaving?

ctb commented 9 years ago

We can modify our scripts to handle that, no prob.
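
To illustrate the point made above, that identical read names are enough to recover pairing, here is a minimal sketch of a de-interleaver that ignores the comment field. It is not the khmer implementation; the function names (`read_fastq`, `read_name`, `split_pairs`) are hypothetical, and it assumes strict 4-line FASTQ records with the two mates of each pair adjacent in the file.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_name(header):
    """Return the read name: the header without '@' and without any comment."""
    return header[1:].split(None, 1)[0]


def split_pairs(interleaved, left, right):
    """Write first mates to `left`, second mates to `right`; return the pair count.

    Assumption: mates are adjacent and share the same read name.
    """
    records = read_fastq(interleaved)
    pairs = 0
    for first in records:
        second = next(records, None)
        if second is None or read_name(first[0]) != read_name(second[0]):
            raise ValueError(f"no mate found for {read_name(first[0])}")
        for out, (header, seq, qual) in ((left, first), (right, second)):
            out.write(f"{header}\n{seq}\n+\n{qual}\n")
        pairs += 1
    return pairs
```

Called with three open file handles (the interleaved bfc output and two output files), this splits on position and name alone, so it works whether or not the 1: and 2: tags survive.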

macmanes commented 9 years ago

Good deal, but not sure why that info should be stripped in the 1st place..

lh3 commented 9 years ago

> not sure why that info should be stripped in the 1st place..

Because BFC replaces the FASTQ comment with new information. Usually, few tools look at comments.
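
For downstream tools that do need the original comment, one possible workaround is to copy it back from the uncorrected file. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, every corrected read is assumed to appear in the original file under the same name, and mates sharing a name are assumed to keep their relative order. It holds all original comments in memory, which may be impractical for large datasets, and it discards the ec:Z: tag that bfc wrote.

```python
import io
from collections import defaultdict, deque


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_header(header):
    """Split '@name comment' into (name, comment); comment is '' if absent."""
    parts = header[1:].split(None, 1)
    return parts[0], (parts[1] if len(parts) > 1 else "")


def restore_comments(original, corrected):
    """Yield corrected records with the comment taken from the original file.

    Assumption: the n-th corrected record with a given name corresponds to
    the n-th original record with that name.
    """
    comments = defaultdict(deque)
    for header, _, _ in read_fastq(original):
        name, comment = split_header(header)
        comments[name].append(comment)
    for header, seq, qual in read_fastq(corrected):
        name, _ = split_header(header)
        comment = comments[name].popleft()  # IndexError if no comment is left
        new_header = f"@{name} {comment}" if comment else f"@{name}"
        yield new_header, seq, qual
```

Each yielded tuple can be written back out as a 4-line record, after which tag-based de-interleavers should see the 1: and 2: markers again.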

sashulkaSh commented 2 months ago

Is it possible to run bfc on the forward reads and then on the reverse reads separately, or should all reads be in one file?