open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License
86 stars 24 forks source link

distiller (bwa mem) cannot handle Hi-C data sequenced with 25bp #141

Open sergpolly opened 5 years ago

sergpolly commented 5 years ago

Rather major stuff! as discovered by @Marlies1993 bwa mem yields empty alignments for sequences of 25bp ... apparently it isn't designed for 25bp, and something like bwa aln should be used instead ...

@golobor @nvictus @mimakaev any thought on that ? switch to minimap ? I know nothing about bwa aln ...

we'll try to proceed with bowtie2 for this dataset and will try to avoid 25bp in future, but I think this issue is important enough to consider and address

meoomen commented 5 years ago

Bowtie2 command that did work for mapping 25bp reads: bowtie2 -x $genome -1 $read_1.fastq.gz -2 $read_2.fastq.gz

Thank you!

Phlya commented 5 years ago

Is there currently a supported sequencer that produces 25 bp reads? I haven't heard of any data in the last ~5 years with such short reads!

meoomen commented 5 years ago

It's just regular MiSeq paired end run. Since I'm working with a small genome and just testing several conditions, I figured it would be sufficient. But now I wish I would have used longer reads (>50bp)...

Phlya commented 5 years ago

Author of minimap says bwa-mem should be better for Hi-C. "Furthermore, I also realize that bwa-mem will be better than minimap2 at Hi-C alignment because bwa-mem is more sensitive to short matches." https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

golobor commented 5 years ago

this seems important!

Not sure what would be the best way to deal with this issue. After a bit of googling, it does seem that aln would do a better job with shorter reads: https://sourceforge.net/p/bio-bwa/mailman/message/34978319/ http://crazyhottommy.blogspot.com/2017/06/bwa-aln-or-bwa-mem-for-short-reads-36bp.html

Importantly, mapping with aln is two-step: first, you run bwa aln and get .sai files, then you run bwa samse/sampe and convert ,sai files to .sams. We could introduce a switch to the logic of distiller if you feel that it could be important!

Anton.

On Wed, 15 May 2019 at 09:36, Ilya Flyamer notifications@github.com wrote:

Author of minimap says bwa-mem should be better for Hi-C. "Furthermore, I also realize that bwa-mem will be better than minimap2 at Hi-C alignment because bwa-mem is more sensitive to short matches." https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/mirnylab/distiller-nf/issues/141?email_source=notifications&email_token=AAG64CSU46FZOHVEEFPQNU3PVQGWTA5CNFSM4HM3X7B2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVOVW7I#issuecomment-492657533, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG64CSKJPRKQEIER376XSTPVQGWTANCNFSM4HM3X7BQ .

sergpolly commented 5 years ago

@golobor @Phlya @Marlies1993 after a little chat with Job - this is no longer URGENT, but should probably be addressed anyways - couple of arguments:

in our case it seems like bowtie2 did the job ok (for quick-n-dirty prelim check, at least), according to @Marlies1993 - we just plugged it in, instead of bwa: https://github.com/dekkerlab/distiller-nf/blob/a510c40471d267397164916d770bbe0d06dbb539/distiller.nf#L469 and rebuild the docker with bowtie2 inside

Importantly, mapping with aln is two-step: first, you run bwa aln and get .sai files, then you run bwa samse/sampe and convert ,sai files to .sams. We could introduce a switch to the logic of distiller if you feel that it could be important!

it's a shame it's not "plug n play" between bwa mem and bwa aln - but it seems like the best option to do this switch from mem to aln - with all of the intermediate steps - according to this story that @Phlya found https://github.com/mirnylab/distiller-nf/issues/141#issuecomment-492657533

golobor commented 5 years ago

Thank you for this update! These arguments seem very reasonable, indeed. The only issue with bowtie2 that I see is that it requires a different index - but it seems to be working just fine, right? re: bwa aln - we can hide away all of its logic into an .sh script that would create temp .sai files and then convert them into .sams, streaming out the result.

agalitsyna commented 4 years ago

Hi, this is not the problem for MiSeq small reads only. In methods like Hi-CO (https://doi.org/10.1016/j.cell.2018.12.014) the length of meaningful read parts is required to be 15-36 bp due to complex ligation procedure and adaptors trimming. bwa aln seems to handle the problem if inserted into distiller code at the step of mapping:


    bwa aln -t ${bwa_threads} ${bwa_index_base} ${fastq1} > ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.1.sai

    bwa aln -t ${bwa_threads} ${bwa_index_base} ${fastq2} > ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.2.sai

    bwa sampe ${bwa_index_base} ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.1.sai ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.2.sai ${fastq1} ${fastq2} \
        | pairtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
            ${parsing_options} \
            -c ${chrom_sizes} \
            | pairtools sort --nproc ${sorting_threads} \
                             -o ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.pairsam.${suffix} \
                             --tmpdir \$TASK_TMP_DIR \
            | cat```
agalitsyna commented 3 years ago

Hi, there is now a test distiller option for short reads mapping with bwa aln (see long_reads option of map): https://github.com/open2c/distiller-nf/tree/mapping-options

Any suggestions/improvements and testing are much appreciated! @Phlya @Marlies1993 @sergpolly @golobor