Reads not pairing before mapping

mmilevskiy commented 3 years ago

Hi HiC-Pro Team,

I am consistently getting an error where the read1 and read2 files map separately. Bowtie then says that 100% of my reads are not paired and this leads to >40% unmapped. Is there an error somewhere in my config file or something that is preventing Hi-Pro from recognising the two files are paired sequencing?

Thanks,

Michael

DNase_config_motif_step1.txt

HiC-Pro -i /Fastq/Run1/ -o /pro_out/motif_trimming -c DNase_config_motif_step1.txt -s mapping

nservant commented 3 years ago

Hi Michael Nothing special in your config. Could you show me the first lines of a paired fastq files, just to check the read names. And please, which version are you using ? and what is the exact error message ? Thanks

mmilevskiy commented 3 years ago

Hi Nicolas,

Here are the first few lines or the R1 and R2 files of a sample:

head -12 file1_R1.fastq @NB500916:806:HHY2KAFX2:1:11101:14965:1020 1:N:0:CGATGT TGCAGNCTCACATTCATTGGTTACCCAAGGAGGTCGCTGTGCTGTGGACAGCAGCTTAACTGCAGGGGACCAAATGAAACTGCAAGGTTCGTCCATCGATCGATGGACGAACCTAAGTTGAATTTTCTCATAAATTATAGTTGAATATAAA + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEAEEEE<AEEEEEEAEEEAEEA/EEEAEEEEEEEEEEA<AEEE/AEEEAEAEAEAE/EAE/E/ @NB500916:806:HHY2KAFX2:1:11101:10793:1020 1:N:0:CGATGT GAACTNCAAGTCTCTGAAAAAAGAAATTTAAGAAGATCTCAGAAGATAGAAAGATTTCCCATGCTCACGAATCAGGTTCGTCCATCGATCGATGGACGCACCTAGGCTCAGCGTTCTTTACTTCCTGTGGGTGAAAAGACATAAACTTAGT + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE6EEEEEEEEEEEE/EEEEEEEEEEE<EEE</<AE/E/A//AE/<E//EA///E//</EA//A///E///E/<//6EAA/A//AEA/A</<AEE<E//A @NB500916:806:HHY2KAFX2:1:11101:6444:1020 1:N:0:CGATGT AACATNCGTTTTGTACTCAATGCCCATAAATCTGACACAGGTTCGTCCATCGATCGATGGACGAACCTATGGTCAGTCTCCTCCATTGTAGGGAGTCAGGGATGGTTCAATGTATATAAATTTTTAGATACAATAAAGCACATAACTAGAA + A6AAA#EEA<EEEAEEEEEEEEE/EEEEEEEEAEEEE/AEEAA<AE6<<EE<AE//<E</EE/EAE//E/E/E</EE</E/<E<E/E/A/E/E/E/E</EEEEEE<//<EAA//EEEAEEE<E/AA/EE<<<E<<EEE/AAE//<AA<EE<

head -12 file1_R2.fastq @NB500916:806:HHY2KAFX2:1:11101:14965:1020 2:N:0:CGATGT NNNAATGCTATTATAAATATGTTTCATCTCTGGAGGCAGGATATTTAGACCATATGATGGGTATGGGCTCAGTCATATAAGGAGTTCTTGTATAAGCATCTGCTCTATTTTATATTCAACTATAATTTATGAGAAAATTCAACTTAGGTTN +

AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE<EAEEEEEEEEEEEEE<E/A<AEEEE/AAEEA<EEEAEA<EEEAAAEA<<EEA/E

@NB500916:806:HHY2KAFX2:1:11101:10793:1020 2:N:0:CGATGT NNNCAAAATAAAAGTAGATTTTTGCTGAGGGCCAATTATCTCCCCAGTTGTGTTCTTTCTTACACTAAGTTTATGTCTTTTCACCCACAGGAAGTAAAGAAAGCTGAGCCTAGGTTCGTCCATCGATCGATGGACGAACCTGATTCGTGA +

AAEAEE/EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/EEEEEE6EEE/AAEEEEEEEEAEEEEEAEEE/EEEEEEEEEEEEEEEEAEEEE<AAEEEEE/EE/EA</A/E/E/E/E//E/<E<///<E/AEE/

@NB500916:806:HHY2KAFX2:1:11101:6444:1020 2:N:0:CGATGT NNNAAAAGTAACTGGAAGATGAATGAGCATGTCCAATGACATACATGTATAACCATGTCAAAATGTAACCTATTATTTTATATGCTAACAAAGAACTGATTGTTTAAAAAACTAACTACGAGCTTTTCTCCATTTATGGAGATGACTGTAT +

AAEE6EEE<EE6EEEEE/EE//EE/EEE//<EEEE/EEEEEEEEEEEA/EEEA/<EE/EE/EAE/<EEE/EEEEA<EEEEEAEEEE/EEE/EEEAEEAEA<EAEEEA/EE/EEEAAAEEAAA/6AA/6AA</AAA</E/EAEA<A/<<

Also, I don't see an error message I don't think. It is just the alarming rate of unmapped reads and that 2 bam files come from the mapping step instead of one merged bam for both the left and right read. The log says that 100% of the reads are unmapped which I think is the error, see below.

file1_R1_bowtie2.log file1_R2_bowtie2.log

HiC-Pro v 2.11.4

nservant commented 3 years ago

ok I see ! Usually, with digestion Hi-C, we are doing two rounds of mapping to deal with chimeric reads (spanning the ligation junction et the two interacting loci). But in your case, as you are doing Dnase Hi-C, we can't do that as there is no ligation motif.

My feeling is that you have a lot of chimeric reads in your data. This can be due to a fragment size distribution which is smaller than expected or by the fact that here you have very long reads (151bp).

My advice would be to :

Trim the reads to 100 or even 75 or 50bp. This should be enough to map reads uniquely on the genome, and you should have much less chimeric reads.
Try to relax the bowtie2 parameters here, which are quite stringent by default. For instance, simply setting --very-sensitive --end-to-end --reorder could help you to rescue some alignments.
Last option, if you really want to keep your 150bp reads, I think you will have to use a local mapper. Bowtie2 can do it, but if I'm correct, it will find many mapping positions for a read, and you have no way to force the alignement to start from the 5' end of the reads. BWA-mem can do that I think, and actually many people now simply use BWA-mem for the Hi-C mapping. However, it is not included in HiC-pro, and I didn't test it myself.

To conclude, at least as a first trial, I would go for option 1 (reads trimming) Let me know if it works better Nicolas

mmilevskiy commented 3 years ago

Hi Nicolas,

Thanks for your suggestion. I am doing Omni-C which does indeed have a ligation motif:

AGGTTCGTCCATCGATCGATGGACGAACCT

You can see it in the above sequences, since the ligation motif is so larger, I think this is why they recommended doing 150bp sequencing instead of 75bp. I can re-try with relaxing bowtie2 parameters. The company originally recommended HiC-Pro, but now has their own pipeline, that I haven't been able to get up and running yet either.

https://github.com/dovetail-genomics/dovetail_tools

Thanks, Michael

nservant commented 3 years ago

Hi Michael

Thank you for the link. Indeed, they are using bwa now which is the only mapper able to do local mapping from the 5' end of reads. This is not something I will add to HiC-Pro. However, we can think about adding it in the the nf-core-hic pipeline (https://github.com/nf-core/hic) which is more flexible. Feel free to add an issue there.

But still, even if the ligation motif is very large, I think you should try to trim the reads to see if it improves the results. In practice, you do not necessarily want to read this ligation motif. This is more a control to see if the ligation works well. The only thing that you want is having the R1/R2 on different restriction fragments (or sufficiently distant in the case the DNAse Hi-C). I already analysed very nice samples with just 1% of reads spanning the junction.

Best

mmilevskiy commented 3 years ago

Hi Nicolas,

I tried two of the steps: : Reducing the bowtie2 stringency and : trimming the reads to 80bp

This increased mapping, see attached. After going through a few errors I used singularity to run HiC-Pro and everything appears to be ok I believe. I can generate the 2 plots with 53.3% reported pairs from the read trimmed fastq. Does the output all look ok now then? I might go with 75bp PE reads next time then.

plotHiCFragment_sample1.pdf plotMappingPairing_sample1.pdf

file1_R1_val_1_bowtie2.log file1_R2_val_2_bowtie2.log

Michael

nservant commented 3 years ago

Hi Michael, Yes, it looks ok. I think you are using an old HiC-Pro version (or I hope so actually) as there a small typo in the R plots. Anyway, one last suggestion, would be to decrease the min mapping quality. Here it seems that you are filtering 30% of reads because of low mapQ. Best Nicolas

mmilevskiy commented 3 years ago

Hi Nicolas,

Ok great. I will test out a few variations of sequence trimming length and mapping quality. Yes, noticed the singularity was version 3, I should get the latest version installed on our servers.

Thanks again, think that resolved this!

Michael

mmilevskiy commented 3 years ago

Resolved.

nservant / HiC-Pro