nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Other
386 stars 182 forks source link

Reads not pairing before mapping #397

Closed mmilevskiy closed 3 years ago

mmilevskiy commented 3 years ago

Hi HiC-Pro Team,

I am consistently getting an error where the read1 and read2 files map separately. Bowtie then says that 100% of my reads are not paired and this leads to >40% unmapped. Is there an error somewhere in my config file or something that is preventing Hi-Pro from recognising the two files are paired sequencing?

Thanks,

Michael

DNase_config_motif_step1.txt

HiC-Pro -i /Fastq/Run1/ -o /pro_out/motif_trimming -c DNase_config_motif_step1.txt -s mapping

nservant commented 3 years ago

Hi Michael Nothing special in your config. Could you show me the first lines of a paired fastq files, just to check the read names. And please, which version are you using ? and what is the exact error message ? Thanks

mmilevskiy commented 3 years ago

Hi Nicolas,

Here are the first few lines or the R1 and R2 files of a sample:

head -12 file1_R1.fastq @NB500916:806:HHY2KAFX2:1:11101:14965:1020 1:N:0:CGATGT TGCAGNCTCACATTCATTGGTTACCCAAGGAGGTCGCTGTGCTGTGGACAGCAGCTTAACTGCAGGGGACCAAATGAAACTGCAAGGTTCGTCCATCGATCGATGGACGAACCTAAGTTGAATTTTCTCATAAATTATAGTTGAATATAAA + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEAEEEE<AEEEEEEAEEEAEEA/EEEAEEEEEEEEEEA<AEEE/AEEEAEAEAEAE/EAE/E/ @NB500916:806:HHY2KAFX2:1:11101:10793:1020 1:N:0:CGATGT GAACTNCAAGTCTCTGAAAAAAGAAATTTAAGAAGATCTCAGAAGATAGAAAGATTTCCCATGCTCACGAATCAGGTTCGTCCATCGATCGATGGACGCACCTAGGCTCAGCGTTCTTTACTTCCTGTGGGTGAAAAGACATAAACTTAGT + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE6EEEEEEEEEEEE/EEEEEEEEEEE<EEE</<AE/E/A//AE/<E//EA///E//</EA//A///E///E/<//6EAA/A//AEA/A</<AEE<E//A @NB500916:806:HHY2KAFX2:1:11101:6444:1020 1:N:0:CGATGT AACATNCGTTTTGTACTCAATGCCCATAAATCTGACACAGGTTCGTCCATCGATCGATGGACGAACCTATGGTCAGTCTCCTCCATTGTAGGGAGTCAGGGATGGTTCAATGTATATAAATTTTTAGATACAATAAAGCACATAACTAGAA + A6AAA#EEA<EEEAEEEEEEEEE/EEEEEEEEAEEEE/AEEAA<AE6<<EE<AE//<E</EE/EAE//E/E/E</EE</E/<E<E/E/A/E/E/E/E</EEEEEE<//<EAA//EEEAEEE<E/AA/EE<<<E<<EEE/AAE//<AA<EE<

head -12 file1_R2.fastq @NB500916:806:HHY2KAFX2:1:11101:14965:1020 2:N:0:CGATGT NNNAATGCTATTATAAATATGTTTCATCTCTGGAGGCAGGATATTTAGACCATATGATGGGTATGGGCTCAGTCATATAAGGAGTTCTTGTATAAGCATCTGCTCTATTTTATATTCAACTATAATTTATGAGAAAATTCAACTTAGGTTN +

AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE<EAEEEEEEEEEEEEE<E/A<AEEEE/AAEEA<EEEAEA<EEEAAAEA<<EEA/E

@NB500916:806:HHY2KAFX2:1:11101:10793:1020 2:N:0:CGATGT NNNCAAAATAAAAGTAGATTTTTGCTGAGGGCCAATTATCTCCCCAGTTGTGTTCTTTCTTACACTAAGTTTATGTCTTTTCACCCACAGGAAGTAAAGAAAGCTGAGCCTAGGTTCGTCCATCGATCGATGGACGAACCTGATTCGTGA +

AAEAEE/EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/EEEEEE6EEE/AAEEEEEEEEAEEEEEAEEE/EEEEEEEEEEEEEEEEAEEEE<AAEEEEE/EE/EA</A/E/E/E/E//E/<E<///<E/AEE/

@NB500916:806:HHY2KAFX2:1:11101:6444:1020 2:N:0:CGATGT NNNAAAAGTAACTGGAAGATGAATGAGCATGTCCAATGACATACATGTATAACCATGTCAAAATGTAACCTATTATTTTATATGCTAACAAAGAACTGATTGTTTAAAAAACTAACTACGAGCTTTTCTCCATTTATGGAGATGACTGTAT +

AAEE6EEE<EE6EEEEE/EE//EE/EEE//<EEEE/EEEEEEEEEEEA/EEEA/<EE/EE/EAE/<EEE/EEEEA<EEEEEAEEEE/EEE/EEEAEEAEA<EAEEEA/EE/EEEAAAEEAAA/6AA/6AA</AAA</E/EAEA<A/<<

Also, I don't see an error message I don't think. It is just the alarming rate of unmapped reads and that 2 bam files come from the mapping step instead of one merged bam for both the left and right read. The log says that 100% of the reads are unmapped which I think is the error, see below.

file1_R1_bowtie2.log file1_R2_bowtie2.log

HiC-Pro v 2.11.4

nservant commented 3 years ago

ok I see ! Usually, with digestion Hi-C, we are doing two rounds of mapping to deal with chimeric reads (spanning the ligation junction et the two interacting loci). But in your case, as you are doing Dnase Hi-C, we can't do that as there is no ligation motif.

My feeling is that you have a lot of chimeric reads in your data. This can be due to a fragment size distribution which is smaller than expected or by the fact that here you have very long reads (151bp).

My advice would be to :

To conclude, at least as a first trial, I would go for option 1 (reads trimming) Let me know if it works better Nicolas

mmilevskiy commented 3 years ago

Hi Nicolas,

Thanks for your suggestion. I am doing Omni-C which does indeed have a ligation motif:

AGGTTCGTCCATCGATCGATGGACGAACCT

You can see it in the above sequences, since the ligation motif is so larger, I think this is why they recommended doing 150bp sequencing instead of 75bp. I can re-try with relaxing bowtie2 parameters. The company originally recommended HiC-Pro, but now has their own pipeline, that I haven't been able to get up and running yet either.

https://github.com/dovetail-genomics/dovetail_tools

Thanks, Michael

nservant commented 3 years ago

Hi Michael

Thank you for the link. Indeed, they are using bwa now which is the only mapper able to do local mapping from the 5' end of reads. This is not something I will add to HiC-Pro. However, we can think about adding it in the the nf-core-hic pipeline (https://github.com/nf-core/hic) which is more flexible. Feel free to add an issue there.

But still, even if the ligation motif is very large, I think you should try to trim the reads to see if it improves the results. In practice, you do not necessarily want to read this ligation motif. This is more a control to see if the ligation works well. The only thing that you want is having the R1/R2 on different restriction fragments (or sufficiently distant in the case the DNAse Hi-C). I already analysed very nice samples with just 1% of reads spanning the junction.

Best

mmilevskiy commented 3 years ago

Hi Nicolas,

I tried two of the steps: : Reducing the bowtie2 stringency and : trimming the reads to 80bp

This increased mapping, see attached. After going through a few errors I used singularity to run HiC-Pro and everything appears to be ok I believe. I can generate the 2 plots with 53.3% reported pairs from the read trimmed fastq. Does the output all look ok now then? I might go with 75bp PE reads next time then.

plotHiCFragment_sample1.pdf plotMappingPairing_sample1.pdf

file1_R1_val_1_bowtie2.log file1_R2_val_2_bowtie2.log

Michael

nservant commented 3 years ago

Hi Michael, Yes, it looks ok. I think you are using an old HiC-Pro version (or I hope so actually) as there a small typo in the R plots. Anyway, one last suggestion, would be to decrease the min mapping quality. Here it seems that you are filtering 30% of reads because of low mapQ. Best Nicolas

mmilevskiy commented 3 years ago

Hi Nicolas,

Ok great. I will test out a few variations of sequence trimming length and mapping quality. Yes, noticed the singularity was version 3, I should get the latest version installed on our servers.

Thanks again, think that resolved this!

Michael

mmilevskiy commented 3 years ago

Resolved.