haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
189 stars · 20 forks

Different ValidPairs rate between chromap and bowtie2 in HiC data #147

Open WangZhSi opened 8 months ago

WangZhSi commented 8 months ago

Problem

Different ValidPairs rates between chromap and bowtie2 on the same Hi-C data with the HiC-Pro pipeline, which converts the BAM into *.validPairs format and computes statistics. With chromap the result is Dumped_pairs_rate%: 85.6677191749; with bowtie2 on the same data it is Dumped_pairs_rate%: 0.269802518711.

CMD

chromap (0.2.5-r473):

```
chromap --preset hic -x genome.index -r genome.fa -1 R1 -2 R2 --SAM -o out.sam \
  --remove-pcr-duplicates -t 8 --summary out.summary
samtools view -bh out.sam | samtools sort -@ 8 > out.bam
```

bowtie2 (2.2.3), with the same command run for R2, after which the two BAMs are merged and sorted with samtools to get out.bam:

```
bowtie2 --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder \
  --rg-id --phred33-quals -p 5 --un out.unmap.fq --rg -x genome -U R1 \
  | samtools view -F 4 -bS - > R1.bam
```

map2frag (HiC-Pro tools):

```
python mapped_2hic_fragments.py -v -a -s 0 -l 700 -f DpnII_resfrag_genome.bed -r out.bam -o outdir
```

Here I get the different ValidPairs rates.

Data Info

Hi-C data: PE150, 200× depth, clean reads

I've used chromap on many genomes; this is the first time this has happened. Chromap is very helpful.

Thanks!

mourisl commented 8 months ago

Is this the human genome? Do you also see very few alignments in the out.sam file?

WangZhSi commented 8 months ago

It's a plant genome. Here is the flagstat result of bam:

chromap.out.bam:

```
197585161 + 0 in total (QC-passed reads + QC-failed reads)
197585161 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197585161 + 0 mapped (100.00% : N/A)
197585161 + 0 primary mapped (100.00% : N/A)
197585161 + 0 paired in sequencing
98802418 + 0 read1
98782743 + 0 read2
197585161 + 0 properly paired (100.00% : N/A)
197585161 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
67103149 + 0 with mate mapped to a different chr
62048492 + 0 with mate mapped to a different chr (mapQ>=5)
```

bowtie2.out.bam (after merge and rmdup):

```
225507040 + 0 in total (QC-passed reads + QC-failed reads)
225507040 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
225507040 + 0 mapped (100.00% : N/A)
225507040 + 0 primary mapped (100.00% : N/A)
225507040 + 0 paired in sequencing
112753520 + 0 read1
112753520 + 0 read2
225507040 + 0 properly paired (100.00% : N/A)
225507040 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
76941620 + 0 with mate mapped to a different chr
74674808 + 0 with mate mapped to a different chr (mapQ>=5)
```

The alignments look OK; I'm not terribly bothered by the difference in the number of mapped reads.

mourisl commented 8 months ago

Then the discrepancy between chromap and Bowtie2 probably comes from the map2frag stage. If that tool uses the TLEN field in the BAM file to determine the insert size, there could be an issue. We recently fixed a TLEN bug for Hi-C data: https://github.com/haowenz/chromap/issues/139. If TLEN is the issue in your data, could you please pull the version on GitHub and give it a try? If this fixes your bug, we will cut a release.
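For context, the SAM spec defines TLEN as the signed observed template length: the span from the leftmost to the rightmost mapped base of the pair, with the leftmost segment carrying the plus sign and the rightmost the minus sign. A minimal sketch of that convention (the function name is hypothetical, not chromap code):

```python
def expected_tlen(pos1, end1, pos2, end2):
    """Signed observed template length per the SAM spec:
    rightmost mapped base minus leftmost mapped base, plus 1.
    The leftmost segment of the pair gets the plus sign."""
    leftmost = min(pos1, pos2)
    rightmost = max(end1, end2)
    length = rightmost - leftmost + 1
    return length if pos1 <= pos2 else -length

# read1 spans 1000-1149 and its mate spans 5000-5149 on the same chromosome:
print(expected_tlen(1000, 1149, 5000, 5149))  # 4150 (read1 is leftmost)
print(expected_tlen(5000, 5149, 1000, 1149))  # -4150 (read1 is rightmost)
```

For a far-cis Hi-C contact the mates can legitimately be megabases apart, so a large |TLEN| alone is not a sign of a bug.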

WangZhSi commented 8 months ago

I pulled the li_dev5 branch and got this error:

```
In file included from src/chromap_driver.cc:10:0:
src/chromap.h: In instantiation of ‘void chromap::Chromap::MapSingleEndReads() [with MappingRecord = chromap::PAFMapping]’:
src/chromap_driver.cc:669:70:   required from here
src/chromap.h:302:9: error: ‘chromap::Chromap::num_correctedbarcode’ is not a variable in clause ‘reduction’

#pragma omp parallel shared(numreads, mm_history, reference, index, read_batch, barcode_batch, read_batch_for_loading, barcode_batch_for_loading, std::cerr, num_loaded_reads_for_loading, num_loaded_reads, num_reference_sequences, mappings_on_diff_ref_seqs_for_diff_threads, mappings_on_diff_ref_seqs_for_diff_threads_for_saving, mappings_on_diff_ref_seqs, temp_mapping_file_handles, mm_to_candidates_cache, mapping_writer, minimizer_generator, candidate_processor, mapping_processor, draft_mapping_generator, mapping_generator, num_mappings_in_mem, max_num_mappings_in_mem) num_threads(mappingparameters.num_threads) reduction(+:numcandidates, nummappings, num_mappedreads, num_uniquely_mappedreads, num_barcode_inwhitelist, num_correctedbarcode)

src/chromap.h:302:9: error: ‘chromap::Chromap::num_barcode_inwhitelist’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_uniquely_mappedreads’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_mappedreads’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::nummappings’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::numcandidates’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::numreads’ is not a variable in clause ‘shared’
src/chromap.h: In instantiation of ‘void chromap::Chromap::MapSingleEndReads() [with MappingRecord = chromap::SAMMapping]’:
src/chromap_driver.cc:673:70:   required from here
```

My command: `cd chromap-li_dev5 && make`

mourisl commented 8 months ago

You can use the main branch; it should contain the TLEN fix. I will look into the issue you found on li_dev5.

WangZhSi commented 8 months ago

I pulled the main branch and re-ran my data (mapping with the new chromap, sorting, map2frag); the Dumped_pairs_rate did NOT change.

But I checked the TLEN values in the BAM (or SAM) from chromap, and that seems to be where the error is coming from.

[screenshot: TLEN values in the chromap SAM output]

BTW, the error I got when building the li_dev5 branch came from my OS version; you can ignore it.

mourisl commented 8 months ago

You can see that the alignments are indeed very far from each other, so the absolute value of TLEN is large. The sign is based on which mate of the pair maps to the forward or reverse strand. Since Hi-C fragment sizes can be very large, the command `python mapped_2hic_fragments.py -v -a -s 0 -l 700`, which restricts the insert size to between 0 and 700, might not be desirable.
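To make the effect of `-s 0 -l 700` concrete, here is a rough sketch (the helper name is hypothetical, not HiC-Pro code) of an insert-size window filter of this kind; any pair whose mates sit more than 700 bp apart by |TLEN| would be dumped by it:

```python
def passes_insert_size_filter(tlen, min_size=0, max_size=700):
    """Hypothetical stand-in for an insert-size window like `-s 0 -l 700`:
    keep a read pair only if |TLEN| falls inside [min_size, max_size]."""
    return min_size <= abs(tlen) <= max_size

# A short-range pair survives, but a typical far-cis Hi-C contact does not:
print(passes_insert_size_filter(350))        # True
print(passes_insert_size_filter(2_500_000))  # False
```

Under this assumption, a TLEN convention that reports the true genomic span of far-apart mates would push almost every long-range contact outside a 700 bp window.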

WangZhSi commented 8 months ago

Thank you for your advice! I changed my command to `python mapped_2hic_fragments.py -v -a -s 0 -l 100000000`, which restricts the insert size to between 0 and 100 Mb, and the Dumped_pairs_rate improved a little, to 84.56% (from 85.66%). Maybe the error comes from the BAM-to-validPairs step of the HiC-Pro pipeline, but I don't know why; I hope you can solve this problem. If you need any files or intermediate results, please let me know.

Anyway, I'll try another pipeline from chromap to a .hic file. I think there's no substitute for chromap's speed advantage!

mourisl commented 8 months ago

Thank you! For Hi-C data I think there is no need to restrict the insert size, because many alignments are expected to be far from each other or on different chromosomes. I'm curious how Bowtie2 sets TLEN values in your data.