nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Other
382 stars 183 forks source link

How does HiC-pro handle multi-mapped reads? #403

Closed YichaoOU closed 3 years ago

YichaoOU commented 3 years ago

Hello,

We performed capture-C where the human genome has >30 copies of inserted viruses. The bait sequence is from this virus so it will be mapped equally to all the copied sites. Two questions I have:

  1. Even I set RM_MULTI=0, the bowtie mapping step doesn't use the -k or -a option, which means the multi-mapped locations will be randomly picked. Is that correct?
  2. I can munally run bowtie with the multi-mapping options, then my next question is how does mergeSAM.py handle multi-mapped reads? since the input data is paire-end, ideally we want R1 and R2 to be at the same chromosome if one is multi-mapped.

Thanks, Yichao

nservant commented 3 years ago

Hi Yichao

Yes, by default, we are using the fact that bowtie reports one random position along the ones detected

Default mode: search for multiple alignments, report the best one
By default, Bowtie 2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it generally will continue to look for alignments that are nearly as good or better. It will eventually stop looking, either because it exceeded a limit placed on search effort (see -D and -R) or because it already knows all it needs to know to report an alignment. Information from the best alignments are used to estimate mapping quality (the MAPQ SAM field) and to set SAM optional fields, such as AS:i and XS:i. Bowtie 2 does not guarantee that the alignment reported is the best possible in terms of alignment score.

HiC-Pro is using these two flags 'AS' / 'XS' to define if a read is unique or not, however, we do not make any assumption regarding read mates location ... as actually the goal of the Hi-C is also to capture long-range contact (even if I agree that the probability that close fragments interact is higher)

N

nservant commented 3 years ago

I also saw your second message about -k 30. I'm a bit surprised by the results actually ... As I told you, we are only using the 'AS'/'XS' fields to call unique/repeats hits. So, it means that the multiple positions reported by -k30 do not have the 'XS' field ? is that correct ?

Then, from the bowtie manual, it is written that -k 30 searches for up to N distinct, valid alignments for each read. It means that your R1/R2 bams may have different mapping position reported ... and in this case, I would expect mergeSAM.py to crash.

I would be interested if you have any other details on the results. I'll try to investigate this point on my side too.

YichaoOU commented 3 years ago

Thanks for your quick reply! My -k 30 observation was not correct, in fact, I used -k 1. I'm re-running with -k 30 now. I will let you know the result soon. So I guess I can modify mergeSAM.py to "correct" these discordant pairs. I don't know how to do it exactly. Have to do some research.

YichaoOU commented 3 years ago

Hello,

Indeed I got an error when running bowtie with -k 30:

--------------------------------------------
Wed Feb  3 21:42:15 CST 2021
Bowtie2 alignment step1 ...
Logs: logs/20copy/mapping_step1.log

--------------------------------------------
Thu Feb  4 01:54:26 CST 2021
Bowtie2 alignment step2 ...
Logs: logs/20copy/mapping_step2.log

--------------------------------------------
Thu Feb  4 02:21:59 CST 2021
Combine R1/R2 alignment files ...
Logs: logs/20copy/mapping_combine.log

--------------------------------------------
Thu Feb  4 02:43:10 CST 2021
Mapping statistics for R1 and R2 tags ...
Logs: logs/20copy/mapping_stats.log

--------------------------------------------
Thu Feb  4 02:58:08 CST 2021
Pairing of R1 and R2 tags ...
Logs: logs/20copy/mergeSAM.log
make: *** [bowtie_pairing] Error 1

This make: *** [bowtie_pairing] Error 1 is the mergeSAM.py error (I think). And this is the mergeSAM.log

less mergeSAM.log
-------------------

/hpcf/authorized_apps/rhel7_apps/python/install/2.7.15/bin/python /research/rgs01/applications/hpcf/authorized_apps/rhel7_apps/hic-pro/install/HiC-Pro_2.11.1/scripts/mergeSAM.py -q 0 -t -v -m -f bowtie_results/bwt2/20copy/01-20copy_R1_hg19_20copy.bwt2merged.bam -r bowtie_results/bwt2/20copy/01-20copy_R2_hg19_20copy.bwt2merged.bam -o bowtie_results/bwt2/20copy/01-20copy_hg19_20copy.bwt2pairs.bam
## mergeBAM.py
## forward= bowtie_results/bwt2/20copy/01-20copy_R1_hg19_20copy.bwt2merged.bam
## reverse= bowtie_results/bwt2/20copy/01-20copy_R2_hg19_20copy.bwt2merged.bam
## output= bowtie_results/bwt2/20copy/01-20copy_hg19_20copy.bwt2pairs.bam
## min mapq= 0
## report_single= False
## report_multi= True
## verbose= True
## Merging forward and reverse tags ...
Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted.
nservant commented 3 years ago

Yes because it reads line by line the R1/R2 bams to save memory, and so expect to have the same read names in each line.

But I think -k1 should be good for your question ? as you should have a equal repartition of multi-mapped reads accross all loci ... I think it is still better than having many interaction from all repeats which are actually not "real" ?

And I still not understood why you have much less multi-mapped reads with -k 1 ? because you should still have the 'AS'/'XS' tag, isn't it ?

YichaoOU commented 3 years ago

Using -k 1, bowtie2 reported exactly 0% multi-mapped reads.

without -k
-----------
==> ./d3_keep_dup_captureC/hicpro_results/logs/d3/01-d3_R1_bowtie2.log <==
##HiC-Pro mapping
21535176 reads; of these:
  21535176 (100.00%) were unpaired; of these:
    3644897 (16.93%) aligned 0 times
    8175354 (37.96%) aligned exactly 1 time
    9714925 (45.11%) aligned >1 times
83.07% overall alignment rate

-k 1
----
==> ./d3_keep_dup_captureC_64d72f37bc28/hicpro_results/logs/d3/01-d3_R1_bowtie2.log <==
##HiC-Pro mapping
21535176 reads; of these:
  21535176 (100.00%) were unpaired; of these:
    3644897 (16.93%) aligned 0 times
    17890279 (83.07%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
83.07% overall alignment rate

-k 25
------
==> ./d3_keep_dup_bowtie2_k_captureC/hicpro_results/logs/d3/01-d3_R1_bowtie2.log <==
##HiC-Pro mapping
21535176 reads; of these:
  21535176 (100.00%) were unpaired; of these:
    3627575 (16.84%) aligned 0 times
    8142781 (37.81%) aligned exactly 1 time
    9764820 (45.34%) aligned >1 times
83.16% overall alignment rate

With -k 1, the bam file doesn't have XS tag.

AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YT:Z:UU RG:Z:BMG

with -k 25 or without -k, it has both XS and AS tag:

without -k
-----------
AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YT:Z:UU RG:Z:BMG

with -k 25
AS:i:-6 XS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:4G0C69 YT:Z:UU RG:Z:BMG
YichaoOU commented 3 years ago

Regarding it reads line by line the R1/R2 bams to save memory, and so expect to have the same read names in each line.

Then I can just sort the bam files by read name and mergeSAM should be good to go, right?

nservant commented 3 years ago

No that's the point. For instance, let's say that for a given pair, you have the R1 which align 1 time, and the R2 which align 5 times ... You will broke your pairs information ....

If you want to do that you will need to modify the code, so that it loops only on one file if the two read names are differents ...

nservant commented 3 years ago

And thank you very much for the stats reports, that's really helpful !

YichaoOU commented 3 years ago

Got it! I think I know how to proceed now.

It's good to know these many details.

Thanks, Yichao

pna059 commented 2 years ago

Hi, I am working with a plant genome, full of repeats. I have analyzed our Hi-C data using HiC-Pro and have ~60% multimapped reads. Therefore we are trying to find a way to include those in the resulting loops. I started mapping with the RM_MULTI=0, but the bowtie_global result looks identical to the RM_MULTI=1. What exactly does this setting influence and can I maybe start the pipeline from on of the later steps, without the need to repeat mapping?