open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

What's the difference between "uu' and "UU" in .pairs file? #206

Closed YingChen94 closed 9 months ago

YingChen94 commented 10 months ago

My mapped.pairs file has "UU" and "uu" in the last column. What's the difference between them? For example, the first 5 lines after # lines of the mapped.pairs file:

A01754:138:HNY3CDSX5:4:1369:5565:25567  h1tg000001l 3794    h1tg000001l 282288  +   +   UR
A01754:138:HNY3CDSX5:4:1346:17508:27962 h1tg000001l 5623    h1tg000001l 5743    +   -   uu
A01754:138:HNY3CDSX5:4:2325:32154:2159  h1tg000001l 5651    h1tg000001l 10203   +   -   UU
A01754:138:HNY3CDSX5:4:1511:18855:8030  h1tg000001l 5660    h1tg000001l 1409518 -   -   uu
A01754:138:HNY3CDSX5:4:2473:16034:33583 h1tg000001l 5679    h1tg000001l 723834  -   -   uu

The codes are:

bwa mem -5SP -T0 -t44 $HAP1 $HiC1 $HiC2 | \
conda run -n env_hic --live-stream pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 --nproc-in 44 --nproc-out 44 --chroms-path $HAP1 | \
conda run -n env_hic --live-stream pairtools sort --tmpdir=$TEMPDIR --nproc 44 | \
conda run -n env_hic --live-stream pairtools dedup --nproc-in 44 --nproc-out 44 --mark-dups --output-stats hap1_stats.txt | \
conda run -n env_hic --live-stream pairtools split --nproc-in 44 --nproc-out 44 --output-pairs hap1_mapped.pairs --output-sam -|samtools view -bS -@44 | \
samtools sort -@44 -o hap1.mapped.PT.bam;samtools index hap1.mapped.PT.bam

I also ran the same codes except using --walk-policy all, and the first 5 lines became all "UU":

A01754:138:HNY3CDSX5:4:1369:5565:25567  h1tg000001l 3794    h1tg000001l 282288  +   +   UU
A01754:138:HNY3CDSX5:4:2325:32154:2159  h1tg000001l 5651    h1tg000001l 10203   +   -   UU
A01754:138:HNY3CDSX5:4:1511:18855:8030  h1tg000001l 5660    h1tg000001l 1878132 -   +   UU
A01754:138:HNY3CDSX5:4:1603:4345:34569  h1tg000001l 5711    h1tg000001l 6097    +   -   UU
A01754:138:HNY3CDSX5:4:1444:5864:33098  h1tg000001l 5791    h1tg000001l 177815  -   +   UU

I also noticed that the number of mapped read pairs is lower in --walk-policy all (18.09%) compared to --walk-policy 5unique (33.68%). Shouldn't --walk-policy all recover more ligations?

Any insights are appreciated! Thank you in advance! Ying

Phlya commented 10 months ago

uu are "rescued" pairs. You can basically treat them the same as UU, we are considering reverting this change and just label them as UU as it used to be. Sorry about the confusion...

Regarding mapped reads and walk policies, --walk-policy all actually reports more total "reads", because one read can generate multiple pairs... And if additional pairs often are unmapped, you can end up with lower % of mapped pairs. You can check the raw numbers in stats to see if this is the right idea...

YingChen94 commented 10 months ago

Thank you Ilya for your fast reply! It seems that --walk-policy 5unique actually recovered more pairs. Here are the stats:

--walk-policy 5unique:

Total Read Pairs                              654,871,310  100%
Unmapped Read Pairs                           156,646,648  23.92%
Mapped Read Pairs                             220,536,058  33.68%
PCR Dup Read Pairs                            67,306,526   10.28%
No-Dup Read Pairs                             153,229,532  23.4%
No-Dup Cis Read Pairs                         76,722,118   50.07%
No-Dup Trans Read Pairs                       76,507,414   49.93%
No-Dup Valid Read Pairs (cis >= 1kb + trans)  118,133,051  77.1%
No-Dup Cis Read Pairs < 1kb                   35,096,481   22.9%
No-Dup Cis Read Pairs >= 1kb                  41,625,637   27.17%
No-Dup Cis Read Pairs >= 10kb                 30,129,162   19.66%

--walk-policy all:

Total Read Pairs                              1,084,959,417  100%
Unmapped Read Pairs                           381,460,846    35.16%
Mapped Read Pairs                             196,295,065    18.09%
PCR Dup Read Pairs                            56,691,289     5.23%
No-Dup Read Pairs                             139,603,776    12.87%
No-Dup Cis Read Pairs                         70,322,699     50.37%
No-Dup Trans Read Pairs                       69,281,077     49.63%
No-Dup Valid Read Pairs (cis >= 1kb + trans)  112,044,026    80.26%
No-Dup Cis Read Pairs < 1kb                   27,559,750     19.74%
No-Dup Cis Read Pairs >= 1kb                  42,762,949     30.63%
No-Dup Cis Read Pairs >= 10kb                 28,578,042     20.47%

I'm using the Omni-C to scaffold the genome assembly. Do you think --walk-policy 5unique is better since it has more total pairs? Any insights are greatly appreciated! Thanks!

Phlya commented 10 months ago

Mh I guess it's possible that the unmapped sequences are more often on 3' end of the read, then with 5unique it would just take the 5' segment. However with all it would consider them unmapped, since direct junctions all contain an unmapped segment and can't be rescued. In that case indeed 5unique will give more pairs, but the extra pairs are actually indirect contacts. Then it's up to you whether you are fine with your data containing such indirect contacts (i.e. in reality there are at least two ligations in the fragment).

As an example, consider the case here, if the green and blue fragments are actually unmapped: https://pairtools.readthedocs.io/en/latest/parsing.html#rescuing-complex-walks

YingChen94 commented 10 months ago

Why would those 3' end be unmapped (since 5' end are mapped I assume those 3' end segments are not contaminants)? Is it because those 3' end segments are from repetitive regions? The Omni-C data is from the same individual as the reference genome so not due to sequence divergence. But the repetitive elements are indeed >60% in the genome.

Phlya commented 10 months ago

They could just be too short... Without the data can't say much more really

YingChen94 commented 10 months ago

That makes sense. I will have a closer look at the data. Thank you Ilya for your help!!

Phlya commented 9 months ago

Glad that it was helpful, I'll close this for now, feel free to reopen if a more in-depth investigation is still confusing.

golobor commented 9 months ago

@agalitsyna this is an interesting observation re: 5unique vs all

gorliver commented 9 months ago

I also observed 5unique recover more pairs than all. I provided an example data in #186