Closed YingChen94 closed 9 months ago
"uu" marks "rescued" pairs. You can basically treat them the same as "UU"; we are considering reverting this change and labeling them as "UU" again, as it used to be. Sorry about the confusion...
Regarding mapped reads and walk policies, --walk-policy all actually reports more total "reads", because one read can generate multiple pairs... And if the additional pairs are often unmapped, you can end up with a lower % of mapped pairs. You can check the raw numbers in the stats to see if this is the right idea...
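One way to check raw numbers directly in a .pairs file is to tally the pair type column, optionally folding "uu" into "UU" as suggested above. This is only a minimal sketch: the example lines are made up, the helper name `count_pair_types` is hypothetical, and it assumes a tab-separated file whose pair type column is either named `pair_type` in a `#columns:` header line or sits in the 8th column.

```python
from collections import Counter
import io

# Made-up example lines in the .pairs layout (header lines start with '#').
EXAMPLE = """\
## pairs format v1.0
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type
r1\tchr1\t100\tchr1\t5000\t+\t-\tUU
r2\tchr1\t200\tchr2\t300\t+\t+\tuu
r3\tchr2\t50\tchr2\t900\t-\t+\tUU
r4\t!\t0\t!\t0\t-\t-\tNN
"""

def count_pair_types(handle, fold_rescued=True):
    """Tally the pair_type column of a .pairs text stream (hypothetical helper)."""
    col = 7  # assumed position of pair_type when there is no #columns line
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#columns:"):
            # Locate pair_type by name instead of relying on the default position.
            col = line.split()[1:].index("pair_type")
        if not line or line.startswith("#"):
            continue
        pair_type = line.split("\t")[col]
        if fold_rescued and pair_type == "uu":
            pair_type = "UU"  # treat rescued pairs the same as UU
        counts[pair_type] += 1
    return counts

counts = count_pair_types(io.StringIO(EXAMPLE))
for pair_type, n in sorted(counts.items()):
    print(pair_type, n)
```

For a real file, the `io.StringIO(EXAMPLE)` argument would be replaced by an open file handle; running the tally on the output of each walk policy gives absolute counts to compare rather than percentages.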
Thank you Ilya for your fast reply! It seems that --walk-policy 5unique actually recovered more pairs. Here are the stats:
--walk-policy 5unique:
Total Read Pairs 654,871,310 100%
Unmapped Read Pairs 156,646,648 23.92%
Mapped Read Pairs 220,536,058 33.68%
PCR Dup Read Pairs 67,306,526 10.28%
No-Dup Read Pairs 153,229,532 23.4%
No-Dup Cis Read Pairs 76,722,118 50.07%
No-Dup Trans Read Pairs 76,507,414 49.93%
No-Dup Valid Read Pairs (cis >= 1kb + trans) 118,133,051 77.1%
No-Dup Cis Read Pairs < 1kb 35,096,481 22.9%
No-Dup Cis Read Pairs >= 1kb 41,625,637 27.17%
No-Dup Cis Read Pairs >= 10kb 30,129,162 19.66%
--walk-policy all:
Total Read Pairs 1,084,959,417 100%
Unmapped Read Pairs 381,460,846 35.16%
Mapped Read Pairs 196,295,065 18.09%
PCR Dup Read Pairs 56,691,289 5.23%
No-Dup Read Pairs 139,603,776 12.87%
No-Dup Cis Read Pairs 70,322,699 50.37%
No-Dup Trans Read Pairs 69,281,077 49.63%
No-Dup Valid Read Pairs (cis >= 1kb + trans) 112,044,026 80.26%
No-Dup Cis Read Pairs < 1kb 27,559,750 19.74%
No-Dup Cis Read Pairs >= 1kb 42,762,949 30.63%
No-Dup Cis Read Pairs >= 10kb 28,578,042 20.47%
I'm using the Omni-C data to scaffold a genome assembly. Do you think --walk-policy 5unique is better since it gives more mapped pairs? Any insights are greatly appreciated! Thanks!
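The two stats blocks above can be compared on raw counts as well as percentages. The sketch below only reuses numbers posted in this thread; the dictionary keys and variable names are made up for illustration.

```python
# Raw counts copied from the two stats blocks above.
stats = {
    "5unique": {"total": 654_871_310, "mapped": 220_536_058, "nodup_valid": 118_133_051},
    "all": {"total": 1_084_959_417, "mapped": 196_295_065, "nodup_valid": 112_044_026},
}

def pct_mapped(s):
    """Mapped pairs as a percentage of total reported pairs."""
    return 100 * s["mapped"] / s["total"]

for policy, s in stats.items():
    print(f"{policy:8s} total={s['total']:>13,} mapped={s['mapped']:>11,} ({pct_mapped(s):.2f}%)")

# How many more pairs does `all` report in total, relative to `5unique`?
total_ratio = stats["all"]["total"] / stats["5unique"]["total"]
# Differences in absolute counts (5unique minus all).
mapped_diff = stats["5unique"]["mapped"] - stats["all"]["mapped"]
valid_diff = stats["5unique"]["nodup_valid"] - stats["all"]["nodup_valid"]

print(f"total pairs ratio (all / 5unique): {total_ratio:.2f}")
print(f"extra mapped pairs with 5unique: {mapped_diff:,}")
print(f"extra no-dup valid pairs with 5unique: {valid_diff:,}")
```

The larger denominator with all accounts for part of the drop in the mapped percentage, but the absolute number of mapped pairs is also lower with all, so the denominator alone does not seem to explain the difference in this dataset.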
Hmm, I guess it's possible that the unmapped sequences are more often at the 3' end of the read; then with 5unique it would just take the 5' segment. With all, however, such reads would be considered unmapped, since the direct junctions all contain an unmapped segment and can't be rescued. In that case 5unique will indeed give more pairs, but the extra pairs are actually indirect contacts. Then it's up to you whether you are fine with your data containing such indirect contacts (i.e. in reality there are at least two ligations in the fragment).
As an example, consider the case here, if the green and blue fragments are actually unmapped: https://pairtools.readthedocs.io/en/latest/parsing.html#rescuing-complex-walks
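The reasoning above can be illustrated with a toy model. This is not how pairtools is implemented: the functions `pairs_5unique` and `pairs_all` are hypothetical, each segment is just a label (`None` for an unmapped segment), every mapped segment is assumed to be uniquely mapped, and real details such as MAPQ filtering, overlap between the two reads, and rescue logic are ignored.

```python
def pairs_5unique(side1, side2):
    """Toy 5unique: one pair per read, from the 5'-most mapped segment of each side."""
    first1 = next((seg for seg in side1 if seg is not None), None)
    first2 = next((seg for seg in side2 if seg is not None), None)
    return [(first1, first2)]

def pairs_all(side1, side2):
    """Toy all: one pair per junction between consecutive segments of the fragment."""
    # Each side is listed 5'->3'; reversing side2 walks the fragment end to end.
    walk = side1 + side2[::-1]
    return list(zip(walk, walk[1:]))

def n_mapped(pairs):
    """Count pairs in which both segments are mapped."""
    return sum(a is not None and b is not None for a, b in pairs)

# Case 1: read 1 has a mapped 5' segment "A" and an unmapped 3' segment.
side1, side2 = ["A", None], ["B"]
p5 = pairs_5unique(side1, side2)   # [("A", "B")]: one mapped, indirect contact
pa = pairs_all(side1, side2)       # [("A", None), (None, "B")]: two pairs, none mapped
print(len(p5), n_mapped(p5), len(pa), n_mapped(pa))

# Case 2: every segment is mapped, so `all` recovers both direct junctions.
q5 = pairs_5unique(["A", "C"], ["B"])  # [("A", "B")]: skips the middle segment
qa = pairs_all(["A", "C"], ["B"])      # [("A", "C"), ("C", "B")]
print(len(q5), n_mapped(q5), len(qa), n_mapped(qa))
```

In this toy picture, all inflates the total pair count while contributing no mapped pairs whenever an unmapped segment sits between the mapped ones, which would be consistent with the stats reported earlier in the thread.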
Why would those 3' ends be unmapped? Since the 5' ends are mapped, I assume the 3' segments are not contaminants. Is it because those 3' segments come from repetitive regions? The Omni-C data is from the same individual as the reference genome, so sequence divergence should not be the cause. But repetitive elements do make up >60% of the genome.
They could just be too short... Without the data I can't say much more, really.
That makes sense. I will have a closer look at the data. Thank you Ilya for your help!!
Glad that it was helpful. I'll close this for now; feel free to reopen if things are still confusing after a more in-depth investigation.
@agalitsyna this is an interesting observation re: 5unique vs all. I also observed 5unique recover more pairs than all. I provided example data in #186.
My mapped.pairs file has "UU" and "uu" in the last column. What's the difference between them? For example, the first 5 lines after the # lines of the mapped.pairs file:
The commands are:
I also ran the same commands except using --walk-policy all, and the first 5 lines all became "UU":
I also noticed that the percentage of mapped read pairs is lower with --walk-policy all (18.09%) than with --walk-policy 5unique (33.68%). Shouldn't --walk-policy all recover more ligations?
Any insights are appreciated! Thank you in advance!
Ying