Phlya closed this pull request 8 months ago.
Quick Q: what would happen if a .pairs file lacks a header? Would dedup fail entirely? If so, could we (or should we) allow both integer and string column inputs?
See Sasha's comment here https://github.com/open2c/pairtools/issues/161#issuecomment-1315505828
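For illustration only (this is not pairtools code, and headerless.pairs is just a placeholder file name), the core problem with a header-less file is that there is nothing to map a column name onto:

import pandas as pd

# Toy example, not pairtools internals: without a "#columns:" header line
# there is nothing that maps a name like "pos1" to a column, so only
# integer positions are unambiguous.
df = pd.read_csv("headerless.pairs", sep="\t", comment="#", header=None)
pos1_by_index = df[2]          # column 3 by integer position: works
# pos1_by_name = df["pos1"]    # by name: raises KeyError, no names exist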
@golobor have you managed to try it with your data where you needed this?
Yeah, it works, but seems surprisingly slow. Ankit is profiling dedup now, we'll get back to you in a few days
Did you remember to sort by the right columns? And I wonder whether, with a very high % duplication, it might be much slower...
Hi, here are the profiler reports for the old and new ways of deduplication. I think the long runtime is mainly due to the larger number of duplicates, as mentioned in this thread, because everything else looks the same except the timings.
I sorted it with pairtools sort, so it is not sorted by the restriction fragment columns, but shouldn't that be equivalent, since the fragment order follows the genomic positions? Also, pairtools sort doesn't have an option to specify column names, but if needed I can sort it manually and then rerun the deduplication; let me know.
Here are the commands that I used:
sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
$sname > pairtools_profile_old.txt
echo "done with old way"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--p1 "rfrag1"\
--p2 "rfrag2"\
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
$sname > pairtools_profile_new.txt
I have also attached the bash script with both commands, and a sample pairs file too (I generated it manually using pandas, sorting by [chrom1, chrom2, pos1, pos2] as I think pairtools sort does, because the head command was only giving me multimappers and walks). If you want to test something else, please feel free to ask.
Attachments: pairs_sample.pairs.txt, pairtools_profile_new.txt, pairtools_profile_old.txt, dedup_pairsam.sh.txt
Wow, that is a huge slow-down! Thanks for the report. It would be good to profile the KDTree itself with regard to its performance with different % duplicates...
Ah, and by the way, when using restriction fragments I would use --max-mismatch 0: you don't want to consider neighbouring fragments as duplicates. That in itself might help a little, by the way.
For the sorting, it should be the same, you are right! But perhaps we should allow column names in sort too, anyway.
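For reference, a rough sketch of what a manual sort by the restriction fragment columns could look like (a sketch only, assuming pandas, that the header carries a "#columns:" line, and placeholder file names):

import gzip
import pandas as pd

# Rough sketch of a manual sort by the restriction fragment columns, since
# pairtools sort does not take custom column names at this point. File names
# are placeholders; column names are read from the "#columns:" header line.
path = "input.pairs.restricted.gz"
header = []
with gzip.open(path, "rt") as f:
    for line in f:
        if not line.startswith("#"):
            break
        header.append(line)
columns_line = next(l for l in header if l.startswith("#columns:"))
names = columns_line.split()[1:]

pairs = pd.read_csv(path, sep="\t", comment="#", header=None, names=names)
pairs = pairs.sort_values(["chrom1", "chrom2", "rfrag1", "rfrag2"])

# Write the original header back, followed by the re-sorted body.
with gzip.open("sorted_by_rfrag.pairs.gz", "wt") as out:
    out.writelines(header)
    pairs.to_csv(out, sep="\t", header=False, index=False)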
Ah, what are the actual numbers that you get with the two approaches? I.e. how many duplicates, and total pairs?
Hi, here are the stats and profiler files for both cases; this time I used --max-mismatch 0 for the new case. It did reduce the time, but it is still quite long compared to the old method. Also, if you need the whole pairs file let me know, I can share it on Slack.
Attachments: pairtools_profile_new.txt, pairtools_profile_old.txt, pairtools_new.dedup.stats.txt, pairtools_old.dedup.stats.txt
Thank you! Yeah, could you share the full file please?
A small clarification: we've been using a non-zero max-mismatch because these are single-cell data generated with phi29 amplification followed by sonication.
Why would you do this when you are using annotated restriction sites?..
@MasterChief1O7 can you try just running the "new" version with --backend cython?
It is a temporary solution in the long run, but it works for me, and it's even faster than the "old" way.
Wow, true! Just tried it and it worked. One small thing that I noticed: the stats are slightly different, just by 100 I guess; is that an issue? Actually, some of the stats are quite different (like the cis_ ones).
I just added --backend cython and nothing else:
sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--backend cython \
--output \
>( pairtools split \
--output-pairs pairtools_old_cython.nodups.pairs.gz \
--output-sam pairtools_old_cython.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs pairtools_old_cython.unmapped.pairs.gz \
--output-sam pairtools_old_cython.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs pairtools_old_cython.dups.pairs.gz \
--output-sam pairtools_old_cython.dups.bam \
) \
--output-stats pairtools_old_cython.dedup.stats \
$sname > pairtools_profile_old_cython.txt
echo "done with old way"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 0 \
--mark-dups \
--backend cython \
--p1 "rfrag1"\
--p2 "rfrag2"\
--output \
>( pairtools split \
--output-pairs pairtools_new_cython.nodups.pairs.gz \
--output-sam pairtools_new_cython.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs pairtools_new_cython.unmapped.pairs.gz \
--output-sam pairtools_new_cython.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs pairtools_new_cython.dups.pairs.gz \
--output-sam pairtools_new_cython.dups.bam \
) \
--output-stats pairtools_new_cython.dedup.stats \
$sname > pairtools_profile_new_cython.txt
Hm, with --max-mismatch 0 there shouldn't be any difference, I think... Such a huge effect on cis_Xkb+ is weird! It might be saving the columns in the wrong order or something like that...
Why would you do this when you are using annotated restriction sites?..
Because the site of the ligation junction is typically not sequenced ("unconfirmed" ligations, as we call them in a parallel discussion), we can't be sure where exactly the restriction happened. With 4-bp cutters, consecutive restriction sites can be as close as tens of bp apart...
Ok, then actually, you could use only confirmed ligations! That was Sasha's @agalitsyna idea, which really made the data much cleaner.
It's weird that the total number of cis interactions has not really changed (+100), but all cis_X+ have dropped substantially. Looks like one of the backends might interact differently with the PairCounter that produces the stats, doesn't it?
@Phlya It might be happening here: https://github.com/open2c/pairtools/blob/3bfe8e38d5581fefee70ea252691675e19c3bab4/pairtools/lib/dedup.py#L436 p1 is passed as position to cython backend, while pandas-based backend always adds "pos1" and "pos2" to the PairCounter here: https://github.com/open2c/pairtools/blob/6d5b4d453c2ae7bf47ddcf27a660fcc7ba3b5020/pairtools/lib/stats.py#L469
Yeah that's exactly the kind of thing I was thinking about, but haven't managed to look for it! Thanks, I'll investigate/fix this.
Ok, then actually, you could use only confirmed ligations!
Use them for what?.. I'm not sure I follow: we already have rather little data due to single-cell resolution; it doesn't feel like a good idea to throw away a major fraction of it...
I guess it depends on how deeply you sequenced your libraries, but generally, since it's amplified with phi29 and randomly fragmented, if you sequence deep enough you will directly read through every ligation that happened. Then you don't actually lose any data; you are just sure that every contact you describe comes from a ligation and not from polymerase hopping (since you actually filter by the presence of the ligation-site sequence at the apparent ligation junction).
ah, that's a great point!
Sasha @agalitsyna is the best person to discuss this with, re code :)
@Phlya It might be happening here:
p1 is passed as the position to the cython backend, while the pandas-based backend always adds "pos1" and "pos2" to the PairCounter here:
Actually, I don't think that's true: the cython backend doesn't use add_pairs_from_dataframe, it adds pairs one by one with add_pair, and gives it the right values...
Ah, this actually causes the problem in the stats if pos1/2 is not the actual position! As Sasha mentioned, the dataframe-based stats have hardcoded column names. Should we do the same for the add_pair method?
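A toy illustration of the suspected effect (made-up numbers, not pairtools code): the cis_Xkb+ bins are computed from whatever is passed in as the positions, so feeding fragment indices instead of genomic coordinates shrinks every apparent distance:

# Made-up numbers: shows how passing fragment indices in place of genomic
# positions collapses the distance bins used for the cis_Xkb+ stats.
pos1, pos2 = 1_000_000, 3_000_000   # genomic coordinates: 2 Mb apart
rfrag1, rfrag2 = 4_100, 12_300      # hypothetical restriction fragment indices

print(abs(pos2 - pos1))      # 2000000 -> would be binned as cis_1Mb+
print(abs(rfrag2 - rfrag1))  # 8200    -> would look like a contact under 20 kb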
TBH for the case of restriction fragment annotation perhaps the "best" solution is simply to use the center of the restriction fragment instead of its index?
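A minimal sketch of that idea with pandas (an illustration, not part of pairtools; column names assume the output of pairtools restrict and should be checked against the actual file header):

import pandas as pd

# Sketch of the "use the fragment centre" idea: replace pos1/pos2 with the
# midpoints of the annotated restriction fragments, so that dedup distances
# stay in base pairs rather than fragment indices.
cols = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2",
        "pair_type", "mapq1", "mapq2",
        "rfrag1", "rfrag_start1", "rfrag_end1",
        "rfrag2", "rfrag_start2", "rfrag_end2"]
pairs = pd.read_csv("input.pairs.restricted.gz", sep="\t",
                    comment="#", header=None, names=cols)
pairs["pos1"] = (pairs["rfrag_start1"] + pairs["rfrag_end1"]) // 2
pairs["pos2"] = (pairs["rfrag_start2"] + pairs["rfrag_end2"]) // 2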
The slowness in scipy/sklearn is super frustrating: it doesn't come from the KDTree at all; ~70% of the runtime is spent on stupid indexing afterwards to filter for matching chromosomes, strands and any other columns. I rewrote it today, which makes it a little faster, but it's still terribly slow!
This happens here because, with the super high duplication % (94%), the resulting number of comparisons between potential duplicates is huge: for a 100,000-pair chunk it needs to do 58 million comparisons! That is completely different from a dataset with a low duplication rate. Cython doesn't care about this though; it's fast with any number of duplicates...
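For context, a simplified sketch of the pattern in question (not the actual pairtools code; column names and chunk handling are simplified): the KDTree query itself is fast, but the candidate pairs it returns then have to be filtered for matching chromosomes and strands, and with 94% duplication that candidate list explodes:

import numpy as np
from scipy.spatial import cKDTree

# Simplified sketch of the scipy-backend idea, not pairtools' actual code:
# find all row pairs whose (pos1, pos2) are within max_mismatch of each
# other, then keep only candidates that also match on chrom/strand columns.
def candidate_duplicates(chunk, max_mismatch=3):
    coords = chunk[["pos1", "pos2"]].to_numpy()
    tree = cKDTree(coords)
    # Treat max_mismatch as the allowed offset in each coordinate
    # (a simplification). With ~94% duplication this array explodes:
    # tens of millions of candidate pairs for a 100k-row chunk.
    candidates = tree.query_pairs(r=max_mismatch, p=np.inf,
                                  output_type="ndarray")
    i, j = candidates[:, 0], candidates[:, 1]
    # This post-filtering step is where most of the runtime goes.
    same = np.ones(len(candidates), dtype=bool)
    for col in ["chrom1", "chrom2", "strand1", "strand2"]:
        values = chunk[col].to_numpy()
        same &= values[i] == values[j]
    return candidates[same]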
Something is broken in our pyflake8 btw...
@MasterChief1O7 please try this version with the scipy backend if you want; it will still be slower than cython, but much faster than before :)
Just tried this version with scipy; it is much, much faster than the previous version, taking only ~1.5 min rather than ~10 min as with the old version. However, I still don't understand why there is this slight difference in the stats numbers between scipy and cython?
Cython stats are a little broken right now for non-standard columns... If you use rfrag_start1/2 instead of rfrag1/2, it'll be much better, btw.
Oh, OK, so the scipy ones are correct, right? And will the deduplication results be the same if I use rfrag_start1/2 instead of rfrag1/2? I guess with max-mismatch = 0 yes, but not with any other value, right?
Yes, scipy stats should be correct...
Indeed, with a non-zero mismatch, using rfrag will allow ± that number of restriction fragments, while with rfrag_start it will be in base pairs.
Actually, looking at the tests it seems like the non-cython backends have a problem now here... I'm looking into it.
Fixed now! @MasterChief1O7
And I also found a small preexisting bug where dedup was losing a small portion of pairs, fixed here...
Actually there are still very strange discrepancies between cython and scipy/sklearn even in just mapping stats... which is bizarre!
Stats comparison (cython vs scipy):
                             cython     scipy
total                       1498859   1498859
total_unmapped                50862     50862
total_single_sided_mapped     63125     62525
total_mapped                1384872   1385472
total_dups                  1304062   1304062
total_nodups                  80810     81410
cis                           71230     71830
trans                          9580      9580
And the pair types also differ just slightly...
@Phlya, sorry for the long break: what's the status of this? Is it ready to be merged?
Oh, I also don't remember anymore, @golobor. I think the functionality here was all good to go, but there were some weird things I discovered that are described above...
So, indeed, some lines become duplicated in the scipy engine:
(main) [anton.goloborodko@clip-login-0 test_dedup]$ cat pairs_sample.pairs.dedup_scipy.csv | grep NB551430:778:HNLWKBGXM:3:12508:20725:4668
NB551430:778:HNLWKBGXM:3:12508:20725:4668 chrV 7125494 chrV 7125100 - + UU 60 60 19523 7123598 7126019 19523 7123598 7126019
NB551430:778:HNLWKBGXM:3:12508:20725:4668 chrV 7125494 chrV 7125100 - + UU 60 60 19523 7123598 7126019 19523 7123598 7126019
Always? Or after this PR?
A first attempt to fix https://github.com/open2c/pairtools/issues/161.
Now all backends support column names (rather than numbers, even in cython) to define the chrom/pos/strand columns used.