open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License

Allow column names in all dedup backends, and allow sorting by arbitrary columns #162

Closed Phlya closed 8 months ago

Phlya commented 2 years ago

First try to fix https://github.com/open2c/pairtools/issues/161

Now all backends support column names (not numbers, even in cython) to define which chrom/pos/strand columns are used.

golobor commented 2 years ago

quick Q - what would happen if a .pairs file lacks a header? would dedup fail entirely? If yes, could we (or, should we) allow both integer and string inputs?
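Supporting both forms could be a small helper along these lines (a hypothetical sketch, not pairtools code; `resolve_column` and its behavior are assumptions for illustration):

```python
def resolve_column(col, header_columns):
    """Resolve a column given as a 0-based index (int) or a name (str).

    header_columns: list of column names parsed from the .pairs header;
    may be empty if the file lacks a #columns line.
    """
    if isinstance(col, int):
        return col
    if col.isdigit():  # allow "3" passed as a string from the CLI
        return int(col)
    try:
        return header_columns.index(col)
    except ValueError:
        raise ValueError(
            f"Column {col!r} not found in header; "
            "pass a 0-based index for headerless files"
        )

cols = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]
resolve_column("pos1", cols)  # -> 2
resolve_column(2, cols)       # -> 2
```

With a helper like this, headerless files would still work as long as the user passes indices.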

Phlya commented 2 years ago

See Sasha's comment here https://github.com/open2c/pairtools/issues/161#issuecomment-1315505828

Phlya commented 2 years ago

@golobor have you managed to try it with your data where you needed this?

golobor commented 2 years ago

Yeah, it works, but seems surprisingly slow. Ankit is profiling dedup now, we'll get back to you in a few days


Phlya commented 2 years ago

Did you remember to sort by the right columns? And I wonder then whether with very high % duplication it might be much slower...

MasterChief1O7 commented 1 year ago

Hi, here are the profiler reports for the old and new way of deduplication. I think the long runtime is mainly due to the higher number of duplicates, as mentioned in this thread, because everything else looks the same except the timings.

I sorted it with pairtools sort, so it is not sorted by the restriction fragment columns, but ideally that shouldn't matter, because the fragment order follows the genomic order of the fragments' locations anyway. Also, pairtools sort doesn't have an option to specify column names, but if needed I can sort it manually and then rerun the deduplication, let me know.

here are the commands that I used:


sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_old.txt

echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_new.txt

I have also attached the bash script with both commands, and a sample pairs file too (I generated it manually using pandas, sorting by [chrom1, chrom2, pos1, pos2] to mimic what I think pairtools sort does, because the head command was only giving me multimappers and walks). If you want to test something else, please feel free to ask.
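The manual sorting described above can be sketched with pandas (toy data and assumed column names, not the actual sample file):

```python
import pandas as pd

# Toy pairs table; in practice the #columns header line is parsed separately
pairs = pd.DataFrame({
    "chrom1": ["chr2", "chr1", "chr1"],
    "chrom2": ["chr2", "chr1", "chr1"],
    "pos1":   [500, 300, 100],
    "pos2":   [600, 400, 200],
})
# Lexicographic sort by chromosome pair, then positions,
# mimicking the (chrom1, chrom2, pos1, pos2) order of pairtools sort
pairs_sorted = pairs.sort_values(["chrom1", "chrom2", "pos1", "pos2"])
print(pairs_sorted["pos1"].tolist())   # [100, 300, 500]
```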

pairs_sample.pairs.txt pairtools_profile_new.txt pairtools_profile_old.txt dedup_pairsam.sh.txt

Phlya commented 1 year ago

Wow, that is a huge slow-down! Thanks for the report. It would be good to profile the KDTree itself with regard to its performance at different % duplicates...

Ah and by the way, when using restriction fragments, I would use --max-mismatch 0 - you don't want to consider neighbouring fragments as duplicates. That in itself might help a little, by the way.
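To illustrate the units point (illustrative numbers, not from this dataset): --max-mismatch is applied in whatever units the chosen columns carry, so a tolerance of 3 over fragment index columns spans three whole fragments, not 3 bp.

```python
# max-mismatch is applied in the units of the chosen columns
max_mismatch = 3

# With rfrag index columns, a tolerance of 3 merges pairs up to
# three whole restriction fragments apart:
rfrag_a, rfrag_b = 100, 103
print(abs(rfrag_a - rfrag_b) <= max_mismatch)   # True: marked as duplicates

# With genomic position columns, the same tolerance is only 3 bp:
pos_a, pos_b = 1_000_000, 1_000_450
print(abs(pos_a - pos_b) <= max_mismatch)       # False: kept as distinct
```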

For the sorting, it should be the same, you are right! But perhaps we should allow column names in sort too, anyway.

Phlya commented 1 year ago

Ah, what are the actual numbers that you get with the two approaches? I.e. how many duplicates, and total pairs?

MasterChief1O7 commented 1 year ago

Hi, here are the stats and profiler files for both cases. This time I used --max-mismatch 0 for the new case; it did reduce the time, but it's still quite long compared to the old method. Also, if you need the whole pairs file, let me know, I can share it on Slack.

pairtools_profile_new.txt pairtools_profile_old.txt pairtools_new.dedup.stats.txt pairtools_old.dedup.stats.txt

Phlya commented 1 year ago

Thank you! Yeah, could you share the full file please?

golobor commented 1 year ago

A small clarification - we've been using non zero max mismatch, because these are single cell data generated with phi29 followed by sonication


Phlya commented 1 year ago

Why would you do this when you are using annotated restriction sites?..

Phlya commented 1 year ago

@MasterChief1O7 can you try just running the "new" version with --backend cython?

Phlya commented 1 year ago

It is a temporary solution in the long run, but it works for me, and it's even faster than the "old" way.

MasterChief1O7 commented 1 year ago

Wow! True, just tried it and it worked. One small thing that I noticed: the stats are slightly different, just by ~100 I guess, is that an issue? But ya, some of the stats are quite different (like the cis_ ones).

I just added cython and nothing else

sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --backend cython \
    --output \
        >( pairtools split \
            --output-pairs pairtools_old_cython.nodups.pairs.gz \
            --output-sam pairtools_old_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_old_cython.unmapped.pairs.gz \
            --output-sam pairtools_old_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_old_cython.dups.pairs.gz \
            --output-sam pairtools_old_cython.dups.bam \
         ) \
    --output-stats pairtools_old_cython.dedup.stats \
    $sname > pairtools_profile_old_cython.txt

echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 0 \
    --mark-dups \
    --backend cython \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs pairtools_new_cython.nodups.pairs.gz \
            --output-sam pairtools_new_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_new_cython.unmapped.pairs.gz \
            --output-sam pairtools_new_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_new_cython.dups.pairs.gz \
            --output-sam pairtools_new_cython.dups.bam \
         ) \
    --output-stats pairtools_new_cython.dedup.stats \
    $sname > pairtools_profile_new_cython.txt

[attached screenshot: dedup stats comparison]

Phlya commented 1 year ago

Mh, with --max-mismatch 0 there shouldn't be any difference, I think... Such a huge effect on cis_Xkb+ is weird! It might be saving the columns in the wrong order or something like that...

golobor commented 1 year ago

Why would you do this when you are using annotated restrictions sites?..

b/c the site of the ligation junction is typically not sequenced ("unconfirmed" ligations, as we call them in a parallel discussion), so we can't be sure where exactly restriction happened. With 4bp cutters, consecutive restriction sites can be as close as 10s of bp apart...

Phlya commented 1 year ago

Ok, then actually, you could use only confirmed ligations! That was Sasha's @agalitsyna idea, which really made the data much cleaner.

agalitsyna commented 1 year ago

It's weird that the total number of cis interactions has not really changed (+100), but all cis_X+ have dropped substantially. Looks like one of the backends might interact differently with the PairCounter that produces the stats, doesn't it?
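That pattern is exactly what you'd expect if fragment indices were being fed into the distance bins somewhere in the stats path. A toy illustration (the bin edges here are made up, not necessarily pairtools' defaults):

```python
# Illustrative cis_Xkb+ bin edges (assumed, not pairtools' actual defaults)
bins = [1_000, 2_000, 4_000, 10_000, 20_000, 40_000]

pos_distance = abs(1_250_000 - 1_230_000)   # 20_000 bp between mates
rfrag_distance = abs(523 - 519)             # same pair as fragment indices: 4

# A genuine 20 kb cis pair is counted in five of the cis_Xkb+ bins...
print(sum(pos_distance >= b for b in bins))    # 5
# ...but expressed in fragment indices it falls in none, so total cis
# barely moves while every cis_Xkb+ counter collapses.
print(sum(rfrag_distance >= b for b in bins))  # 0
```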

agalitsyna commented 1 year ago

@Phlya It might be happening here: https://github.com/open2c/pairtools/blob/3bfe8e38d5581fefee70ea252691675e19c3bab4/pairtools/lib/dedup.py#L436 p1 is passed as position to cython backend, while pandas-based backend always adds "pos1" and "pos2" to the PairCounter here: https://github.com/open2c/pairtools/blob/6d5b4d453c2ae7bf47ddcf27a660fcc7ba3b5020/pairtools/lib/stats.py#L469

Phlya commented 1 year ago

Yeah that's exactly the kind of thing I was thinking about, but haven't managed to look for it! Thanks, I'll investigate/fix this.

golobor commented 1 year ago

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

Phlya commented 1 year ago

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

I guess it depends on how deeply you sequenced your libraries, but generally, since the material is amplified with phi29 and randomly fragmented, if you sequence deep enough you will directly read through every ligation that happened. Then you don't actually lose any data, you are just sure every contact you describe comes from ligation and not from polymerase hopping (since you actually filter by the presence of the ligation site sequence at the apparent ligation junction).

golobor commented 1 year ago

ah, that's a great point!

Phlya commented 1 year ago

ah, that's a great point!

Sasha @agalitsyna is the best person to discuss this with, re code :)

Phlya commented 1 year ago

@Phlya It might be happening here:

https://github.com/open2c/pairtools/blob/3bfe8e38d5581fefee70ea252691675e19c3bab4/pairtools/lib/dedup.py#L436

p1 is passed as position to cython backend, while pandas-based backend always adds "pos1" and "pos2" to the PairCounter here:

https://github.com/open2c/pairtools/blob/6d5b4d453c2ae7bf47ddcf27a660fcc7ba3b5020/pairtools/lib/stats.py#L469

Actually, I don't think that's true: the cython backend doesn't use add_pairs_from_dataframe, it adds them one by one with add_pair, and gives it the right values...

Phlya commented 1 year ago

ah this actually causes the problem in stats, if pos1/2 is not the actual position!

As Sasha mentioned, the dataframe-based stats have hardcoded column names. Should we do the same for the add_pair method?

Phlya commented 1 year ago

TBH for the case of restriction fragment annotation perhaps the "best" solution is simply to use the center of the restriction fragment instead of its index?
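A sketch of that midpoint idea, assuming the table carries the rfrag_start1/rfrag_end1 fragment-annotation columns (toy values, not real data):

```python
import pandas as pd

# Toy pair table with fragment-boundary columns like those
# added by pairtools restrict (values here are made up)
df = pd.DataFrame({
    "rfrag_start1": [100, 100, 500],
    "rfrag_end1":   [200, 200, 650],
})
# Midpoint of the assigned fragment, in base pairs, so that
# --max-mismatch keeps its usual bp meaning and the stats
# machinery sees genuine genomic coordinates
df["rfrag_mid1"] = (df["rfrag_start1"] + df["rfrag_end1"]) // 2
print(df["rfrag_mid1"].tolist())   # [150, 150, 575]
```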

Phlya commented 1 year ago

The slowness in scipy/sklearn is super frustrating: it doesn't come from the KDTree at all, ~70% of the runtime is spent on stupid indexing afterwards to filter for matching chromosomes, strands and any other columns. I rewrote it today, which makes it a little faster, but it's still terribly slow!

This happens here because, with the super high duplication rate (94%), the resulting number of comparisons between potential duplicates is huge. For a 100_000-pair chunk it needs to do 58 million comparisons! Which is completely different from a dataset with a low duplication rate. Cython doesn't care about this, though; it's fast with any number of duplicates...
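The pattern being described is roughly the following (a simplified sketch, not the actual pairtools implementation; the metric and radius are chosen for illustration): query the tree on positions only, then filter the candidate pairs on the categorical columns, which is the step that blows up at high duplication rates.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n = 1000
# (pos1, pos2) coordinates drawn from a tiny range -> many collisions,
# mimicking a ~94%-duplicated chunk
pos = rng.integers(0, 50, size=(n, 2))
# Stand-in for the categorical columns (chrom/strand) filtered afterwards
chrom = rng.integers(0, 2, size=n)

tree = cKDTree(pos)
# Chebyshev radius query: candidates within the tolerance in every coordinate
candidates = tree.query_pairs(r=3, p=np.inf, output_type="ndarray")

# Post-filtering on the categorical columns; at high duplication the number
# of candidate pairs dwarfs n, and this indexing dominates the runtime
i, j = candidates[:, 0], candidates[:, 1]
dup_pairs = candidates[chrom[i] == chrom[j]]
```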

Phlya commented 1 year ago

Something is broken in our pyflake8 btw...

Phlya commented 1 year ago

@MasterChief1O7 please try this version with scipy backend if you want, it will be still slower than cython, but much faster than before :)

MasterChief1O7 commented 1 year ago

Just tried this version with scipy, and it is much, much faster than the previous version: it only took ~1.5 min rather than ~10 min with the old version. However, I still don't understand why there is this slight difference in the stats between scipy and cython?

Phlya commented 1 year ago

cython stats are a little broken now for non-standard columns... if you use rfrag_start1/2 instead of rfrag1/2 it'll be much better, btw

MasterChief1O7 commented 1 year ago

Oh ok, so the scipy ones are correct, right? And will the deduplication results be the same if I use rfrag_start1/2 instead of rfrag1/2? I guess in the case of max-mismatch = 0 yes, but not with any other value, right?

Phlya commented 1 year ago

Yes, scipy stats should be correct...

Indeed, with a non-zero mismatch, using rfrag will allow ± this number of restriction fragments, while with rfrag_start it'll be in base pairs.

Phlya commented 1 year ago

Actually, looking at the tests it seems like the non-cython backends have a problem now... I'm looking into it.

Phlya commented 1 year ago

Fixed now! @MasterChief1O7

Phlya commented 1 year ago

And also found a small preexisting bug where dedup was losing a small portion of pairs, fixed here...

Phlya commented 1 year ago

Actually there are still very strange discrepancies between cython and scipy/sklearn even in just mapping stats... which is bizarre!

cython

total   1498859
total_unmapped  50862
total_single_sided_mapped   63125
total_mapped    1384872
total_dups  1304062
total_nodups    80810
cis 71230
trans   9580

scipy

total   1498859
total_unmapped  50862
total_single_sided_mapped   62525
total_mapped    1385472
total_dups  1304062
total_nodups    81410
cis 71830
trans   9580

Phlya commented 1 year ago

And the pair types also differ just slightly...

[attached screenshot: pair type stats comparison, 2022-12-06]
golobor commented 8 months ago

@Phlya , sorry for a long break - what's the status of this? Is it ready to be merged?

Phlya commented 8 months ago

Oh I also don't remember any more @golobor . I think the functionality here was all good to go, but there were some weird things I discovered that are described above...

golobor commented 8 months ago

So, indeed, some lines become duplicated in the scipy engine:

(main) [anton.goloborodko@clip-login-0 test_dedup]$ cat pairs_sample.pairs.dedup_scipy.csv | grep NB551430:778:HNLWKBGXM:3:12508:20725:4668
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019

Phlya commented 8 months ago

Always? Or after this PR?