nanoporetech / pore-c

Pore-C support
Mozilla Public License 2.0
52 stars 5 forks source link

Why use ncls + pandas instead of pyranges? #31

Closed endrebak closed 4 years ago

endrebak commented 4 years ago

Just curious.

Also, here you use the function first_overlap_both:

https://github.com/nanoporetech/pore-c/blob/master/src/pore_c/model.py#L339

Is any overlapping interval good? I think the naming is confusing. It is certain to give the longest overlap with the leftmost start coordinate. Just FYI

eharr commented 4 years ago

@endrebak, firstly thanks for both ncls and pyranges - they're super-useful packages.

It's been a while since I added the ncls dependency, but from what I remember the pyranges conda package came with a very large dependency list some of which conflicted with versions of packages that I was using. Looking at the bioconda recipe now it seems like that would no longer be an issue, so it would probably be best to replace the overlap logic with pyranges directly. I've created an issue (#32) to track this.

Is any overlapping interval good? I think the naming is confusing. It is certain to give the longest overlap with the leftmost start coordinate. Just FYI

For each alignment interval on the pore-C read we need to assign a single restriction fragment. We settled on selecting the left-most fragment, but there might be better ways of doing this.

endrebak commented 4 years ago

Yeah, I have actually reduced the number of dependencies in pyranges and am hesitant to add more. There are certain functionalities which depend on specific packages (fisher for fast fisher exact, pybigwig for writing to bigwig etc.), but these are optional dependencies now to reduce the risk of confilicting packages or the install failing due to a failing dependency.

Thanks btw :)

endrebak commented 4 years ago

Also, pyranges is multicpu for all functions and can easily run unary and binary functions you define yourself in parallel. And it has nearest/k-nearest queries. I actually made the NCLS a distinct package for the non-genomicists (but of course I do not mind if people prefer to roll their own solutions with NCLS and pandas.)

On Tue, Nov 12, 2019 at 4:33 PM Eoghan Harrington notifications@github.com wrote:

@endrebak https://github.com/endrebak, firstly thanks for both ncls and pyranges - they're super-useful packages.

It's been a while since I added the ncls dependency, but from what I remember the pyranges conda package came with a very large dependency list some of which conflicted with versions of packages that I was using. Looking at the bioconda recipe now it seems like that would no longer be an issue, so it would probably be best to replace the overlap logic with pyranges directly. I've created an issue (#32 https://github.com/nanoporetech/pore-c/issues/32) to track this.

Is any overlapping interval good? I think the naming is confusing. It is certain to give the longest overlap with the leftmost start coordinate. Just FYI

For each alignment interval on the pore-C read we need to assign a single restriction fragment. We settled on selecting the left-most fragment, but there might be better ways of doing this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nanoporetech/pore-c/issues/31?email_source=notifications&email_token=AEHURUXZZRKTM7BDV7ZFWU3QTLEE3A5CNFSM4JLPG2T2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED2UO6I#issuecomment-552945529, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHURUV2NQJOD6PJCGQ75Q3QTLEE3ANCNFSM4JLPG2TQ .