nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
116 stars 6 forks source link

API for modkit extract read_calls_path output? #194

Closed billytcl closed 2 weeks ago

billytcl commented 1 month ago

Is there an API to get the output of modkit extract's read_calls_path (eg, mod calls after thresholding) to get per read tables in an iterator format?

The reason why I want to do this is because I'm working with some huge datasets, and parsing on a per-read basis is really prohibitively slow (also uses a crazy amount of disk space). It would be way easier to parse it on the fly in Python by iterating per read.

Considering that it writes on a per read basis anyway, do you have a code example where I can capture the output and do all the parsing live instead of writing it to disk? I'm imaging per read to output a table, then doing something to it, then iterating to the next read.

ArtRand commented 3 weeks ago

Hello @billytcl,

I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up modkit extract. Depending on what you're doing, maybe I can add it as a filter on the output.

billytcl commented 3 weeks ago

Thanks! It would be useful if I can get them already thresholded!

On Mon, Jun 3, 2024 at 8:41 AM Arthur Rand @.***> wrote:

Hello @billytcl https://github.com/billytcl,

I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up modkit extract. Depending on what you're doing, maybe I can add it as a filter on the output.

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/modkit/issues/194#issuecomment-2145542613, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPHYT5EPKZ35ERCGJUMNB3ZFSFC7AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGU2DENRRGM . You are receiving this because you were mentioned.Message ID: @.***>

ArtRand commented 3 weeks ago

@billytcl by "already thresholded" you mean just emit the calls where the fail coluimn is false?

billytcl commented 3 weeks ago

Yes that’s right!

On Mon, Jun 3, 2024 at 9:56 AM Arthur Rand @.***> wrote:

@billytcl https://github.com/billytcl by "already thresholded" you mean just emit the calls where the fail coluimn is false?

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/modkit/issues/194#issuecomment-2145698674, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPHYTYCMCNWOXMOFSSBAQTZFSN25AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGY4TQNRXGQ . You are receiving this because you were mentioned.Message ID: @.***>

ArtRand commented 3 weeks ago

Hello @billytcl,

Please find attached a build that allows the following command that will stream the "read calls" table to stdout:

$ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header]

Perhaps self-explanatory but --pass will restrict the read calls table to only "passing" (i.e. thresholded) calls and --no-header will omit the first header line (sometimes this makes parsing easier).

These changes will make it into the release later this week as well. modkit_dev1393012_centos7_x86_64.tar.gz

billytcl commented 3 weeks ago

Thank you! Do you have any ideas on how to effectively parse it programmatically as an iterator in Python? Perhaps I can read line by line on stdin and keep track of the read name? Or do you have a better idea?

Billy

On Tue, Jun 4, 2024 at 3:19 PM Arthur Rand @.***> wrote:

Hello @billytcl https://github.com/billytcl,

Please find attached a build that allows the following command that will stream the "read calls" table to stdout:

$ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header]

Perhaps self-explanatory but --pass will restrict the read calls table to only "passing" (i.e. thresholded) calls and --no-header will omit the first header line (sometimes this makes parsing easier).

These changes will make it into the release later this week as well.

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/modkit/issues/194#issuecomment-2148504750, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPHYT5664KYYP2I6I3VKJLZFY4O3AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBYGUYDINZVGA . You are receiving this because you were mentioned.Message ID: @.***>

ArtRand commented 3 weeks ago

@billytcl

That's basically what I do, aggregate the lines together until the read_id changes. If I'm doing an expensive computation on the batch of lines, I send that batch on a channel to another thread pool to process in parallel, then send the results to an aggregator. Feel free to email me art.rand[at]nanoporetech.com.

ArtRand commented 2 weeks ago

@billytcl this change has been released in v0.3.1rc1