hasindu2008 / squigulator

a tool for simulating nanopore raw signal data
https://hasindu2008.github.io/squigulator
MIT License
58 stars 3 forks source link

Add `end_reason` auxillary field #13

Open esteinig opened 5 months ago

esteinig commented 5 months ago

Thanks so much for making such a useful tool @hasindu2008 et al. - it's been a delight to work with!

I unfortunately needed to do a Blow5 to Pod5 (...) conversion with blue-crab ( :heart: @Psy-Fer ) to use outputs in the latest Dorado version (0.5.3). However, the conversion broke because of the missing end_reason field. I simply added header and field to the Slow5, which enables the conversion.

This is really just for convenience, but would it be possible to add the end_reason header and field e.g. with a signal_positive enum variant to the output? I am unfortunately not familiar enough with C to add the field myself, but more than happy to open a PR if you can point me in the right direction :)

hasindu2008 commented 5 months ago

Would you be able to use buttery-eel with dorado-server? https://github.com/Psy-Fer/buttery-eel This is how we have been using Dorado recently with S/BLOW5.

But yes, I can add a signal_positive to squigulator. This end_reason field in FAST5 used to change too many times in FAST5 and wasn't always reflective of the actual end reason, so that is why I did not include it in Squigulator. But given that now in POD5 end_reason has seemingly become a mandatory field, I will add it soon

esteinig commented 5 months ago

Thanks so much Hasindu. It's not a rush, mostly for convenience to be honest! I suppose it was a good motivator to familiarize myself with the slow5-rs library, which has been quite delightful to work with as well.

I did look at buttery-eel, but unfortunately it was a little too complex to wrap into my code base - it looks great otherwise! I have been working much more with Slow5/Blow5 recently. Thanks to all of you for making the effort, it's been a nightmare to try and do the same with POD5.

hasindu2008 commented 5 months ago

I have very limited rust knowledge and slow5-rs is kindly developed and maintained by @bsaintjo who seems to have done a phenomenal job. Kudos to @bsaintjo!

Yeh, simplify and usability is a key reason other than performance and file size, that I put effort into BLOW5 as described here. It is encouraging to hear that the effort is being paid off. I myself saved a lot of time by not having to spend time to fix each and every signal analysis tool that I developed, wherever a breaking change happens in POD5.

@Psy-Fer is developing buttery-eel and we are open to hearing about complexities you encountered, and see if we can simplify it if possible (except complexities in the basecall-client from ONT that we do not have control over). One method is to provide buttery-eel with the latest dorado-server as a binary package which I call a snakeball, something I am keen to explore if there is a potential usecase, for instance this.

esteinig commented 5 months ago

Absolutely, kudos to @bsaintjo - really nice design patterns, learning a lot about how to handle C library integration!

Really interesting idea to snakeball the dorado-server binary with buttery-eel. Do you prefer the servers because of relative stability of the API rather than tracking changes to POD5 and Dorado? I suppose my approach is fairly... purist, so it's not so much a complexity from your end (amazing work) but more a desire to bundle everything into a shippable binary with minimal overhead ^^

On that note, I have long been wondering if there is a specification for the Guppy or Dorado server API that the underlying C (?) and Python client libraries use. I would love to cut these out from another code base that otherwise would not require the client package dependencies and could be compiled without much overhead.

Psy-Fer commented 5 months ago

Hey,

Yea documentation for the API you have to pull from the API itself. As for getting the builds, they are distributed from minknow builds, and then get shipped as a python lib on pypi that hook into the Dorado server API calls through a ipc/tcp client connection. It's not all that fun to develop, and frankly buttery-eel shouldn't exist. It's why has such a silly name. Ont should have just adopted slow5. But oh well.

Buttery-eel can do whatever minknow live basecalling can do. So while standalone Dorado is nice, it's not really "production ready" till it ports into minknow. This has kept buttery-eel pretty reliable, especially now they have decoupled Dorado development and Dorado server releases.

esteinig commented 5 months ago

Hey James, long time, nice to get back in touch after me disappearing from social media completely :) I do miss your SciFi book recommendations ^^

I thought as much, will have to dig into the libraries at some stage - gave up the last time it came up in the interest of time, totally understand the "not fun to develop" part. Hard agree on the Slow5 adoption.

Really cool work with wrapping it all up with buttery-eel though (and a great name). Blue-crab is also saving me from terrible POD5 hacking in Rust. I suppose for my current application it's enough to wrap non-production grade dorado for simplicity, but will keep it in mind - especially now that my tools are becoming more Slow5/Blow5 compatible!

emics57 commented 3 months ago

Hello, I am also trying to convert the BLOW5 file to a POD5 using blue-crab, but I think I am getting the same error:

09-Apr-24 10:42:50 - blue-crab - [INFO]: single2single: 1 s/blow5 file detected as input. Writing 1:1 s/blow5->pod5 to file: out_signal.pod5
09-Apr-24 10:42:50 - blue-crab - [INFO]: Opening s/blow5 file: out_signal.blow5
[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1458
09-Apr-24 10:42:50 - pyslow5 - [WARNING]: get_aux_enum_labels enum_labels is NULL

@esteinig I was wondering how did you add the header and field to the BLOW5 file to solve this issue?

esteinig commented 3 months ago

Hey @emics57 I can put a quick binary with converter together for you. Are you running this on a Linux system?

emics57 commented 3 months ago

Much appreciated!! Yes, on a Linux system.

hasindu2008 commented 3 months ago

Also an update from @Psy-Fer, he has implemented blue-crab to set the end_reason to uknown if it doesn't exist and proceed with the conversion. See https://github.com/Psy-Fer/blue-crab/issues/9. This way it not only will support squigulator-generated BLOW5 conversions, but also any BLOW5s converted from old FAST5 files when this end_reason attribute never existed. In the next release, I am going to implement the end_reason into squigulator too.

esteinig commented 3 months ago

Ah thanks @hasindu2008 @Psy-Fer better than running some sketchy binary 👀 I'll leave you with the update then!

Psy-Fer commented 3 months ago

Yea be sure to try the dev branch to get that functionality while still being tested. I'll do a release and update pypi once the extensive testing has been done.

hasindu2008 commented 3 months ago

OK, just cross reference as this discussion is scattered to blue-crab too https://github.com/Psy-Fer/blue-crab/issues/9

hasindu2008 commented 3 months ago

@esteinig @emics57

If you compile squigulator from the dev branch, and specify the option --ont-friendly=yes it should be pod5 conversion compatible.

When you specify --ont-friendly=yes it will add a dummy end_reason and create fake UUID for read IDs so.

If you encounter issues let me know, thanks.

esteinig commented 3 months ago

Fantastic, very much appreciated @hasindu2008 and @Psy-Fer

emics57 commented 3 months ago

@hasindu2008, Thank you for the quick update!! The conversion is working for me now when using squigulator from the dev branch!