lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
201 stars 38 forks source link

Feature request: move mod input, mod runner, mod output into sage_core #114

Closed neonelemental closed 5 months ago

neonelemental commented 5 months ago

Hello Michael,

I've been experimenting with running sage using a rust API built on tower/tokio/axum. I am in a good place with a fork, but to get things semi-working, I had to move some code around as a proof-of-concept that I could get this library running in a web app environment.

I wanted to get your thoughts on moving some of the building blocks of the CLI into core, or alternatively publishing a lib target for the sage-cli crate (the path I took on my fork, as it was a little less work than refactoring the codebase).

lazear commented 5 months ago

All of the IO stuff was purposefully moved out of sage_core to enable compiling it for targets that don't support tokio (like WASM, https://github.com/lazear/sage/issues/77) - this has the side benefit of making the sage_core library pure from an IO perspective, but does complicate things.

In light of this, publishing a lib target for the cli crate is the best path forward. I would caution that this might not be the best way to interface with axum - sage_core is fully synchronous and blocking, and designed to use as much compute resources as possible (maximizing RAM and CPU usage).

neonelemental commented 5 months ago

Thanks for your reply. I did notice that it is blocking, and had to wrap it in a tokio task to get it working with axum. I spent about 3 days just hacking something together as a proof of concept.

I appreciate the information with regards to memory and cpu usage. If you are open to contributions, my team would be interested in implementing quality-of-life features for a web based environment.

Let me know if you have any thoughts about that -- maybe just maintaining a fork on our end is okay but we would like to give back, since this is such a cool and impressive library. Just not sure in what capacity you'd accept those kinds of contributions.

neonelemental commented 5 months ago

From that linked ticket #77

We could either put tokio behind a feature flag, or just lift the mzML parsing and fasta parsing out of the sage_core crate entirely - I think this might be the cleanest/most ideologically pure way to do it. I will play around with this when I get some free time and see how it feels - please reach out when you are ready and we can have a quick chat to figure out any other internal API changes that might need to be made.

I think it makes sense to add a feature flag for tokio support. I would be interested in helping implement that if you're open to something like that.

lazear commented 5 months ago

If you are open to contributions, my team would be interested in implementing quality-of-life features for a web based environment. Let me know if you have any thoughts about that -- maybe just maintaining a fork on our end is okay but we would like to give back, since this is such a cool and impressive library. Just not sure in what capacity you'd accept those kinds of contributions.

What kind of QoL features do you have in mind? I have to imagine that wanting to directly interface with the rust codebase (as opposed to say, spinning off a background job) from a web-based environment is somewhat niche, but I am open to accepting contributions if they make sense! For instance, building the main executable as a library and just providing a bin\main.rs seems like a straightforward move that doesn't make the codebase harder to work with for the majority use case.

I think it makes sense to add a feature flag for tokio support. I would be interested in helping implement that if you're open to something like that.

At this point, all of the mzML parsing and IO stuff has already been lifted out - it's somewhat more painful to work with now, but it does give a clear separation of concerns. I'm not sure if it's worth the hassle to move that code back into the sage_core library

neonelemental commented 5 months ago

Hey, apologies for the late reply. Working on this proof-of-concept is a side project I pick up and put down every few days. I need to ramp back up on what I was doing before I posted this and see if I can give you a good answer to your first question.

On the second part, that makes sense. I'm very new to Rust, so a lot of this is new to me, especially how Rust supports async operations.

Will post back soon. Thanks for your reply, Michael!

lazear commented 5 months ago

No rush on my part :)

neonelemental commented 5 months ago

I was able to jump back in today, and remembered that you mentioned that Sage is thirsty for RAM and CPU. So I cranked the settings on my Docker container to use 60gb of ram and 8 cpus and I started to get output from Sage on smaller MZML files and FASTA files.

I am going to close this for now, because I think this is a successful POC for what we wanted to achieve.

This is an awesome tool, and I want to thank you for your time and effort. I am sure I will be back with more questions/potential contributions. For now we are going to put a pin in our POC since we are prioritizing other features with what we are building.

neonelemental commented 5 months ago

Ah, one last thing @lazear ...

I ended up having to add a couple of things to our fork to support the API proof of concept.

First, I added a from_json to the impl of the Input struct to support setup using JSON, since our API is taking the CLI's JSON format over the network rather than reading it from a file. We might be able to write to a temp file or something to get around that, but it might be useful to be able to just generate the Input using a JSON string. If you're open to that idea, I would be more than happy to open a PR to add this function.

And lastly, I published the CLI crate so that it could be added to the Cargo.toml. If you're open to adding this, we could move off our fork.

Here's the changes I made:

/// crates/sage-cli/src/input.rs

pub fn from_json(json_content: String) -> anyhow::Result<Self> {
        return Ok(
            serde_json::from_str(&json_content).with_context(|| "Failed to parse input JSON")?
        );
    }

and then added a lib.rs with the pub mod for the modules in the crate and updated the toml:


# crates/sage-cli/Cargo.toml

[lib]
name = "sage"
path = "src/lib.rs"

[[bin]]
name = "sage"
path = "src/bin/sage.rs"

I'm very new to Rust, so there are likely better ways to implement these things. Let me know what you think!

lazear commented 5 months ago

mentioned that Sage is thirsty for RAM and CPU. So I cranked the settings on my Docker container to use 60gb of ram and 8 cpus and I started to get output from Sage on smaller MZML files and FASTA files.

Just to clarify, you can certainly run it with fewer resource (see https://sage-docs.vercel.app/ci for example RAM usage) - Sage is just designed to hit 100% CPU utilization, so it might be incompatible with the philosophy of async web servers.

And lastly, I published the CLI crate so that it could be added to the Cargo.toml What do you mean by published?

Overall, the changes above seem reasonable!

neonelemental commented 5 months ago

By published I just mean that you can use it as a crate in a project like this:

[dependencies]
sage-cli = { git = "https://github.com/mass-matrix/sage.git", branch = "mfk/create-sage-cli-lib" }

Ideally, I could swap that dependency for this:

sage-cli = { git = "https://github.com/lazear/sage.git", tag = "x.x.x" }