lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
210 stars 39 forks source link

Verbose, the process of running #86

Closed ypriverol closed 1 year ago

ypriverol commented 1 year ago

@lazear We have started to implement SAGE in quantms. We are running the first iteration and I observed that SAGE do not have any verbose option, doesn't need to be big but would be nice to have some messages that describe where SAGE is on each step. If those messages are there, but we are not getting it, let us know How I can get them.

Regards Yasset

lazear commented 1 year ago

Hi Yasset, that is exciting news!

Can you provide more details on what you would like to see?

The design of Sage (reading multiple files in parallel, and potentially cycling through multiple batches) makes it somewhat difficult to report progress (each thread will process spectra from different files intermingled - there is no concept of a "file" being completely finished, because they are not searched one-at-a-time). Incrementing a counter then needs to be atomic operations, and an additional thread is needed to check progress, lock a handle to stdout, and write progress. An early version of Sage actually included a progress bar... but it had a massive impact on performance. ^ All of that goes to say that I am open to the idea of providing more verbose output, but only if it does not have a significant performance impact. E.g. if sacrificing a thread just to report search progress is what is needed (especially when searches are so fast), it will not happen.

In the mean time, some additional information can be provided by running Sage with the SAGE_LOG environment variable set to "trace" (export SAGE_LOG=trace)

ypriverol commented 1 year ago

Well, I was thinking and independently of the parallelization, would be great to have some kind of counter of number of spectra search. Something like every 10k spectra you show something like:

100 spectra searched... 
10k spectra searched.. 
20k spectra searched...

The main idea is to give some sense to other tools to know that something is happening. If you dont output anything if my job for example is running for 2hs I don't know what is going on, if you are still searching, if you are having problems with writing or any other problem.

lazear commented 1 year ago

I will investigate if it's possible.

On a side note - the actual search stage is an infallible step (from a hard-error point of view). Nothing can go wrong, because there is no IO, and Rust is memory safe 😃.

Therefore, errors can only occur before or after searching (when IO happens), so if the job is still running, then it's because it's still searching. Sage will also not crash if mzML file paths are not present - it will report that they are missing and continue. Even passing a non-mzML (e.g. a random text file) will not cause a crash.

ypriverol commented 1 year ago

The problem is that we are running sage over a conda -> docker -> singularity, mainly running singularity. All jobs that we have run so far on multiple files (max 10) get frozen for hours which doesn't look correct, then because we don't have any logging we can't see what is happening.

lazear commented 1 year ago

If you're running a standard narrow-tolerance search, I expect something is very messed up - typically I see somewhere between 1-5k spectra per CPU-second depending on parameters. Have you checked resource usage (htop, etc?). How many CPU cores are being provisioned? How much RAM?

lazear commented 1 year ago

OK, looks like just reporting spectra and not having a progress bar can be done with little/no impact. Is something like the below acceptable?

[2023-09-07T18:56:26Z TRACE sage] - PXD001468/b1906_293T_proteinID_01A_QE3_122212.mzML: read 67599 spectra
[2023-09-07T18:56:27Z INFO  sage]  - file IO:     9251 ms
[2023-09-07T18:56:28Z TRACE sage]  - searched 10000 spectra (12820 spectra/s)
[2023-09-07T18:56:28Z TRACE sage]  - searched 20000 spectra (16806 spectra/s)
[2023-09-07T18:56:29Z TRACE sage]  - searched 30000 spectra (18633 spectra/s)
[2023-09-07T18:56:29Z TRACE sage]  - searched 40000 spectra (20140 spectra/s)
[2023-09-07T18:56:29Z INFO  sage]  - search:      2085 ms (20017 spectra/s)
ypriverol commented 1 year ago

That looks absolutely great, you can enable that through a variable verbose in case we want more control over it.

lazear commented 1 year ago

Does the environment variable approach work? That is the easiest for me to implement, since it hooks into the logging package I'm using

ypriverol commented 1 year ago

Im testing now.