nf-core / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://nf-co.re/quantms
MIT License

Allow passing through `raw` files if DIA is used #64

Closed · jpfeuffer closed this issue 1 year ago

jpfeuffer commented 2 years ago

Description of feature

Currently we are always converting to mzML, although some tools allow reading raw files directly (e.g. DIA-NN, and I think even Comet). We only need to make sure that the mzML statistics module can also read raw files; there are Thermo raw parsing libraries for Python out there. Or we convert just for QC while DIA-NN is running and discard the intermediate results.
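
A minimal Nextflow sketch of the pass-through idea, assuming a channel of (meta, file) tuples and an acquisition-method field in the metadata (the names here are made up, not the pipeline's actual ones):

```nextflow
// Hypothetical channel wiring: DIA raw files skip conversion, everything
// else keeps going through the existing mzML conversion step.
ch_ms_files
    .branch { meta, ms_file ->
        passthrough: meta.acquisition_method == 'dia' && ms_file.name.toLowerCase().endsWith('.raw')
        convert: true
    }
    .set { ch_branched }

// ch_branched.passthrough -> feed directly to DIA-NN
// ch_branched.convert     -> existing mzML conversion step
```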

daichengxin commented 2 years ago

When running on Linux (native builds, not Wine), only .d, .mzML, and .dia data are supported for DIA-NN.

jspaezp commented 1 year ago

Hello! I am sorry to resurrect this issue, but I am working on a fork that enables Bruker data support (for DIA) and was wondering if there is any interest in adopting the feature?

For this, it actually bypasses the .mzML conversion for DIA-NN, using the .d directly.

Let me know if you would like the PR and we can open an issue discussing the design choices to be made for the workflow (arguments/supported flows/error messaging ...)!

@ypriverol @daichengxin @jpfeuffer

jpfeuffer commented 1 year ago

Hi! Cool, yes, sure we would be interested. Some points to consider:

  1. Clear error messages if this is not supported on a specific platform, including in the docs.
  2. Should work with tools that are on Bioconda (no idea about the vendor licenses for the libraries that read their formats).
  3. We are extracting statistics from mzML for QC. You could (with increasing preference but also increasing difficulty):
    • a. Completely disable the QC report and warn (see the sketch below)
    • b. Still convert to mzML for QC
    • c. Use some package to extract these stats directly from the .d files
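
A minimal sketch of option (a), with made-up variable names (not the pipeline's actual logic):

```nextflow
// Hypothetical guard: if any input is a Bruker .d (or tarred .d) folder,
// warn and skip the mzML-based QC report.
def has_bruker_d = ms_files.any { f ->
    f.name.endsWith('.d') || f.name.endsWith('.d.tar')
}
if (has_bruker_d) {
    log.warn 'Bruker .d input detected: mzML-based QC statistics are unavailable, skipping the QC report.'
}
```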
jspaezp commented 1 year ago

Right now it does convert to mzML for QC (in the data prep stage) but passes the .d to the rest of the pipeline (option B). This works fine for DIA, but more likely than not it will break every DDA aspect of the pipeline (which I have not tested).

Here are some more concrete things to consider.

  1. So the options I see are to EITHER
    • Disable .d + DDA (check if there are .d files labelled for DDA and stop if so), OR
    • Check whether the downstream DDA tools support direct .d reading and enable/disable them accordingly.
  2. The tool I am using for .mzML conversion is distributed as a Docker image. I am aware that one can convert images to Singularity containers; is there anyone who could help me implement that? (Same for Conda. The underlying tool is https://github.com/mafreitas/tdf2mzml, which is pure Python plus linked libraries; we could talk with the author about adding Conda and Singularity distributions, and I am also open to alternatives.) I just noticed it does not have a license, but I would not be surprised if it were this same license: https://github.com/MannLabs/alphatims/blob/master/LICENSE-THIRD-PARTY.txt (so it should be pretty liberal...).
  3. My specific use case REQUIRES supporting tar-archived .d (.d.tar) files [technically a directory]. Right now it is the only implemented path, but it would be very easy to enable raw .d files as well (see the sketch after this list). I noticed that .mzML.gz files are in general not supported in the pipeline, so I am not sure if this pattern would fit; would this additional branching step be OK with the project?
  4. Last but not least, I am not 100% sure how testing is handled here. Would you mind pointing me to documentation on how to add tests?
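
A minimal sketch of the tar-handling branch from point 3, with illustrative names only:

```nextflow
// Hypothetical process: unpack a tar-archived Bruker .d folder and emit
// the contained .d directory for downstream consumption.
process UNTAR_BRUKER_D {
    input:
    tuple val(meta), path(d_tar)

    output:
    tuple val(meta), path("*.d")

    script:
    """
    tar -xf ${d_tar}
    """
}
```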

On a marginally related note: since predicting the spectral library is a bottleneck step in the analysis process, and in theory the result should be the same if the same .fasta and digestion/modification/fragmentation parameters are used, it would be great to have a 'diann_speclib' parameter that uses a given spectral library instead of predicting one (I can also make this happen in a different PR). A sketch follows below.
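
Something like this, where the process and channel names are placeholders and only diann_speclib is the proposed parameter:

```nextflow
// Hypothetical wiring for a user-supplied spectral library.
if (params.diann_speclib) {
    ch_speclib = Channel.fromPath(params.diann_speclib, checkIfExists: true)
} else {
    DIANN_PREDICT_SPECLIB(ch_fasta)   // placeholder for the current prediction step
    ch_speclib = DIANN_PREDICT_SPECLIB.out.speclib
}
```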

Best

jpfeuffer commented 1 year ago

Hi!

Okay, option B sounds good.

  1. The first option sounds good for now, too. In the long run, we can try to pass the converted mzML files to the rest of the pipeline in the DDA case; it should not be too hard to change this logic. Or maybe OpenMS will start supporting these files directly in the future, too.
  2. Yes, a Conda package would be great. Otherwise, you would need to warn that this is container-only, which is a bit against the nf-core guidelines; I would need to check what the policy is for non-Conda tools. Conversion to Singularity should be automatic: you can just give Singularity the Docker URLs.
  3. Yes, I think it would be great to support both folders and files. Using a tar'd folder as input is the way to go for folders, I would say. I have not passed folders to Nextflow yet, so I am not sure if that would work out well. I know that the DDA part should support mzML.gz in theory; I am not sure whether the Nextflow logic handles it correctly, though. I am not sure about DIA-NN either.
  4. You basically need to upload the test data (or send me the files and I will upload them). Then you need to put an SDRF or experimental design on the test-data repo branch: https://github.com/nf-core/test-datasets/tree/quantms. Next, create a test profile that uses this data, modeled on https://github.com/nf-core/quantms/blob/master/conf/test_dia.config (e.g. test_dia_bruker; see the sketch below). Lastly, enable the test config that uses these inputs in the GitHub Actions workflow: https://github.com/nf-core/quantms/blob/master/.github/workflows/ci.yml#L37
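
A hypothetical test_dia_bruker profile, modeled on the structure of test_dia.config (the input URL is a placeholder until the files are uploaded):

```nextflow
// conf/test_dia_bruker.config (hypothetical)
params {
    config_profile_name        = 'Bruker DIA test profile'
    config_profile_description = 'Minimal test dataset to check the Bruker .d DIA path'
    // placeholder: SDRF/design file on the quantms branch of nf-core/test-datasets
    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/quantms/<bruker-dia-design>.sdrf.tsv'
}
```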

Your related note also sounds reasonable as a separate PR. Do we output the predicted spectral library as a final result? If not, we should. Would it make sense to also add a "predict_only" parameter, in which case the pipeline stops after prediction? (It might be a bit weird, though, because then the main required input would not be used and would no longer be unconditionally required ;) ).
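
If we went that way, the wiring could be as simple as gating the downstream steps (placeholder names again, matching the sketch above):

```nextflow
// Hypothetical predict-only mode: always run library prediction, and only
// continue with the search when predict_only is not set.
DIANN_PREDICT_SPECLIB(ch_fasta)
if (!params.predict_only) {
    DIANN_SEARCH(ch_ms_files, DIANN_PREDICT_SPECLIB.out.speclib)
}
```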

ypriverol commented 1 year ago

First, welcome to the project, @jspaezp. I agree with @jpfeuffer's points.

You can work with the .d files; there is no need to convert to mzML. Actually, the mzMLs for pmultiqc were replaced in the latest version by the mzML statistics files generated in the following step: https://github.com/bigbio/quantms/blob/dev/modules/local/mzmlstatistics/main.nf. You don't need to extend it in the first iteration, but as a next step you can extend mzmlstatistics to consume .d files and produce the same output. With this we will have a full pmultiqc report. We are happy to work with .d files; in discussions with Vadim, he mentioned that some mzMLs will not work well with DIA-NN, so .d is fine.

Happy to have you onboard!!!

jspaezp commented 1 year ago

Hello @ypriverol, thanks a lot!

  1. I don't think I understand what you mean about changing the stats step (the change might have been made after I forked the repo).
  2. I agree; in my experience, depending on how the mzML is generated, it might either not retain all the information or not retain all the connections in the underlying data, making the data a lot harder for the search engine to use (which is why I am adding this feature :P)
  3. The last thing I wanted to ask was on the documentation side of things. Could you point me to where the long-form docs are located? (Is the https://github.com/bigbio/quantms/blob/readthedocs/docs/formats.rst branch auto-generated, or do I open a separate PR there to update those docs? On a similar matter, for contributions, should I PR against bigbio/dev or nf-core/dev?)

(Update: all of the search aspects and DIA-NN-related outputs work fine right now, but I am patching the file conversion aspect, since in some places the suffix does not match due to it being changed... somewhere... to .mzML in the experimental design.)

Best!

ypriverol commented 1 year ago

> Hello @ypriverol, thanks a lot!
>
> 1. I don't think I understand what you mean about changing the stats step (the change might have been made after I forked the repo).

For the pmultiqc reports we generate at the end of the workflow, we use the mzMLs to detect the number of MS2 scans, the % of identified MS2, charge states, etc. In the step I mentioned, we read every mzML and generate a TSV with the statistics needed by pmultiqc. For the first implementation you are doing with .d, we don't need to implement that (pmultiqc should then be disabled by default), but in future implementations we should enable .d parsing. @WangHong007 @daichengxin am I correct here?
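
To make that future step concrete, here is a rough sketch of how the statistics step could dispatch on the input type (script names are placeholders, not the real module's contents):

```nextflow
// Hypothetical extension of the statistics step: .d folders get their own
// parser while mzML keeps the current path; both emit the same TSV schema.
process MS_STATISTICS {
    input:
    tuple val(meta), path(ms_file)

    output:
    tuple val(meta), path("*_ms_info.tsv")

    script:
    def parser = ms_file.name.endsWith('.d') ? 'bruker_d_statistics.py' : 'mzml_statistics.py'
    """
    ${parser} ${ms_file} ${ms_file.baseName}_ms_info.tsv
    """
}
```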

> 2. I agree; in my experience, depending on how the mzML is generated, it might either not retain all the information or not retain all the connections in the underlying data, making the data a lot harder for the search engine to use (which is why I am adding this feature :P)

👍

> 3. The last thing I wanted to ask was on the documentation side of things. Could you point me to where the long-form docs are located? (Is the https://github.com/bigbio/quantms/blob/readthedocs/docs/formats.rst branch auto-generated, or do I open a separate PR there to update those docs? On a similar matter, for contributions, should I PR against bigbio/dev or nf-core/dev?)

All the documentation is hosted in our bigbio/quantms repo on the readthedocs branch (https://github.com/bigbio/quantms/tree/readthedocs), so any change to our tutorials, explanations, etc. should be made there. If you open a PR against that branch, the documentation will be updated automatically.

We use both of them, bigbio/dev and nf-core/dev. In bigbio/dev we have more freedom to merge, discuss results, etc. Also, bigbio is the organization where other libraries of the pipeline, such as pmultiqc and sdrf-pipelines, are stored, which makes it easier to share users, maintainers, etc. That organization's repo also stores the docs hosted on Read the Docs.

> (Update: all of the search aspects and DIA-NN-related outputs work fine right now, but I am patching the file conversion aspect, since in some places the suffix does not match due to it being changed... somewhere... to .mzML in the experimental design.)

❤️

Best!