Closed jpfeuffer closed 1 year ago
When running on Linux (native builds, not Wine), only .d, .mzML, and .dia data are supported for DIA-NN.
Hello! I am sorry to resurrect this issue, but I am working on a fork that enables Bruker data support (for DIA) and was wondering if there is any interest in adopting the feature?
For this, it actually bypasses the mzML conversion for DIA-NN and uses the .d directory directly.
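To make the "bypass mzML" idea concrete, here is a minimal sketch of a DIA-NN invocation that takes the Bruker .d directory as its input run. `--f`, `--fasta`, and `--out` are standard DIA-NN flags; the rest of the real pipeline command (library, thresholds, etc.) is omitted here.

```python
def diann_search_args(run_path, fasta, out_report):
    """Sketch: pass a Bruker .d directory straight to DIA-NN with no
    intermediate mzML conversion. Only a subset of the real command
    line is shown."""
    return ["diann", "--f", run_path, "--fasta", fasta, "--out", out_report]

cmd = diann_search_args("sample.d", "proteome.fasta", "report.tsv")
```

One practical wrinkle worth noting for the pipeline: a .d input is a directory, not a file, so any staging or existence checks must treat it as such.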
Let me know if you would like the PR and we can open an issue discussing the design choices to be made for the workflow (arguments/supported flows/error messaging ...)!
@ypriverol @daichengxin @jpfeuffer
Hi! Cool, yes, sure we would be interested. Some points to consider:
Right now it does convert to mzML for QC (in the data prep stage) but passes the .d to the rest of the pipeline (option B). This works fine for DIA, but more likely than not it will break every DDA aspect of the pipeline (which I have not tested).
Here are some more concrete things to consider.
On a marginally related note: since predicting the spectral library is a bottleneck step in the analysis process and should in theory be identical whenever the same .fasta and digestion/modification/fragmentation parameters are used, it would be great to have a 'diann_speclib' parameter that would use that speclib instead of predicting one (I can also make this happen in a different PR).
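The proposed parameter could be a simple branch like the sketch below. The `diann_speclib` name is the suggestion from this thread (not an existing quantms option); `--lib`, `--fasta-search`, and `--predictor` are real DIA-NN flags for using a supplied library versus predicting one in silico from the FASTA.

```python
def speclib_args(fasta, diann_speclib=None):
    """Hypothetical 'diann_speclib' parameter: reuse a precomputed
    predicted library when one is supplied, otherwise fall back to
    in-silico prediction from the FASTA."""
    if diann_speclib:
        return ["--lib", diann_speclib]
    return ["--fasta", fasta, "--fasta-search", "--predictor"]
```

Reuse is only valid when the FASTA and digestion/mod/fragmentation settings match, so the pipeline would likely want to document (or check) that precondition.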
Best
Hi!
Okay, option B sounds good.
Your related note also sounds reasonable as a separate PR. Do we output the predicted spectral lib as a final result? If not, we should. Would it make sense to also add a "predict_only" parameter, in which case the pipeline stops after prediction? (It might be a bit weird, though, because then the main required input
would not be used and would no longer be unconditionally required ;).)
First, welcome to the project @jspaezp. Agree with @jpfeuffer points.
You can work with the .d files, no need to convert to mzML. Actually, the mzMLs for pmultiqc were replaced in the latest version by the mzML statistics files that are generated in the following step: https://github.com/bigbio/quantms/blob/dev/modules/local/mzmlstatistics/main.nf. You don't need to extend it in the first iteration, but in a next step you can extend mzmlstatistics to consume .d files and produce the same output. With this we will have a full pmultiqc report. We are happy to work with .d files, and in discussions with Vadim he mentioned that some mzMLs will not work well with DIA-NN, so .d
is fine.
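For orientation, the kind of per-file statistics the mzmlstatistics module produces can be sketched with a stdlib-only snippet that counts spectra per MS level via the "ms level" cvParam (accession MS:1000511 in the PSI-MS vocabulary). The real module uses a proper mzML parser and computes more (charge states, peak counts, etc.); this is only an illustration of the idea, and real mzML files use XML namespaces, which this toy example omits.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def ms_level_counts(mzml_text):
    """Count spectra per MS level from simplified mzML-style XML by
    reading the 'ms level' cvParam (MS:1000511). Illustration only:
    real mzML is namespaced and should be read with a real parser."""
    counts = Counter()
    root = ET.fromstring(mzml_text)
    for spectrum in root.iter("spectrum"):
        for cv in spectrum.iter("cvParam"):
            if cv.get("accession") == "MS:1000511":
                counts[int(cv.get("value"))] += 1
    return counts

# Tiny inline stand-in for a real mzML file:
example = """<mzML>
  <spectrum id="s1"><cvParam accession="MS:1000511" value="1"/></spectrum>
  <spectrum id="s2"><cvParam accession="MS:1000511" value="2"/></spectrum>
  <spectrum id="s3"><cvParam accession="MS:1000511" value="2"/></spectrum>
</mzML>"""
stats = ms_level_counts(example)
```

Extending the module for .d would mean producing this same TSV schema from the Bruker reader instead of the mzML parser.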
Happy to have you onboard!!!
Hello @ypriverol Thanks a lot!
(Update: all of the search aspects and DIA-NN related outputs work fine right now, but I am patching the file conversion aspect, since in some places the suffix does not match, due to it being changed somewhere to .mzML in the experimental design.)
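One way to tolerate the suffix mismatch described above (design says `sample.mzML`, disk has `sample.d`) is to compare file stems rather than full names. This is a hypothetical helper, not code from the fork:

```python
from pathlib import Path

def match_design_suffix(design_entry, actual_path):
    """Hypothetical check for the mismatch above: the experimental
    design may list 'sample.mzML' while the file on disk is still
    'sample.d'. Comparing by stem ignores the rewritten suffix."""
    return Path(design_entry).stem == Path(actual_path).stem
```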
Best!
Hello @ypriverol Thanks a lot!
- I don't think I understand what you mean about changing the stats step (I might have forked the repo after the change was done).
For the pmultiqc reports we generate at the end of the workflow, we use the mzMLs to detect the number of MS2 spectra, % of identified MS2, charge states, etc. In the step that I mentioned, we read every mzML and generate a TSV with the statistics needed by pmultiqc. For the first implementation you are doing with .d, we don't need to implement that (then pmultiqc should be disabled by default), but in future implementations we should enable .d parsing. @WangHong007 @daichengxin am I correct here?
- I agree. In my experience, depending on how the mzML is generated, it might either not retain all the information or not retain all the connections in the underlying data, making the data a lot harder for the search engine to use (which is why I am adding this feature :P).
👍
- The last thing I wanted to ask was on the documentation side of things. Could you point me to where the long-form docs are located? (Is this branch >> https://github.com/bigbio/quantms/blob/readthedocs/docs/formats.rst << auto-generated, or do I open a separate PR there to update those docs? On a similar matter, for contributions, should I PR against bigbio/dev or nf-core/dev?)
All the documentation is hosted in our bigbio/quantms repo in the readthedocs branch: https://github.com/bigbio/quantms/tree/readthedocs. Any change to our tutorials, explanations, etc. should be done there. If you open a PR against that branch, the documentation will be updated automatically.
We use either of them, bigbio/dev or nf-core/dev. In bigbio/dev we have more freedom to merge, discuss results, etc. Also, the bigbio organization is where other libraries of the pipeline, like pmultiqc or sdrf-pipelines, are stored, which makes it easier to share users, maintainers, etc. That organization's repo is also used to store the docs hosted on readthedocs.
❤️
Best!
Description of feature
Currently we always convert to mzML, although some tools allow reading raw files directly (e.g. DIA-NN, and I think even Comet). We only need to make sure that the mzML statistics module can also read raw files; there are Thermo raw parsing libraries for Python out there. Or we convert just for QC while DIA-NN is running and discard the intermediate results.
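The dispatch this feature request implies could look like the sketch below: skip mzML conversion whenever the downstream tool can read the vendor format natively. The capability table is illustrative only (DIA-NN's entries follow the Linux support note at the top of this thread; Comet's raw support is hedged there too, not confirmed).

```python
from pathlib import Path

def needs_mzml_conversion(path, tool):
    """Sketch: convert to mzML only when the downstream tool cannot
    read the input format natively. The table below is illustrative,
    not an audit of each tool's actual capabilities."""
    native = {
        "diann": {".d", ".mzML", ".dia"},  # per the Linux support note above
        "comet": {".mzML", ".raw"},        # "I think even Comet" -- unverified
    }
    return Path(path).suffix not in native.get(tool, {".mzML"})
```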