nf-core / proteomicslfq

Proteomics label-free quantification (LFQ) analysis pipeline
https://nf-co.re/proteomicslfq
MIT License

Fraction Column is lost and reevaluated by MSStats #174

Open tillenglert opened 2 years ago

tillenglert commented 2 years ago

I'm currently adding MSFragger as a search engine for ProteomicsLFQ. When running the minimal test profile, I ran into an issue with MSstats. The tool could not figure out the fractionation of the samples and stopped execution with the following message:

"** It is hard to find the same fractionation across sample, due to lots of overlapped features between fractionations.
                     Please add Fraction column in input."

Searching for the cause, I looked into the MSstats source code, specifically the function OpenMStoMSstatsFormat, which preprocesses the data before dataProcess is called. This function keeps only the required columns of the out.csv produced by proteomicslfq, which are the following:

requiredinput.general <- c("ProteinName", "PeptideSequence", "PrecursorCharge", 
                                "FragmentIon", "ProductCharge", "IsotopeLabelType",
                                "Condition", "BioReplicate", "Run", "Intensity")

source: https://rdrr.io/bioc/MSstats/src/R/OpenMStoMSstatsFormat.R (MSstats 3.22)

This leads to the loss of the Fraction column. It did not cause an error with the Comet or MSGF+ search engines, because MSstats analyses the features and can detect whether the runs are technical replicates or fractionated samples if the features are clear enough. I guess the problem with MSFragger was that it found too many overlapping features and, at the same time, too many duplicated features across fractions and samples.
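To make the column loss concrete, here is a minimal Python sketch of the subsetting step described above. It is not the MSstats code (which is R); the helper name `keep_required` and the example row are hypothetical, and only the column names are taken from the `requiredinput.general` vector quoted in this comment.

```python
# Column names copied from the requiredinput.general vector quoted above.
REQUIRED_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge",
    "FragmentIon", "ProductCharge", "IsotopeLabelType",
    "Condition", "BioReplicate", "Run", "Intensity",
]


def keep_required(row):
    """Return a copy of `row` restricted to REQUIRED_COLUMNS (hypothetical helper)."""
    return {name: row[name] for name in REQUIRED_COLUMNS}


# Made-up example row; the values are placeholders, not real pipeline output.
example_row = {
    "ProteinName": "P1", "PeptideSequence": "PEPTIDEK", "PrecursorCharge": "2",
    "FragmentIon": "NA", "ProductCharge": "0", "IsotopeLabelType": "L",
    "Condition": "A", "BioReplicate": "1", "Run": "1", "Intensity": "1000.0",
    "Fraction": "2",
}

subset = keep_required(example_row)

# Any column that is not in the required list is silently dropped.
dropped = sorted(set(example_row) - set(subset))
print(dropped)  # prints ['Fraction']
```

Once the column is gone, anything downstream can only infer the fractionation from the features themselves, which is the guess that failed here.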

When I tested the newest version of MSstats (4.2), it correctly assigned the fractions. The latest version depends on MSstatsConvert, which contains the converters for the different MS tools. So upgrading might make the ProteomicsLFQ pipeline more robust, especially since the fraction information is otherwise lost.

jpfeuffer commented 2 years ago

I think it would be better if OpenMS just exported a Fraction column correctly, instead of hoping for a correct guess by MSstats.

jpfeuffer commented 2 years ago

@timosachsenberg I have no idea why this is not the case. I thought we export everything.

jpfeuffer commented 2 years ago

I also did a PR to MSstats once to address this issue. Maybe it did not make it into 3.22? Did you check 3.22.1 or whatever else came before 4? I never got 4 to work with newer OpenMS versions, because OpenMS does not build on bioconda anymore and is incompatible with some dependencies, I think.

jpfeuffer commented 2 years ago

https://github.com/Vitek-Lab/MSstats/commit/d78e2aadb6732d363a04503b76dc2297384c30c9

tillenglert commented 2 years ago

https://github.com/Vitek-Lab/MSstats/blob/3a3acbbd37f3cdebbb8db7bf165c96306f732e2d/R/converters.R#L234

Seems not to be in the code anymore, after they changed their code structure!

tillenglert commented 2 years ago

I tested with 3.22.1, which should be the latest version before v4.

And yes, v4 is not compatible with the nfcore/proteomicslfq Docker container. For testing (v4.2.0) I had to build another container.

timosachsenberg commented 2 years ago

"@timosachsenberg I have no idea why this is not the case. I thought we export everything."

Yeah, we checked. We export it, and it seems that the issue is on the MSstats side (see Till's comments).
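A quick way to double-check an export for fraction information is sketched below in Python. It assumes a comma-separated file whose header contains Condition, BioReplicate and Fraction columns; the function name `fractions_per_sample` and the sample table are made up for illustration and are not part of OpenMS, MSstats or the pipeline.

```python
import csv
import io
from collections import defaultdict


def fractions_per_sample(csv_text):
    """Map (Condition, BioReplicate) to the sorted Fraction values seen for it.

    Raises ValueError when the header has no Fraction column at all.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if "Fraction" not in (reader.fieldnames or []):
        raise ValueError("no Fraction column in input")
    seen = defaultdict(set)
    for row in reader:
        seen[(row["Condition"], row["BioReplicate"])].add(row["Fraction"])
    return {key: sorted(values) for key, values in seen.items()}


# Made-up table with two samples, each measured in two fractions.
SAMPLE = """Condition,BioReplicate,Run,Fraction,Intensity
A,1,1,1,1000.0
A,1,2,2,1500.0
B,2,3,1,900.0
B,2,4,2,1100.0
"""

result = fractions_per_sample(SAMPLE)
for sample, fractions in sorted(result.items()):
    print(sample, fractions)
```

If every sample lists the expected fractions here but MSstats still has to guess, the information is being dropped after the export, which matches what was found above.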

jpfeuffer commented 2 years ago

Can you find out why it is not compatible? In theory the openms::openms2.7.0pre package should be built with the latest conda packages. 2.6.0 from bioconda is of course outdated. It could be that some third-party tools clash in the openms-thirdparty package. I already removed some of them (maybe some of them can be fixed by conda rebuilds/updates). In the worst case we use openms and only add the ones we need separately.

I think this would be the way forward. Otherwise we need to monkey patch the function in our R code. I remember having done such a thing before in my own scripts.

tillenglert commented 2 years ago

proteomicslfq_docker_build.log

Attached is the log of the dockerfile build of nf-core/proteomicslfq with the following environment.yml:

name: nf-core-proteomicslfq-1.0.0
channels:

So there are conflicts but conda can't figure out where.

jpfeuffer commented 2 years ago

I would try "mamba" to find the conflicts. Conda is basically useless for this. And in this case even seems to be bugged. I think you can just install mamba instead of conda and use the same commands.

tillenglert commented 2 years ago

After some testing I finally managed to include MSstats v4.2, but for this I needed to change the version of python (to v3.9) and ptxqc (to v1.0.12). Unfortunately, this leads to an error in ptxqc when running the test profile. The current environment is:

name: nf-core-proteomicslfq-1.0.0
channels:

The error of ptxqc is the following:

Loading required package: PTXQC
Loading package PTXQC (version 1.0.12)
Error in file.exists(pattern = mqpar_filename) : invalid 'file' argument
Calls: createReport -> getMetaFilenames -> getMQPARValue -> file.exists
In addition: Warning messages:
1: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
  Some parent terms not found: MS:1001456
2: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
  Some parent terms not found: UO:0000000
Execution halted

timosachsenberg commented 2 years ago

I will ask @cbielow if he knows what the issue is here

cbielow commented 2 years ago

I cannot find anything obviously wrong with the code in PTXQC. There should be a warning() (not an error) on the console which provides further details if mqpar.xml cannot be found, but your output has none... this is a bit strange. Can someone point me to the script and the data that you are actually running?!

jpfeuffer commented 2 years ago

Why does it want an mqpar.xml at all? We input mztab.

cbielow commented 2 years ago

It's quite an unusual combination indeed, but the mqpar.xml is used to find some threshold parameters, if available.

tillenglert commented 2 years ago

The script I'm using is this nextflow script:

https://github.com/tillenglert/proteomicslfq/blob/master/main.nf#L1304

with this config (testfiles): https://github.com/tillenglert/proteomicslfq/blob/master/conf/test.config#L20

As I'm still working on MSFragger, I tested the ptxqc process with Comet. The logs and input files are attached to this comment: ptxqc_logs.zip

cbielow commented 2 years ago

The error is fixed in the current development version of PTXQC. It will be some time before the new version is published.

Since this is a regression, the last working version should be PTXQC v1.00.10 - May 2021. If you can use that version for the time being, the bug should be resolved.

tillenglert commented 2 years ago

Ah, perfect! I hadn't tried this version before, but it works and is compatible with the remaining packages.

This is the current environment I'm using, which works for MSstats and PTXQC:

name: nf-core-proteomicslfq-1.0.0
channels:

jpfeuffer commented 2 years ago

Feel free to open a PR with the environment update