jeffsocal / tidyproteomics

An S3 data object and framework for common quantitative proteomic analyses
https://jeffsocal.github.io/tidyproteomics/
MIT License

parsing MaxQuant protein outputs fails #16

Closed Ravasz closed 7 months ago

Ravasz commented 9 months ago

I was trying out this package and found that parsing MaxQuant outputs returns an error.

Specifically, I used the MaxQuant output from https://www.ebi.ac.uk/pride/archive/projects/PXD000987 . In the MaxQuantOutput.zip file there is a proper proteinGroups.txt file.

I tried to import this to tidyproteomics with the following code:

test_mq_prot <- "/location/of/file/proteinGroups.txt" |> import("MaxQuant", "proteins")

I got the following error:

ℹ Importing MaxQuant:
ℹ ... created a protein_group accounting
ℹ ... split protein with \; resulting in 8352 new rows
ℹ ... removed ^REV\_ from protein in 848 rows
ℹ ... no homology detected
ℹ ... match between runs not found in data
Error in `import_validate()`:
! ... import error, protein not retaining values, check import regex
Run `rlang::last_error()` to see where the error occurred.
✖ ... parsing proteinGroups.txt [3.6s]

Looking into it, the MaxQuant parser appears to use a regex pattern to extract protein IDs from the protein ID column in the file. The pattern seems to be written for a FASTA header, but in MaxQuant outputs the protein ID column already contains proper UniProt IDs, so the regex matches nothing. Specifically, the regex extracts all characters between the first two | symbols in a string, but UniProt IDs do not contain | symbols. As all protein identifiers are returned as NA, the pipeline fails.
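To illustrate the mismatch, here is a minimal base R sketch. It assumes the pattern is applied as a Perl-style regex that keeps the first match; `extract_first()` is a hypothetical helper written for this example, not a tidyproteomics function, and the FASTA-style header is only an illustrative input.

```r
# The pattern from the config file, written as an R string.
# The underlying regex is (?<=\|).*?(?=\|)
pattern <- "(?<=\\|).*?(?=\\|)"

# Hypothetical helper (not part of tidyproteomics): return the first
# match of `pattern` in each element of `x`, or NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

ids <- c(
  "sp|P12345|AATM_RABIT",  # FASTA-header style: accession sits between two pipes
  "P12345"                 # bare UniProt accession, as in this proteinGroups.txt
)

print(extract_first(ids, pattern))
```

The first element yields the accession, while the bare accession yields NA, which matches the "protein not retaining values" failure above.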

To fix this, I recommend taking the built-in config/MaxQuant_protein.tsv config file and replacing the regex pattern (?<=\\|).*?(?=\\|) in the row identifier, column pattern_extract, with NA.

Additional tweaks may also be required as the peptide importing pipeline also fails, and I think the number of unique peptides is not extracted properly from MaxQuant protein outputs, but I am still looking into that.

Otherwise thank you for putting together this package, I definitely see a use case for it.

jeffsocal commented 7 months ago

Ravasz - thank you for the input, and apologies for the long delay. You are correct about the MaxQuant parsing; unfortunately, it comes down to which FASTA file one uses in the analysis. I have since added a function, export_config(), that exports the current default config file to the current directory, where it can be edited and then referenced when importing. I am contemplating ways to check the regex against the import file and provide better error handling, but I don't have a solution I like yet.
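One possible shape for such a check, offered only as a rough sketch and not as the package's approach: test the extraction pattern against the identifier values before importing and fail early with a more specific message. Both function names and the `min_rate` parameter are hypothetical.

```r
# Hypothetical helper: fraction of non-missing identifiers that the
# extraction pattern matches at least once.
pattern_match_rate <- function(values, pattern) {
  values <- values[!is.na(values)]
  if (length(values) == 0) return(NA_real_)
  mean(grepl(pattern, values, perl = TRUE))
}

# Hypothetical helper: stop early when too few identifiers match.
check_pattern <- function(values, pattern, min_rate = 1) {
  rate <- pattern_match_rate(values, pattern)
  if (is.na(rate) || rate < min_rate) {
    stop(sprintf(
      "pattern '%s' matched %.0f%% of identifiers; check the import regex",
      pattern, 100 * rate
    ), call. = FALSE)
  }
  invisible(rate)
}

pattern   <- "(?<=\\|).*?(?=\\|)"
fasta_ids <- c("sp|P12345|AATM_RABIT", "sp|Q9Y6K9|NEMO_HUMAN")
bare_ids  <- c("P12345", "Q9Y6K9")

print(pattern_match_rate(fasta_ids, pattern))  # every header matches
print(pattern_match_rate(bare_ids, pattern))   # no bare accession matches
```

Calling `check_pattern(bare_ids, pattern)` would stop before any rows are parsed, which points the user at the config file rather than at a later validation step.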