parsing MaxQuant protein outputs fails

I was trying out this package and found that parsing MaxQuant outputs returns an error.

Specifically, I used the MaxQuant output from https://www.ebi.ac.uk/pride/archive/projects/PXD000987 . In the MaxQuantOutput.zip file there is a proper proteinGroups.txt file.

I tried to import this to tidyproteomics with the following code:

test_mq_prot <- "/location/of/file/proteinGroups.txt" |> import("MaxQuant", "proteins")

I got the following error:

ℹ Importing MaxQuant:
ℹ ... created a protein_group accounting
ℹ ... split protein with \; resulting in 8352 new rows
ℹ ... removed ^REV\_ from protein in 848 rows
ℹ ... no homology detected
ℹ ... match between runs not found in data
Error in `import_validate()`:
! ... import error, protein not retaining values, check import regex
Run `rlang::last_error()` to see where the error occurred.
✖ ... parsing proteinGroups.txt [3.6s]

Looking into it, the MaxQuant parser appears to apply a regex pattern to extract protein IDs from the protein ID column of the file. The pattern seems to be written for FASTA headers, but in MaxQuant outputs this column already contains plain UniProt IDs, so the regex matches nothing. Specifically, it extracts the characters between the first two | symbols in a string, and UniProt IDs do not contain | symbols. As a result, every protein identifier is returned as NA and the pipeline fails.
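
The behaviour described above can be reproduced outside the package with a minimal base R sketch. The helper extract_first() and the two example strings are hypothetical and only for illustration; I have not checked which extraction function tidyproteomics calls internally, so this only shows what the pattern itself does with each kind of input.

```r
# Regex from config/MaxQuant_protein.tsv, written as an R string
pattern <- "(?<=\\|).*?(?=\\|)"

# Hypothetical helper (not part of tidyproteomics): return the first match
# of `pattern` in each element of `x`, or NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # perl = TRUE enables lookarounds
  hit <- as.integer(m) != -1L             # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches() keeps matches only
  out
}

# Illustrative inputs: a FASTA-style header and a bare accession
ids <- c("sp|P12345|AATM_RABIT", "P12345")
result <- extract_first(ids, pattern)
print(result)
# [1] "P12345" NA
```

The FASTA-style header yields the accession, while the bare accession, which is what proteinGroups.txt contains, yields NA.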

To fix this, I recommend editing the built-in config file config/MaxQuant_protein.tsv: in the row identifier, column pattern_extract, replace the regex pattern (?<=\\|).*?(?=\\|) with NA.

Additional tweaks may also be required: the peptide import pipeline also fails, and I think the number of unique peptides is not extracted properly from MaxQuant protein outputs. I am still looking into that.

Otherwise, thank you for putting together this package. I definitely see a use case for it.
