Closed Ravasz closed 7 months ago
Ravasz - thank you for the input, apologies for the long delay. You are correct about the MaxQuant parsing, and unfortunately it comes down to which FASTA file one uses in the analysis. I have since created a function, `export_config()`, that exports the current default config file to the current directory; the file can then be edited and referenced when importing. I am contemplating ways to check the regex against the import file and provide better error handling, but I don't have a solution I like yet.
I was trying out this package and found that parsing MaxQuant outputs returns an error.
Specifically, I used the MaxQuant output from https://www.ebi.ac.uk/pride/archive/projects/PXD000987. In the `MaxQuantOutput.zip` file there is a proper `proteinGroups.txt` file. I tried to import this into tidyproteomics with the following code:
test_mq_prot <- "/location/of/file/proteinGroups.txt" |> import("MaxQuant", "proteins")
I got the following error:
Looking into it, the MaxQuant parser appears to use a regex pattern to extract protein IDs from the protein ID column in the file. The pattern seems to be written for a FASTA header, but in MaxQuant outputs the protein ID column already contains proper UniProt IDs, so the regex extracts an empty string. Specifically, the regex extracts all characters between the first two `|` symbols in a string, but UniProt IDs do not contain `|` symbols. As all protein identifiers are returned as NA, the pipeline fails.

To fix this, I recommend taking the built-in `config/MaxQuant_protein.tsv` config file and replacing the regex pattern `(?<=\\|).*?(?=\\|)` in the row `identifier`, column `pattern_extract`, with NA.

Additional tweaks may also be required, as the peptide importing pipeline also fails, and I think the number of unique peptides is not extracted properly from MaxQuant protein outputs, but I am still looking into that.
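
The regex behaviour described above can be reproduced outside the package with a minimal base R sketch. The helper `extract_first()` and the two example identifiers are hypothetical and only for illustration; this is not how tidyproteomics applies the pattern internally, and the package may surface a non-match as an empty string rather than NA.

```r
# Regex quoted in the config file; in an R string literal "\\|" is a literal pipe.
pattern <- "(?<=\\|).*?(?=\\|)"

# Hypothetical helper (not part of tidyproteomics):
# returns the first match per element, or NA when nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # perl = TRUE enables lookarounds
  hit <- as.integer(m) != -1L             # -1 means "no match"
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches returns matched elements only
  out
}

fasta_style <- "sp|P12345|AATM_RABIT"  # illustrative FASTA-header style identifier
bare_id     <- "P12345"                # illustrative bare accession, no pipes

extract_first(fasta_style, pattern)  # "P12345"
extract_first(bare_id, pattern)      # NA
```

With a bare accession there are no `|` symbols to anchor the lookarounds, so nothing is extracted, which matches the failure seen on import.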
Otherwise thank you for putting together this package, I definitely see a use case for it.