smith-chem-wisc / FlashLFQ

Ultra-fast label-free quantification algorithm for mass-spectrometry proteomics
GNU Lesser General Public License v3.0

Questions about the generic input format #98

Closed wfondrie closed 3 years ago

wfondrie commented 3 years ago

Hi folks,

I'm trying to add support for generating a FlashLFQ input from mokapot and I had a couple questions about the generic input format:

  1. Should path be included or specifically excluded from the File Name column?
  2. How should multiple proteins be delimited in the Protein Accession format? The example file uses |, but I was wondering if other delimiters could be used. Specifically, I want to be able to handle cases where folks provide the full UniProt ID from a FASTA file (<database>|<accession>|<identifier>) in their protein lists.

Thank you for creating and maintaining a great open-source proteomics tool with awesome documentation!

trishorts commented 3 years ago

Hi Will. We'll get you answers soon.

rmillikin commented 3 years ago
  1. Short answer: including or excluding the path and/or extension is OK.

Long answer: FlashLFQ will trim the spectra files you pass down to the filename (without path or extension) and will do the same to the filename column in the PSMs file. If a PSM's spectra file name (without extension) is not present in the list of spectra file names (without extensions), then the PSM is skipped. So including a full path or extension is OK; both of these will get stripped out. However, the program will probably get messed up if you include an extra period in the spectra name, because FlashLFQ will think this is part of the extension.
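To make the matching rule concrete, here is a minimal Python sketch of the behavior described above. It is not FlashLFQ's actual code (which is C#); `base_name` and `matching_psms` are hypothetical helper names, and the sketch assumes that only the final extension is stripped.

```python
from pathlib import PureWindowsPath


def base_name(spectra_file: str) -> str:
    """Drop the directory and the final extension (hypothetical helper)."""
    # PureWindowsPath treats both "/" and "\\" as path separators,
    # so it handles Windows-style and POSIX-style paths alike.
    return PureWindowsPath(spectra_file).stem


def matching_psms(psm_file_names, spectra_files):
    """Keep only PSM file names whose base name matches a spectra file."""
    known = {base_name(f) for f in spectra_files}
    return [name for name in psm_file_names if base_name(name) in known]


spectra = ["C:\\data\\run1.raw", "/data/sample.1.mzML"]

# Path and extension do not matter: all three forms match run1.
print(matching_psms(["run1", "run1.raw", "/other/dir/run1.mzML"], spectra))

# The extra-period pitfall: "sample.1" is read as base "sample" plus
# extension ".1", so it does not match "sample.1.mzML" (base "sample.1").
print(matching_psms(["sample.1"], spectra))
```

Under these assumptions the first call keeps all three names and the second returns an empty list, which is the failure mode the extra period causes.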

  2. This is not well documented, but for generic PSM files the delimiter is the semicolon character (";"). Currently there is no way to use a custom character to delimit proteins. I'll fix the generic example file, because as you pointed out, it uses the | character. We use this character in MetaMorpheus to delimit proteins within a protein group, and we treat this as a different scenario than delimiting protein groups from each other.

https://github.com/smith-chem-wisc/FlashLFQ/blob/fa2bf92e0acf5ac05d51543a68364049aa7905ae/Util/PsmReader.cs#L32
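For anyone writing a generic input file from Python, the following is a rough sketch of building the Protein Accession field with the semicolon delimiter. The function names are hypothetical, and reducing full UniProt IDs to the bare accession is an assumption made here to keep "|" out of the field; the thread does not state how FlashLFQ treats a "|" inside a generic-format entry.

```python
def uniprot_accession(protein_id: str) -> str:
    """Return the accession from <database>|<accession>|<identifier>.

    IDs that do not have exactly three "|"-separated parts are
    returned unchanged. (Hypothetical helper.)
    """
    parts = protein_id.split("|")
    return parts[1] if len(parts) == 3 else protein_id


def protein_accession_field(proteins, accession_only=True):
    """Join protein IDs with ";" for the Protein Accession column."""
    if accession_only:
        proteins = [uniprot_accession(p) for p in proteins]
    for p in proteins:
        # A ";" inside an ID would be read as a delimiter.
        if ";" in p:
            raise ValueError(f"protein ID contains ';': {p!r}")
    return ";".join(proteins)


print(protein_accession_field(["sp|P12345|ALBU_HUMAN", "Q9Y6K9"]))
# P12345;Q9Y6K9
```

Passing `accession_only=False` keeps the full IDs and only changes the delimiter between proteins.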

Thanks for integrating FlashLFQ into mokapot, I'm excited to see how it goes! We're always happy to provide support.

wfondrie commented 3 years ago

Great - thank you for the details!