Support for Percolator output files

wsnoble commented 2 years ago

It would be great if Percolator PSM-level tab-delimited output files could be supported by FlashLFQ. I looked into this, and there are two major and several minor challenges associated with making this happen.

The first major issue is that Percolator does not include the mzML file name, but only an integer file index, in the output file. This is obviously problematic, and it's something we can look into fixing on the Percolator end of things.

The second major issue is that Percolator does not include retention time. This is harder for us to fix, because this information is also not included in the outputs of many common search engines. It seems like, if you have the scan numbers in the Percolator output, it should be feasible for FlashLFQ to grab the RT from the mzML file. Is this doable?
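A minimal sketch of the lookup asked about above, assuming the scan number can be read from a `scan=N` token in each spectrum's `id` attribute (the Thermo-style native ID; other vendors format this differently) and that retention time is stored under the PSI-MS term `MS:1000016` ("scan start time"). It is written in Python with the standard library only, as an illustration; it is not how FlashLFQ itself does this, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SCAN_START_TIME = "MS:1000016"  # PSI-MS accession for "scan start time"

def scan_retention_times(mzml_source):
    """Return {scan_number: retention_time_in_minutes} for an mzML file.

    mzml_source may be a path or an open binary file object.
    Assumes spectrum ids contain a "scan=N" token (an assumption, see above).
    """
    times = {}
    for _event, elem in ET.iterparse(mzml_source, events=("end",)):
        # Tags look like "{http://psi.hupo.org/ms/mzml}spectrum"; drop the namespace.
        if elem.tag.rsplit("}", 1)[-1] != "spectrum":
            continue
        match = re.search(r"scan=(\d+)", elem.get("id", ""))
        if match:
            for param in elem.iter():
                if param.get("accession") == SCAN_START_TIME:
                    value = float(param.get("value"))
                    if param.get("unitName") == "second":
                        value /= 60.0  # normalize to minutes
                    times[int(match.group(1))] = value
                    break
        elem.clear()  # release the spectrum's children; real files are large
    return times
```

Because the whole file is streamed once, it is cheaper to build this map once per mzML file and then look up every PSM's scan in it than to search the file once per PSM.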

The other minor issues are that Percolator does not have a "Base Sequence" column and that Percolator uses comma-delimited protein ID lists, rather than semi-colon delimited lists. These latter ones, along with differences in column naming, should be easy to handle.
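The two minor conversions could look roughly like this. This is a sketch under assumptions: it supposes peptides are written with optional flanking residues and bracketed modifications (for example `K.PEPTM[15.9949]IDE.R`), which should be checked against the actual Percolator output, and both function names are made up for illustration.

```python
import re

def base_sequence(peptide):
    """Strip flanking residues and bracketed modifications from a peptide string."""
    # "K.PEPTIDE.R" -> "PEPTIDE"; strings without flanks are left as they are.
    flanked = re.fullmatch(r"[A-Z\-]\.(.+)\.[A-Z\-]", peptide)
    core = flanked.group(1) if flanked else peptide
    core = re.sub(r"\[[^\]]*\]", "", core)  # drop "[15.9949]"-style modifications
    return re.sub(r"[^A-Z]", "", core)      # keep amino-acid letters only

def semicolon_proteins(protein_ids):
    """Convert a comma-delimited protein ID list to a semicolon-delimited one."""
    return ";".join(p.strip() for p in protein_ids.split(",") if p.strip())
```

The protein conversion assumes that no individual protein ID contains a comma; if some do, the split would need a stricter rule.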

Here is a tiny sample Percolator output file: sample.txt

wsnoble commented 2 years ago

Update on this: I am trying to run FlashLFQ on a real example. This is a Percolator file with 184k lines, drawn from a set of 11 mzML files. One thing that I noticed is that on Linux the --thr option has no effect: the process runs on a single thread. I get the following output:

$ flashlfq --idt /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-21flashlfq/rep1/C1/percolator.target.psms.txt --rep /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/data/rep1 --out /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-21flashlfq/rep1/C1 --chg
Opening PSM file /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-21flashlfq/rep1/C1/percolator.target.psms.txt

But then it just hangs seemingly forever. Would one expect this analysis to take multiple days to complete? Is there any way to get it to give an indication of progress or to make the program run faster?

trishorts commented 2 years ago

rob thinks the problem may be in the first step. if this percolator file has no retention time output, then we have to look up the time for each scan. that step was not done in parallel. so maybe that takes a long time? can you confirm? maybe I can split the lookup by file. i'm pretty booked for the next ten days. i'll see what I can do.
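To illustrate the "split the lookup by file" idea: a rough Python sketch, not FlashLFQ's actual C# code. PSMs are grouped by spectra file, each file is opened once, and the files are processed concurrently. `load_times` is a hypothetical caller-supplied function mapping a file name to a `{scan: retention_time}` dictionary.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_scans_by_file(psms):
    """Group (file_name, scan_number) pairs into {file_name: {scan, ...}}."""
    grouped = defaultdict(set)
    for file_name, scan in psms:
        grouped[file_name].add(scan)
    return grouped

def lookup_all(psms, load_times, max_workers=4):
    """Return {(file_name, scan): rt}, reading each spectra file exactly once.

    load_times is a hypothetical callable: file_name -> {scan: rt}.
    """
    grouped = group_scans_by_file(psms)
    files = list(grouped)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = list(pool.map(load_times, files))  # one task per file
    result = {}
    for file_name, times in zip(files, per_file):
        for scan in grouped[file_name]:
            if scan in times:  # scans absent from the file are skipped
                result[(file_name, scan)] = times[scan]
    return result
```

The important part is the grouping: the cost becomes one pass per spectra file rather than one search per PSM. In Python, threads help mainly when the work is I/O-bound; for CPU-heavy parsing a process pool may be the better choice.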

wsnoble commented 2 years ago

I made a version of the Percolator file that contains only the header plus 9 PSMs drawn from the same mzML file. That is running now -- it's been going about 2 hours so far.

percolator.target.psms.txt

trishorts commented 2 years ago

can you tell me about these dots in the path? not sure how that works.

wsnoble commented 2 years ago

Those are relative pathnames. That works the same in DOS/Windows and Unix. The resolved pathname is /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/data/rep1/Pf...

Bill
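For reference, resolving `..` segments is a purely lexical operation that most standard libraries provide. A small Python sketch, using short hypothetical paths rather than the real ones from this thread:

```python
import ntpath
import posixpath

def resolve_dots(path, windows=False):
    """Collapse "." and ".." segments without touching the file system."""
    return ntpath.normpath(path) if windows else posixpath.normpath(path)

# Hypothetical example paths, for illustration only.
unix_listed = "/proj/results/run1/../../data/rep1/sample.mzML"
windows_listed = r"C:\proj\results\run1\..\..\data\rep1\sample.mzML"

print(resolve_dots(unix_listed))
print(resolve_dots(windows_listed, windows=True))
```

Note that `normpath` does not follow symbolic links; if a directory on the path is a symlink, `os.path.realpath` gives the location the operating system would actually open.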


wsnoble commented 2 years ago

What is the status of this issue? I have not been able to get it to work on my end.

trishorts commented 2 years ago

i worked on an update that would allow no extensions, extensions, full windows file paths and full linux filepaths in the identification file. all seem to work in my tests. I also made input file processing parallel. maybe that will speed things up. I will have to wait for rob to review, suggest edits, and merge. But, you can test the temporary version by downloading the appropriate file here: https://ci.appveyor.com/project/smith-chem-wisc/flashlfq/builds/44274888/artifacts
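One way to accept all four spellings (bare name, name with extension, full Windows path, full Linux path) is to reduce each to the same key before matching. The following is a sketch in Python, not the merged FlashLFQ change; the function name and the extension list are assumptions.

```python
import ntpath

# Assumption: the spectra file extensions worth stripping.
KNOWN_EXTENSIONS = {".mzml", ".raw"}

def spectra_file_key(name):
    """Reduce a file reference to a bare name usable as a matching key."""
    # ntpath.basename splits on both "\\" and "/", so it handles
    # Windows-style and Linux-style paths alike.
    base = ntpath.basename(name.strip())
    stem, ext = ntpath.splitext(base)
    # Only strip extensions we recognize, so "run.v2" keeps its suffix.
    return stem if ext.lower() in KNOWN_EXTENSIONS else base
```

With this, `C:\data\rep1\run1.mzML`, `/net/data/rep1/run1.mzML`, `run1.mzML` and `run1` all map to `run1`. One caveat: a backslash is a legal character in a Linux file name, and this sketch would treat it as a separator.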

wsnoble commented 2 years ago

I am dependent on our sysadmins to install the latest released version, so I'll wait till this gets reviewed.

wsnoble commented 2 years ago

Any progress on releasing this version?

trishorts commented 2 years ago

my pr has been reviewed and merged. I added flexibility for paths in the identification file.

First, can you tell me how you access flashlfq. Do you use the standard commandline, the docker or bioconda? Once I know that, I know what to focus on for getting a release. bioconda takes me the longest b/c it confuses me. the others are quite fast.

Second, do you think it would be worthwhile for me to do more testing on your data to make sure that my fixes help you? I have time. Just need links and so forth.

wsnoble commented 2 years ago

Thanks. Let me try this new version, and if it doesn't work I will send more sample files. I believe our sysadmins are installing via the standard command line method.

trishorts commented 2 years ago

rob will make the release today. i will notify you when that happens

trishorts commented 2 years ago

new release available here: https://github.com/smith-chem-wisc/FlashLFQ/releases. it will be a little while until i can get the bioconda thing to work

trishorts commented 2 years ago

https://github.com/bioconda/bioconda-recipes/pull/36488

trishorts commented 2 years ago

bioconda/anaconda build is out https://anaconda.org/bioconda/flashlfq

wsnoble commented 2 years ago

FYI, I got this error message again:

Unhandled exception. System.UnauthorizedAccessException: Access to the path '/net/gs/vol3/software/modules-sw/flashlfq/1.2.3/Linux/CentOS7/x86_64/LicenceAgreements.toml' is denied.

I think we know how to fix this, since you previously sent the toml file that we need to put next to the executable. But you might want to make a note about this in your installation guide for people trying to do a Linux installation. We're installing from the command line.

wsnoble commented 2 years ago

I have a question for you regarding file formats. Currently, the version of Percolator inside Crux provides a different (more extensive) set of tab-delimited output columns than the standalone version of Percolator. We are moving away from this setup, since it's hard to maintain. The result will be that the version of Percolator in the new version of Crux will report fewer columns and will use different column names than in the old version of Crux. Presumably, it's easy to change the column names when FlashLFQ tries to parse such a file. But I wonder whether there is critical information that you need that is missing from the Percolator output. Here are the columns that Percolator produces:

PSMId   score   q-value posterior_error_prob    peptide proteinIds
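A rough way to frame the "is anything critical missing" question is to list the fields a quantification tool needs and note which Percolator column, if any, could supply each one. In the Python sketch below, the field names on the left are an assumption based on FlashLFQ's generic tab-delimited input as I understand it and should be checked against its documentation; the mapping itself is illustrative, not authoritative.

```python
import csv

PERCOLATOR_HEADER = "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds"

# Needed field -> Percolator column that could supply it.
# None means no column obviously provides it. (Assumed field names.)
CANDIDATE_SOURCES = {
    "File Name": None,
    "Base Sequence": "peptide",
    "Full Sequence": "peptide",
    "Peptide Monoisotopic Mass": "peptide",  # computable from sequence + mods
    "Scan Retention Time": None,
    "Precursor Charge": None,
    "Protein Accession": "proteinIds",
}

def missing_fields(header_line, sources=CANDIDATE_SOURCES):
    """List needed fields with no source column in a tab-delimited header."""
    columns = set(next(csv.reader([header_line], delimiter="\t")))
    return sorted(
        field for field, col in sources.items()
        if col is None or col not in columns
    )

print(missing_fields(PERCOLATOR_HEADER))
```

Under these assumptions the gaps are file name, precursor charge and retention time. Some of these may be recoverable from the `PSMId` string, depending on how the upstream tool constructs it, which this thread does not specify.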

wsnoble commented 2 years ago

FYI, after fixing the unhandled exception mentioned above, I can now report that I can run flashlfq successfully on the example files that you sent a while back. Yay!

Next I will try it on some of our locally generated files.

wsnoble commented 2 years ago

Can you help me interpret this error message?

$ flashlfq --idt /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-29flashlfq/rep1/C1/percolator.target.psms.txt --rep /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/data/rep1 --out /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-29flashlfq/rep1/C1 --chg
Opening PSM file /net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-29flashlfq/rep1/C1/percolator.target.psms.txt
Done reading PSMs; found 0
No peptide IDs for the specified spectra files were found! Check to make sure the spectra file names match between the ID file and the spectra files

I am not sure what a "peptide ID" is. The input file is attached. I'm guessing I've got the column names wrong perhaps?

percolator.target.psms.txt

I did double check that the spectrum file listed in the PSM file (/net/noble/vol1/home/noble/proj/2020_kiahales_rdeep/results/wnoble/2022-06-03pipeline/../../../data/rep1/Pf_C1-593_MixES-Sol_SG-1_rKCTi_VO2_108.mzML) exists on disk.

trishorts commented 2 years ago

this appears to be reading fine on my end. will you send me the mzML? if you zip it, i think you can just drop it in the comment box.

trishorts commented 2 years ago

i tried to download it at the link you posted above but the file is in your trash: https://drive.google.com/file/d/14YoGqBFs-bfXEtF6Ym18KqWYEnIgNkpK/view?usp=drive_web

wsnoble commented 2 years ago

It's too big for git (188 MB). Here is a link:

https://drive.google.com/file/d/1bezP5celq165oXY-mBivQE96LKToouNN/view?usp=sharing

trishorts commented 2 years ago

FlashLFQ_2022-08-23-12-25-38.zip

Ran on my machine in the GUI w/ no problem. Results attached. Will run on cmd line and report.

trishorts commented 2 years ago

i think cmd produced the error you see:

trishorts commented 2 years ago

when i run from commandline on the visual studio version I get no error.

wsnoble commented 2 years ago

I don't really know what "cmd" is, but I take it it's something you use to try to run C# programs under Linux. Anyway, the upshot seems to be that this works under Windows but not Linux. Is that right?

trishorts commented 2 years ago

close. the version that I download from github doesn't work on the commandline in windows. But the version I upload to GitHub does. The catch is that I can debug the version I upload, but not the version I download. Rob is going to help on this one.

smith-chem-wisc / FlashLFQ

Support for Percolator output files #103