rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

Down-sampling spectra #234

Open cutleraging opened 2 years ago

cutleraging commented 2 years ago

I would like to down-sample spectra in order to know if I am close to reaching saturation of proteins identified in a sample. First I would like to know if this seems reasonable, and if so, what properties of the data need to remain balanced when down-sampling. Second, I am thinking to do this by simply using the sample() function and getting different percentages of spectra. Any thoughts on this?

jorainer commented 2 years ago

I guess downsampling should be reasonable, but I'm not into proteomics data analysis, and it will likely also depend on what type of data you have.

Technically, you could indeed simply use sps_sub <- sps[sample(length(sps), 100)] to randomly select 100 spectra from a Spectra object sps - but eventually you might need to reorder these again, or to randomly select within subsets of the spectra in sps.
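A minimal sketch of this suggestion (assuming an existing Spectra object `sps`; the file path is a placeholder, and note that `sample(x, size)` draws `size` indices from `1..x`):

```r
library(Spectra)

## Placeholder input file; any mzML file read via MsBackendMzR works here.
sps <- Spectra("~/Sample1.mzML", source = MsBackendMzR())

## Randomly pick 100 spectra: sample(x, size) draws `size` values from 1..x.
idx <- sample(length(sps), 100)

## Sort the indices to restore the original acquisition order before
## any downstream processing that assumes retention-time order.
sps_sub <- sps[sort(idx)]
```

Sorting the sampled indices is the "order these again" step: subsetting a Spectra object keeps the order of the indices you pass in.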

cutleraging commented 2 years ago

Thank you Johannes for your reply. I have done something similar to what you suggested with some test data, but am now trying to load my actual data using the following:

fls <- "~/Sample1.mzML"
sps_all <- Spectra(fls, source = MsBackendMzR())

I created this file by converting it from a .raw file using MSConvert. It is 1.79 GB. However, it takes a very long time to read this file, which makes it impractical to run a downsampling script on 20 files. Is there a faster way to do this? Should I use a different format?

Ronnie


jorainer commented 2 years ago

Hm, unfortunately there is no backend that does not require reading at least some of the data from the original files. MsBackendMzR should actually already be relatively fast, because it reads only general (header) information from the original files, not the peaks data. The subsequent subsetting operation should therefore also be quite fast. The only thing that might help is to parallelize the import, on a per-file basis.
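The per-file parallelization mentioned above can be sketched with BiocParallel; the directory path and worker count are placeholders, and this assumes a Spectra version that forwards BPPARAM to the backend initialization:

```r
library(Spectra)
library(BiocParallel)

## Placeholder: collect all mzML files from a (hypothetical) directory.
fls <- dir("~/mzml", pattern = "mzML$", full.names = TRUE)

## Read the spectra headers in parallel, one file per worker; MsBackendMzR
## only imports header data here, peaks stay on disk until requested.
sps_all <- Spectra(fls, source = MsBackendMzR(),
                   BPPARAM = MulticoreParam(workers = 4))
```

On Windows, `SnowParam()` would be the usual substitute for `MulticoreParam()`; `bpparam()` picks a platform-appropriate default.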