Open cutleraging opened 2 years ago
I guess downsampling should be reasonable, but I'm not into proteomics data analysis, and I guess it will also depend on what type of data you have.

Technically, you could indeed simply use `sps_sub <- sps[sample(length(sps), 100)]` to randomly select 100 spectra from a `Spectra` object `sps` - but you might eventually need to sort these again, or to randomly select among subsets of the spectra in `sps`.
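A short sketch of that subsetting, assuming `sps` is a `Spectra` object from the Spectra package; sorting the sampled indices keeps the selected spectra in their original acquisition order, and `filterMsLevel()` shows one way to sample within a subset:

```r
library(Spectra)

## Randomly pick 100 spectra; sort() keeps them in acquisition order.
idx <- sort(sample(length(sps), 100))
sps_sub <- sps[idx]

## To sample within a subset (e.g. only MS2 spectra), filter first.
sps_ms2 <- filterMsLevel(sps, 2L)
sps_ms2_sub <- sps_ms2[sort(sample(length(sps_ms2), 100))]
```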
Thank you Johannes for your reply. I have done something similar to what you suggested with some test data, but now I am trying to load my actual data using the following:

```r
fls <- "~/Sample1.mzML"
sps_all <- Spectra(fls, source = MsBackendMzR())
```
I created this file by converting it from a .raw file using MSConvert; it is 1.79 GB. However, reading this file takes a very long time, which makes it impractical to run a downsampling script on 20 files. Is there a faster way to do this? Should I use a different format?
Ronnie
On Jan 31, 2022, at 08:10, Johannes Rainer wrote:
Hm, unfortunately there is no backend that does not require reading at least some of the data from the original files. The `MsBackendMzR` should actually already be relatively fast, because it reads only general spectra information from the original files (not the peaks data), so the subsetting operation should also be quite fast. The one thing that might help is to parallelize the import (on a per-file basis).
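A sketch of per-file parallel import, assuming the BiocParallel package is available; `Spectra()` passes `BPPARAM` through to the backend initialization, which processes one file per worker (the worker count and file names below are illustrative):

```r
library(Spectra)
library(BiocParallel)

## Hypothetical input files; one worker handles one file at a time.
fls <- c("~/Sample1.mzML", "~/Sample2.mzML")

## Import spectra metadata from all files in parallel.
sps_all <- Spectra(fls, source = MsBackendMzR(),
                   BPPARAM = MulticoreParam(workers = 4))
```

On Windows, `SnowParam()` would be used in place of `MulticoreParam()`, since forked workers are not supported there.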
I would like to down-sample spectra in order to know whether I am close to reaching saturation of proteins identified in a sample. First, I would like to know if this seems reasonable, and if so, which properties of the data need to remain balanced when down-sampling. Second, I am thinking of doing this by simply using the `sample()` function to draw different percentages of the spectra. Any thoughts on this?
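A minimal sketch of the `sample()`-based approach described above, assuming a `Spectra` object `sps`; the fraction values are arbitrary, and each subset would still need to be searched so the identified proteins can be counted for a saturation curve:

```r
library(Spectra)

## Draw random subsets at increasing fractions of the total spectra.
fractions <- c(0.25, 0.5, 0.75, 1)
subsets <- lapply(fractions, function(f) {
    n <- round(f * length(sps))
    sps[sort(sample(length(sps), n))]
})
names(subsets) <- paste0(fractions * 100, "%")
```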