lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

readMgfData as chunks of file #510

Closed videlc closed 3 years ago

videlc commented 4 years ago

Hi,

This is more of an improvement request than an issue. Being on lockdown, I don't have access to a high-memory computer and, while working from home, I ran into a protect() error while loading a "big" mgf file (the GNPS "all" library) with readMgfData. I was wondering whether readMgfData could be split over chunks of the file (let's say the first 1000 spectra, and so on), as in:

group1<-readMgfData(file.mgf, chunk=file[1:1000])

group2<- readMgfData(file.mgf, chunk=file[1001:2000])
etc.

Of course, one would need to know the number of spectra in "file.mgf" in order to process all the chunks and import the complete dataset.

Best regards, Vivian

EDIT: would a workaround be to convert the mgf to some other format outside of R (e.g. using ProteoWizard)?

lgatto commented 4 years ago

Hi - what you are requesting isn't possible, and wouldn't necessarily solve the issue if you want to have all the data in a single experiment. It is however possible to convert mgf files to mzML, which you would then be able to read in using the on-disk backend.
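For example - a minimal sketch, assuming msconvert from ProteoWizard is on the PATH; the output directory and resulting file name are just illustrative:

# outside of R: convert the mgf library to mzML with ProteoWizard's msconvert
msconvert ALL_GNPS.mgf --mzML -o converted/

# then, back in R, read the mzML file with the on-disk backend
library(MSnbase)
x <- readMSData("converted/ALL_GNPS.mzML", mode = "onDisk")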

videlc commented 4 years ago

Will try that. Is there a reason readMgfData has no "onDisk" mode?

lgatto commented 4 years ago

Yes, because the on-disk access is implemented through mzR and ProteoWizard for mzML, mzXML, and netCDF. On-disk access to mgf files would be very slow.

It would be possible to first load the data in memory, then transfer it to an on-disk backend (the Spectra package would probably be best for that), but I don't see a way to get around reading everything into memory first, other than converting mgf to mzML.
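For reference, a rough sketch of that second route with Spectra (the MsBackendMgf and HDF5 backends were still work in progress at the time of this thread, so package and argument names may differ):

library(Spectra)
library(MsBackendMgf)

## read the whole mgf into memory first...
sp <- Spectra("ALL_GNPS.mgf", source = MsBackendMgf())

## ...then move the peak data to an on-disk HDF5 backend
sp <- setBackend(sp, MsBackendHdf5Peaks(), hdf5path = "gnps_peaks")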

videlc commented 4 years ago

Partial victory - we lost some relevant information along the way (the spectrum IDs from GNPS, CCMSLIB..., SMILES, etc.).

Screenshot of GNPS_ALL after conversion to mzML: image

Screenshot of GNPS "light" opened with readMgfData: image

Maybe the information is hidden elsewhere? (Or lost?) I'm not 100% sure the information is in the "all" .mgf file, but since it's in their database, I would assume it is.

lgatto commented 4 years ago

What is lost is probably lost during the mgf to mzML conversion. You could try to parse the mgf file on the command line to recover the missing fields and add them to the feature data.

I mentioned the Spectra package above; it will at some point provide a GNPS backend (and thus shouldn't lose anything). There will be, at a later stage, ways to get mgf data to disk (see here), but we aren't quite there yet.

videlc commented 4 years ago

This might be the solution, as I'm trying to store the database information as a table plus export all spectra to tsv (the link between them being the GNPS ID).

When you mention command line, are you talking about msconvert or something else ?

Maybe I'll try to brute-force this by expanding the protect() limit in R to enable a more straightforward solution.
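(For the record, the protection stack size is an R startup option, so that brute-force route would look something like the line below, with the script name as a placeholder; it may just move the memory problem elsewhere.)

$ R --max-ppsize=500000 -f import_gnps.R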

lgatto commented 4 years ago

I meant something along the lines of grep, perl or awk. If you know that the order of the spectra in the MSnExp and in the original mgf file is the same, you could do

grep MYFIELD mgf > field.txt

and then read that file and append it to your feature data. If the order isn't the same, you'll need another field to match them back.
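Reading that file back and appending it could look something like this (a sketch assuming one SPECTRUMID line per spectrum, in the same order; msexp stands for your MSnExp object):

## field.txt contains lines such as SPECTRUMID=CCMSLIB00000001547
ids <- read.table("field.txt", sep = "=", stringsAsFactors = FALSE)

## the second column holds the value after the "=", in mgf order
fData(msexp)$SPECTRUMID <- ids$V2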

videlc commented 4 years ago

In fact, that might be a solution:

$grep "CCMSLIB" ALL_GNPS.mgf | head
SPECTRUMID=CCMSLIB00000001547
SPECTRUMID=CCMSLIB00000001548
SPECTRUMID=CCMSLIB00000001549
SPECTRUMID=CCMSLIB00000001550
SPECTRUMID=CCMSLIB00000001551
SPECTRUMID=CCMSLIB00000001552
SPECTRUMID=CCMSLIB00000001553
SPECTRUMID=CCMSLIB00000001554
SPECTRUMID=CCMSLIB00000001555
SPECTRUMID=CCMSLIB00000001556

Thanks for the advice, I did not know the information was so easily accessible. Although this might be risky if one spectrum (in the middle of the list) is missing the information. I'm going to put some more effort into this later and keep you updated.

EDIT: In fact there are more "CCMSLIB" identifiers than the spectrum count of the MSnExp (152546, see above):

$ grep "CCMSLIB" ALL_GNPS.mgf | wc -l
152548

videlc commented 4 years ago

Hi

Finally found the time to do this:

Extracting each spectrum to a single file by (1) searching for the "BEGIN IONS" delimiter and (2) writing everything from one "BEGIN IONS" up to the next into its own file, using awk:

$ awk '/BEGIN IONS/{x="SPECTRUM"++i;}{print >> x; close(x)}' ALL_GNPS.mgf

This operation took ~2 days, so consider running it via nohup/screen.

Counting how many spectra were exported :

list_spectra<-list.files(pattern="SPECTRUM",path = "gnpsdb/") 
> length(list_spectra)
[1] 155327

I don't know why there are more spectra than "CCMSLIB" identifiers, but maybe some of them don't have an identifier yet.
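One way to check (not run here) would be to count the spectrum delimiters directly and compare with the SPECTRUMID count above:

$ grep -c "BEGIN IONS" ALL_GNPS.mgf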

Example with a randomly sampled spectrum:


> sample<-sample(list_spectra,size = 1)
> print(sample)
[1] "SPECTRUM131538"

Can it be read?

> readMgfData(paste("gnpsdb/",sample,sep=""))
MSn experiment data ("MSnExp")
Object size in memory: 0.01 Mb
- - - Spectra data - - -
 MS level(s): 2 
 Number of spectra: 1 
 MSn retention times: 0:0 - 0:0 minutes
- - - Processing information - - -
Data loaded: Wed May 27 16:54:44 2020 
 MSnbase version: 2.12.0 
- - - Meta data  - - -
phenoData
  rowNames: 1
  varLabels: sampleNames fileNumbers
  varMetadata: labelDescription
Loaded from:
  SPECTRUM131538 
protocolData: none
featureData
  featureNames: X1
  fvarLabels: PEPMASS CHARGE ... SCANS (19 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'

The data structure seems preserved: image

Exporting the feature data and the peaks of each spectrum into separate tables (this took about 10 minutes):

require(doParallel)
cl <- makeCluster(10)   ## 10 parallel workers
registerDoParallel(cl)

foreach(i = list_spectra) %dopar% {
  require(MSnbase)

  ## read one single-spectrum mgf file
  mgf <- readMgfData(paste("gnpsdb/", i, sep = ""))

  ## feature data (metadata) and the m/z / intensity pairs of the
  ## single spectrum (named X1 in the MSnExp)
  df <- data.frame(mgf@featureData@data)
  spectra <- data.frame(mz = mgf@assayData$X1@mz,
                        intensity = mgf@assayData$X1@intensity)

  ## write both tables, named after the GNPS spectrum identifier
  write.table(df, file = paste("gnpsdb/", df$SPECTRUMID, "_feat_data.txt", sep = ""),
              row.names = FALSE, sep = "\t")
  write.table(spectra, file = paste("gnpsdb/", df$SPECTRUMID, "_spectra.txt", sep = ""),
              row.names = FALSE, sep = "\t")
}
parallel::stopCluster(cl)

> length(list.files(path="gnpsdb/",pattern = "_spectra.txt"))
[1] 155327
> length(list.files(path="gnpsdb/",pattern = "_feat_data.txt"))
[1] 155327

Now we can read and rbind all the "feat_data" tables to build the "database" and read a single "spectra.txt" file when needed.
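For example, a minimal sketch of that last step (assuming all the feature tables share the same columns; otherwise something like dplyr::bind_rows would be needed):

feat_files <- list.files(path = "gnpsdb/", pattern = "_feat_data.txt", full.names = TRUE)
db <- do.call(rbind, lapply(feat_files, read.delim))

## look up one spectrum via its GNPS identifier
one <- read.delim(paste("gnpsdb/", db$SPECTRUMID[1], "_spectra.txt", sep = ""))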

I don't know if this counts as a solution for the issue.

Best regards, Vivian