Closed videlc closed 3 years ago
Hi - what you are requesting isn't possible, and wouldn't necessarily solve the issue if you want to have all the data in a single experiment. It is however possible to convert mgf files to mzML, which you would then be able to read using the on-disk backend.
Will try that. Is there a reason readMgf has no "onDisk" mode?
Yes, because the on-disk access is implemented through mzR and proteowizard for mzML, mzXML, and netCDF. On disk access on mgf files would be very slow.
It would be possible to first load the data in memory, then transfer it to an on-disk backend (the Spectra package would probably be best for that), but I don't see a way to get around reading everything into memory first, other than converting mgf to mzML.
Partial victory: we lost some relevant information along the way (spectrum ID from gnps CCMSLIB...., SMILES, etc.).
Screenshot of GNPS_ALL after conversion to mzML:
Screenshot of GNPS "light" opened with readMgf:
Maybe the information is hidden elsewhere? (Or lost?) I'm not 100% sure the information is in the "all" .mgf file, but since it's in their database, I would assume it is.
What is lost was probably dropped in the mgf to mzML conversion. You could try to parse the mgf file on the command line to recover the missing fields and add them to the feature data.
I mentioned the Spectra package above; it will at some point provide a GNPS backend (and thus shouldn't lose anything). There will be, at a later stage, ways to get mgf data to disk (see here), but we aren't quite there yet.
This might be the solution, as I'm trying to store the database information as a table plus export all spectra to tsv (the link being the GNPSID).
When you mention the command line, are you talking about msconvert or something else?
Maybe I'll try to brute-force this by expanding the protect() limit in R to enable a more straightforward solution.
I meant something along the lines of grep, perl, or awk. If you know that the order of the spectra in the MSnExp and in the original mgf file is the same, you could do

grep MYFIELD mgf > field.txt

and then read that file and append it to your feature data. If the order isn't the same, you'll need another field to match them back.
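To make that concrete, here is a minimal sketch of the grep approach on a tiny, made-up two-spectrum mgf file (the file names and contents are illustrative only; the real input would be ALL_GNPS.mgf):

```shell
# Build a tiny mgf-like file for illustration (hypothetical content)
printf 'BEGIN IONS\nSPECTRUMID=CCMSLIB00000001547\nEND IONS\n' >  mini.mgf
printf 'BEGIN IONS\nSPECTRUMID=CCMSLIB00000001548\nEND IONS\n' >> mini.mgf

# Grab the field of interest and strip the "KEY=" prefix,
# leaving one value per line in the same order as the spectra
grep 'SPECTRUMID' mini.mgf | sed 's/^SPECTRUMID=//' > field.txt
cat field.txt
# -> CCMSLIB00000001547
# -> CCMSLIB00000001548
```

The resulting field.txt can then be read into R and cbind-ed to the feature data, provided the order really matches.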
In fact, that might be a solution:
$grep "CCMSLIB" ALL_GNPS.mgf | head
SPECTRUMID=CCMSLIB00000001547
SPECTRUMID=CCMSLIB00000001548
SPECTRUMID=CCMSLIB00000001549
SPECTRUMID=CCMSLIB00000001550
SPECTRUMID=CCMSLIB00000001551
SPECTRUMID=CCMSLIB00000001552
SPECTRUMID=CCMSLIB00000001553
SPECTRUMID=CCMSLIB00000001554
SPECTRUMID=CCMSLIB00000001555
SPECTRUMID=CCMSLIB00000001556
Thanks for the advice, I did not know the information was so easily accessible. Although this might be risky if one spectrum (in the middle of the list) is missing the field. I'm going to put some more effort into this later and keep you updated.
EDIT: In fact there are more "CCMSLIB" lines than the spectrum count of the MSnExp (152546, see above):
$ grep "CCMSLIB" ALL_GNPS.mgf | wc -l
152548
Hi
Finally found the time to do this:
Extracting each spectrum to a single file by (1) searching for the "BEGIN IONS" delimiter and (2) writing everything from each "BEGIN IONS" up to the next delimiter into its own file, using awk:
$ awk '/BEGIN IONS/{x="SPECTRUM"++i;}{print >> x; close(x)}' ALL_GNPS.mgf
This operation took ~2 days, so consider running it via nohup/screen.
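Much of that run time likely comes from close(x) being called after every single line, which forces awk to reopen the output file in append mode for each print. A variant that closes the previous file only when a new spectrum starts should produce the same files but run considerably faster; a sketch on a tiny hypothetical input (the real input would be ALL_GNPS.mgf):

```shell
# Tiny mgf-like input for illustration (hypothetical content)
printf 'BEGIN IONS\nSPECTRUMID=A\n100 1\nEND IONS\nBEGIN IONS\nSPECTRUMID=B\n200 2\nEND IONS\n' > mini.mgf

# Close the previous output only when a new record starts;
# only one file is open at a time, so the fd limit is never hit
awk '/BEGIN IONS/{if (x) close(x); x = "SPECTRUM" ++i} {print > x}' mini.mgf
ls SPECTRUM*
# -> SPECTRUM1
# -> SPECTRUM2
```

Note this assumes the file starts with a "BEGIN IONS" line (any leading comment lines would hit an unset output file name, as in the original one-liner).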
Counting how many spectra were exported:
> list_spectra <- list.files(pattern = "SPECTRUM", path = "gnpsdb/")
> length(list_spectra)
[1] 155327
I don't know why there are more spectra than "CCMSLIB" entries, but maybe some of them don't have an identifier yet.
Example with a randomly sampled spectrum:
> sample<-sample(list_spectra,size = 1)
> print(sample)
[1] "SPECTRUM131538"
Can it be read ?
> readMgfData(paste("gnpsdb/",sample,sep=""))
MSn experiment data ("MSnExp")
Object size in memory: 0.01 Mb
- - - Spectra data - - -
MS level(s): 2
Number of spectra: 1
MSn retention times: 0:0 - 0:0 minutes
- - - Processing information - - -
Data loaded: Wed May 27 16:54:44 2020
MSnbase version: 2.12.0
- - - Meta data - - -
phenoData
rowNames: 1
varLabels: sampleNames fileNumbers
varMetadata: labelDescription
Loaded from:
SPECTRUM131538
protocolData: none
featureData
featureNames: X1
fvarLabels: PEPMASS CHARGE ... SCANS (19 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
The data structure seems preserved.
Exporting each spectrum's feature data and peak list into individual tables (this took ~10 mins):
require(doParallel)

# Spin up a 10-worker cluster and register it for %dopar%
cl <- makeCluster(10)
registerDoParallel(cl)

foreach(i = list_spectra) %dopar% {
  require(MSnbase)
  # Read the single-spectrum mgf file
  mgf <- readMgfData(paste("gnpsdb/", i, sep = ""))
  # Feature data (SPECTRUMID, PEPMASS, SMILES, ...)
  df <- data.frame(mgf@featureData@data)
  # m/z and intensity values of the (single) spectrum
  spectra <- data.frame(mz = mgf@assayData$X1@mz,
                        intensity = mgf@assayData$X1@intensity)
  # One table per spectrum, named after its SPECTRUMID
  write.table(df, file = paste("gnpsdb/", df$SPECTRUMID, "_feat_data.txt", sep = ""),
              row.names = FALSE, sep = "\t")
  write.table(spectra, file = paste("gnpsdb/", df$SPECTRUMID, "_spectra.txt", sep = ""),
              row.names = FALSE, sep = "\t")
}
parallel::stopCluster(cl)
> length(list.files(path="gnpsdb/",pattern = "_spectra.txt"))
[1] 155327
> length(list.files(path="gnpsdb/",pattern = "_feat_data.txt"))
[1] 155327
Now we can read/rbind all "feat_data" tables to build the "database" and read a single "spectra.txt" when needed.
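The "database" step can also be sketched outside R: concatenate the per-spectrum TSVs while keeping only the first header. The two hypothetical feature tables below stand in for the real 155327 files:

```shell
mkdir -p gnpsdb
# Two hypothetical feature-data tables with the same header
printf 'SPECTRUMID\tSMILES\nCCMSLIB1\tC\n'  > gnpsdb/CCMSLIB1_feat_data.txt
printf 'SPECTRUMID\tSMILES\nCCMSLIB2\tCC\n' > gnpsdb/CCMSLIB2_feat_data.txt

# Keep the header of the first file, skip it in all the others
awk 'FNR == 1 && NR != 1 {next} {print}' gnpsdb/*_feat_data.txt > database.tsv
cat database.tsv
```

database.tsv then has one header line followed by one row per spectrum, which rbinding the tables in R would also produce.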
I don't know if this counts as a solution for the issue.
Best regards, Vivian
Hi,
This is rather an improvement request than an issue. Being on lockdown doesn't give me access to a high-memory computer, and while working from home I encountered a protect() error when loading a "big" mgf file (the GNPS "all" library) with readMgfData. I was wondering if it would be possible to make readMgfData process the file in chunks (let's say the first 1000 spectra, then the next 1000, and so on).
Of course, it would be necessary to know the length of "file.mgf" in order to process all chunks individually and completely import the dataset.
Best regards, Vivian
EDIT: there might be a workaround in converting mgf to some other format outside of R (e.g. using proteowizard)?