Closed — vnaum closed this issue 6 years ago
Hi Vladislav,
This looks like it's most likely an issue with either the mzML file or the pymzml version you're using rather than with Spark; otherwise you would have seen the message `Loaded {} MS2 spectra from {} in {} minutes` before Spark was invoked. Instead, it seems that Spark is being passed an empty set of spectra. Could you let me know the version of pymzml you're using, the version of msconvert used to make the mzML, and the exact command you used at the terminal to run Specter? Thanks for pointing out the discrepancy with the 'scan time' key: I've updated this to the more universal accession key 'MS:1000016', which should work with all mzMLs.
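The key lookup this fix refers to could also be made robust against all three names with a small fallback helper (a sketch only; `get_scan_time` is not part of Specter, and the spectrum object is only assumed to support dict-style access, as pymzml spectra do):

```python
def get_scan_time(spectrum):
    """Return the retention time of a spectrum, preferring the PSI-MS
    accession key and falling back to the two textual names observed
    in different mzML files. Only dict-style access is assumed."""
    for key in ('MS:1000016', 'scan start time', 'scan time'):
        try:
            value = spectrum[key]
        except KeyError:
            value = None
        if value is not None:
            return value
    raise KeyError('spectrum has no scan time under any known key')
```

This way the same code would read mzMLs produced by converters that write either 'scan time' or 'scan start time'.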
Please disregard. For some reason I read "At least 100 GB of cluster RAM is recommended." as "the whole cluster should have 100 GB", not "each node should have 100 GB". And since the OOM happened deep inside Spark, there's zero indication of the error except lines in /var/log/messages. I spawned a single-node cluster with 122 GB of RAM and it... sort of works. At least, it gets past this bit. It fails afterwards, but there's a log to work with. I'll create a new issue once I collect enough data (or a pull request, if there's any workaround/fix).
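For anyone hitting the same silent failure: the kernel's OOM killer leaves a recognizable line in /var/log/messages (or `dmesg`) when it kills a process. A small helper like the one below (hypothetical, not part of Specter; it assumes the standard `Killed process <pid> (<name>)` kernel message format) can pull out which processes were killed:

```python
import re

# Matches the kernel OOM-killer line, e.g.
#   "Out of memory: Killed process 4242 (java) total-vm:..."
OOM_PATTERN = re.compile(r'Killed process \d+ \((?P<name>[^)]+)\)')

def find_oom_kills(log_lines):
    """Return the names of processes the kernel reports killing,
    given lines from /var/log/messages or `dmesg` output."""
    names = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            names.append(match.group('name'))
    return names
```

Seeing `java` in the result would point at the Spark executor JVM being killed for memory.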
Thanks for replying!
> you would have seen the message `Loaded {} MS2 spectra from {} in {} minutes`
It failed earlier than this :-)
> the version of pymzml you're using
pymzml-0.7.8-py27_0, straight from conda. I did copy the OBO file mentioned in the PDF to where the program says it should be (`/root/miniconda2/envs/SpecterEnv/lib/python2.7/site-packages/pymzml/obo/psi-ms-4.0.14.obo`).
> the version of msconvert used to make the mzml
We used ProteoWizard 3.0.11676. We are also using the raw file and blib file from their PRIDE repository for the associated publication.
> the exact command you used at the terminal to run Specter
```
./Specter.sh 20g /mnt/CS20170831_SV_HEK_SpikeP100_108ng_Overlap22_01 /mnt/HEKAndP100HeavyLib 100000 end 200 orbitrap 10
```
I'm trying to get it working, but the spark-submit command fails with this output:
Specter.sh then ignores the errors (`set -e` would help) and goes on to call R, which of course fails (there's no input file). We're running Python 2.7.11, R version 3.4.1 (2017-06-30) -- "Single Candle", conda 4.4.10, and Spark 2.3.0 (Amazon's Elastic MapReduce in pretty much the default configuration).
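To make the error-swallowing concrete: a fail-fast prologue at the top of Specter.sh (a sketch only; the actual contents of the script are assumed) would stop the R step from ever running on missing input:

```shell
#!/usr/bin/env bash
# Abort on the first failing command, on use of an unset variable,
# and on a failure anywhere in a pipeline, so a failed spark-submit
# stops the script before the R step runs on a nonexistent file.
set -euo pipefail

# ... the spark-submit step and the R step would follow here ...
```

With `set -e` in place, the spark-submit failure would surface directly instead of being masked by the later R error.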
Maybe I missed something in the installation guide, or my data file is broken? I had to patch `Specter_Spark.py` to handle 'scan start time' instead of the 'scan time' it expects (for some reason the mzML I have uses 'scan start time') -- just a search-and-replace; otherwise it fails much sooner with

Maybe there are means to run the code without the cluster wrappers, to rule out misconfigured Spark?
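One way to take the EMR cluster out of the picture (an untested sketch; the arguments that Specter.sh passes to `Specter_Spark.py` are assumed, hence the trailing `...`) is to invoke spark-submit in local mode, which runs everything in a single JVM on the current machine:

```shell
# Run the Spark step locally, bypassing the cluster entirely.
# local[*] uses all local cores; --driver-memory sizes the one JVM.
spark-submit \
  --master "local[*]" \
  --driver-memory 20g \
  Specter_Spark.py ...
```

If this succeeds where the cluster run fails, the problem is in the EMR/Spark configuration rather than in Specter or the data.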