mobiusklein / glycresoft

An LC-MS/MS glycan and glycopeptide search engine
https://mobiusklein.github.io/glycresoft/
Apache License 2.0

Error in Begin Evaluating Chromatograms #9

Closed bobaoai closed 2 years ago

bobaoai commented 5 years ago

Hi Joshua,

From inspecting the processed mzML file, my glycoprofile has formate and sodium adducts and one Rapifluor tag. For example, the glycan A2FG2, with mass 1786.65, appears at 2098.82 in the mass spectrum with one tag and H. I want to deduct the adduct masses, so in search-glycan I manually added many adducts, but I am confused about which parameter to use for -f. First, could you check whether my parameters are right overall? Second, the code returns the errors shown below:

glycresoft mzml preprocess \
--averagine glycan \
--maximum-charge 8 \
--name "20190205_TK-EPO_Expres2ion_04" \
--processes 10 \
--background-reduction 5 \
--start-time 8 \
"/data/bokan/glycretest/20190205_TK-EPO_Expres2ion_04.mzXML" \
"/data/bokan/test_nooff.mzML"

glycresoft build-hypothesis glycan-glyspace \
-m n-linked \
-t 10029 \
glyspace-glycans.db \
-n "CHO N-Linked Glycans"

glycresoft analyze search-glycan \
-a "C-17H-21O-1N-5" 1 \
-a "H-2C-1O-2" 2 \
-a 'H-1' 2 \
-a 'Na-1' 2 \
-f formate-adduct-model \
-o test_nooff_all_0.db \
-m 6e-4 \
glyspace-glycans.db \
test_nooff.mzML 1 --export csv

Then it shows the following error


01:17:09 - glycresoft:task         :22   - INFO - glycresoft: version 0.3.12
01:17:09 - glycresoft:task         :264  - INFO - Begin MzML Glycan Chromatogram Analyzer
{'analysis': None,
 'analysis_name': u'20190205_TK-EPO_Expres2ion_04 @ CHO N-Linked Glycans',
 'database_connection': u'glyspace-glycans.db',
 'delta_rt': 0.5,
 'grouping_error_tolerance': 1.5e-05,
 'hypothesis_id': 1,
 'mass_error_tolerance': 0.0006,
 'mass_shifts': [MassShift(C-17H-21O-1N-5, Composition({'H': -21, 'C': -17, 'O': -1, 'N': -5})),
                 MassShift(Na-1, Composition({'Na': -1})),
                 MassShift(C-17H-21O-1N-5 + Na-1, Composition({'C': -17, 'Na': -1, 'O': -1, 'N': -5, 'H': -21})),
                 MassShift(Na-1 * 2, Composition({'Na': -2})),
                 MassShift(C-17H-21O-1N-5 + Na-1 * 2, Composition({'C': -17, 'Na': -2, 'O': -1, 'N': -5, 'H': -21})),
                 MassShift(H-1, Composition({'H': -1})),
                 MassShift(C-17H-21O-1N-5 + H-1, Composition({'H': -22, 'C': -17, 'O': -1, 'N': -5})),
                 MassShift(H-1 + Na-1, Composition({'Na': -1, 'H': -1})),
                 MassShift(C-17H-21O-1N-5 + H-1 + Na-1, Composition({'C': -17, 'Na': -1, 'O': -1, 'N': -5, 'H': -22})),
                 MassShift(H-1 + Na-1 * 2, Composition({'Na': -2, 'H': -1})),
                 MassShift(C-17H-21O-1N-5 + H-1 + Na-1 * 2, Composition({'C': -17, 'Na': -2, 'O': -1, 'N': -5, 'H': -22})),
                 MassShift(H-1 * 2, Composition({'H': -2})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2, Composition({'H': -23, 'C': -17, 'O': -1, 'N': -5})),
                 MassShift(H-1 * 2 + Na-1, Composition({'Na': -1, 'H': -2})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + Na-1, Composition({'C': -17, 'Na': -1, 'O': -1, 'N': -5, 'H': -23})),
                 MassShift(H-1 * 2 + Na-1 * 2, Composition({'Na': -2, 'H': -2})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + Na-1 * 2, Composition({'C': -17, 'Na': -2, 'O': -1, 'N': -5, 'H': -23})),
                 MassShift(H-2C-1O-2, Composition({'H': -2, 'C': -1, 'O': -2})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2, Composition({'H': -23, 'C': -18, 'O': -3, 'N': -5})),
                 MassShift(H-2C-1O-2 + Na-1, Composition({'Na': -1, 'C': -1, 'O': -2, 'H': -2})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2 + Na-1, Composition({'C': -18, 'Na': -1, 'O': -3, 'N': -5, 'H': -23})),
                 MassShift(H-2C-1O-2 + Na-1 * 2, Composition({'Na': -2, 'C': -1, 'O': -2, 'H': -2})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2 + Na-1 * 2, Composition({'C': -18, 'Na': -2, 'O': -3, 'N': -5, 'H': -23})),
                 MassShift(H-1 + H-2C-1O-2, Composition({'H': -3, 'C': -1, 'O': -2})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2, Composition({'H': -24, 'C': -18, 'O': -3, 'N': -5})),
                 MassShift(H-1 + H-2C-1O-2 + Na-1, Composition({'Na': -1, 'C': -1, 'O': -2, 'H': -3})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2 + Na-1, Composition({'C': -18, 'Na': -1, 'O': -3, 'N': -5, 'H': -24})),
                 MassShift(H-1 + H-2C-1O-2 + Na-1 * 2, Composition({'Na': -2, 'C': -1, 'O': -2, 'H': -3})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2 + Na-1 * 2, Composition({'C': -18, 'Na': -2, 'O': -3, 'N': -5, 'H': -24})),
                 MassShift(H-1 * 2 + H-2C-1O-2, Composition({'H': -4, 'C': -1, 'O': -2})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2, Composition({'H': -25, 'C': -18, 'O': -3, 'N': -5})),
                 MassShift(H-1 * 2 + H-2C-1O-2 + Na-1, Composition({'Na': -1, 'C': -1, 'O': -2, 'H': -4})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2 + Na-1, Composition({'C': -18, 'Na': -1, 'O': -3, 'N': -5, 'H': -25})),
                 MassShift(H-1 * 2 + H-2C-1O-2 + Na-1 * 2, Composition({'Na': -2, 'C': -1, 'O': -2, 'H': -4})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2 + Na-1 * 2, Composition({'C': -18, 'Na': -2, 'O': -3, 'N': -5, 'H': -25})),
                 MassShift(H-2C-1O-2 * 2, Composition({'H': -4, 'C': -2, 'O': -4})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2 * 2, Composition({'H': -25, 'C': -19, 'O': -5, 'N': -5})),
                 MassShift(H-2C-1O-2 * 2 + Na-1, Composition({'Na': -1, 'C': -2, 'O': -4, 'H': -4})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2 * 2 + Na-1, Composition({'C': -19, 'Na': -1, 'O': -5, 'N': -5, 'H': -25})),
                 MassShift(H-2C-1O-2 * 2 + Na-1 * 2, Composition({'Na': -2, 'C': -2, 'O': -4, 'H': -4})),
                 MassShift(C-17H-21O-1N-5 + H-2C-1O-2 * 2 + Na-1 * 2, Composition({'C': -19, 'Na': -2, 'O': -5, 'N': -5, 'H': -25})),
                 MassShift(H-1 + H-2C-1O-2 * 2, Composition({'H': -5, 'C': -2, 'O': -4})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2 * 2, Composition({'H': -26, 'C': -19, 'O': -5, 'N': -5})),
                 MassShift(H-1 + H-2C-1O-2 * 2 + Na-1, Composition({'Na': -1, 'C': -2, 'O': -4, 'H': -5})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2 * 2 + Na-1, Composition({'C': -19, 'Na': -1, 'O': -5, 'N': -5, 'H': -26})),
                 MassShift(H-1 + H-2C-1O-2 * 2 + Na-1 * 2, Composition({'Na': -2, 'C': -2, 'O': -4, 'H': -5})),
                 MassShift(C-17H-21O-1N-5 + H-1 + H-2C-1O-2 * 2 + Na-1 * 2, Composition({'C': -19, 'Na': -2, 'O': -5, 'N': -5, 'H': -26})),
                 MassShift(H-1 * 2 + H-2C-1O-2 * 2, Composition({'H': -6, 'C': -2, 'O': -4})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2 * 2, Composition({'H': -27, 'C': -19, 'O': -5, 'N': -5})),
                 MassShift(H-1 * 2 + H-2C-1O-2 * 2 + Na-1, Composition({'Na': -1, 'C': -2, 'O': -4, 'H': -6})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2 * 2 + Na-1, Composition({'C': -19, 'Na': -1, 'O': -5, 'N': -5, 'H': -27})),
                 MassShift(H-1 * 2 + H-2C-1O-2 * 2 + Na-1 * 2, Composition({'Na': -2, 'C': -2, 'O': -4, 'H': -6})),
                 MassShift(C-17H-21O-1N-5 + H-1 * 2 + H-2C-1O-2 * 2 + Na-1 * 2, Composition({'C': -19, 'Na': -2, 'O': -5, 'N': -5, 'H': -27}))],
 'minimum_mass': 500.0,
 'msn_mass_error_tolerance': 2e-05,
 'n_processes': 4,
 'network': None,
 'output_path': u'test_nooff_all_0.db',
 'regularization_model': None,
 'regularize': None,
 'require_msms_signature': 0.0,
 'sample_path': 'test_nooff.mzML',
 'sample_run_id': -1,
 'scoring_model': ChromatogramScorer('line_score': <glycan_profiling.scoring.shape_fitter.ChromatogramShapeModel object at 0x7fad904585d0>, 'isotopic_fit': <glycan_profiling.scoring.isotopic_fit.IsotopicPatternConsistencyModel object at 0x7fad90458610>, 'spacing_fit': <glycan_profiling.scoring.spacing_fitter.ChromatogramSpacingModel object at 0x7fad90458650>, 'charge_count': <glycan_profiling.scoring.base.CompositionDispatchingModel object at 0x7fad90798190>, 'mass_shift_score': <glycan_profiling.models.mass_shift_models.GeneralizedFormateMassShiftModel object at 0x7fad9046b550>),
 'start_time': datetime.datetime(2019, 4, 13, 1, 17, 9, 591775),
 'status': 'started'}
00:52:18 - glycresoft:process      :53   - INFO - Begin Matching Chromatograms
00:52:18 - glycresoft:extract      :75   - INFO - ... Begin Extracting Chromatograms
00:52:40 - glycresoft:extract      :77   - INFO - ...... Aggregating Chromatograms
00:52:41 - glycresoft:extract      :56   - INFO - ... 661 Chromatograms Extracted.
00:52:42 - glycresoft:match        :286  - INFO - Matching Chromatograms
00:52:42 - glycresoft:match        :305  - INFO - Handling mass_shifts
00:52:42 - glycresoft:match        :180  - INFO - Begin Forward Search
01:00:35 - glycresoft:match        :123  - INFO - Begin Reverse Search
01:01:11 - glycresoft:match        :233  - INFO - Building Connected Components
01:01:26 - glycresoft:match        :239  - INFO - Validating 8 Components
01:01:27 - glycresoft:process      :55   - INFO - End Matching Chromatograms
01:01:27 - glycresoft:process      :56   - INFO - 303 Chromatogram Candidates Found
01:01:27 - glycresoft:process      :60   - INFO - Begin Evaluating Chromatograms
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any
Could not resolve element_symbol any
Could not find element any

.....hundreds of this.......

Segmentation fault (core dumped)

mobiusklein commented 5 years ago

Your adduct formulae should be positive, being added to the elemental composition of the glycans. Keep in mind the formulae should be specified with gain/loss of hydrogen according to whether they are charge carriers and polarity. What are the chemical shifts applied by your reducing end tag?
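
As a rough illustration of why the sign matters, the sketch below sums monoisotopic element masses for the tag formula from the command above. The `MONOISOTOPIC` table and the `composition_mass` helper are hypothetical names written for this example and are not part of GlycReSoft; the element masses are standard values rounded to eight decimals.

```python
# Monoisotopic element masses in Da (standard values, rounded).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
}


def composition_mass(composition):
    """Sum element masses weighted by their (possibly negative) counts."""
    return sum(MONOISOTOPIC[element] * count
               for element, count in composition.items())


# The tag formula from the thread, written as a gain (positive counts).
tag = {"C": 17, "H": 21, "O": 1, "N": 5}
# The same formula as typed in the first command (negative counts).
negated_tag = {element: -count for element, count in tag.items()}

glycan_mass = 1786.65  # A2FG2, as quoted in the first post
tag_mass = composition_mass(tag)
labeled_mass = glycan_mass + tag_mass + MONOISOTOPIC["H"]

print("tag mass: %.4f" % tag_mass)
print("glycan + tag + H: %.2f" % labeled_mass)
print("negated tag mass: %.4f" % composition_mass(negated_tag))
```

With positive counts the tag adds about 311.17 Da, which brings 1786.65 close to the 2098.82 quoted in the first post. With negative counts the same formula subtracts that mass, so the search would look for features lighter than the bare glycan.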

Your mass accuracy parameter is too large. If your data were acquired on an Orbitrap or Q-TOF instrument, use between 5e-6 (5 ppm) and 2e-5 (20 ppm). No instrument in use today should have an acceptable error range of 6e-4 (600 ppm). This could lead to bizarre chromatogram merges, and that could be what is triggering the segfault.
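
To make the scale concrete, here is a small sketch that converts a relative tolerance into ppm and into an absolute window at the 2098.82 mass mentioned earlier in the thread. `to_ppm` and `mass_window` are hypothetical helpers for this example only; they assume the `-m` value is a relative (fractional) error, as the 5e-6 = 5 ppm equivalence above implies.

```python
def to_ppm(relative_tolerance):
    """Express a fractional mass error in parts per million."""
    return relative_tolerance * 1e6


def mass_window(mass, relative_tolerance):
    """Return the (low, high) mass range accepted around `mass`."""
    delta = mass * relative_tolerance
    return mass - delta, mass + delta


mass = 2098.82  # tagged A2FG2 mass quoted in the first post
for tolerance in (6e-4, 2e-5, 5e-6):
    low, high = mass_window(mass, tolerance)
    print("%.0e -> %6.1f ppm, window %.4f to %.4f (+/- %.4f Da)" % (
        tolerance, to_ppm(tolerance), low, high, (high - low) / 2))
```

The `-m 6e-4` passed in the original command accepts anything within roughly 1.26 Da of the target, which is wider than the spacing between isotopic peaks. At 5e-6 the window shrinks to about 0.01 Da.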

The -f formate-adduct-model assumes you're using the built-in Formate name for the adduct, with a mass shift formula of CO2H2, as it appears in negative mode. It will adjust the score based upon how likely it thinks that the composition it sees will have a Formate adduct.


The messages about "element any" are a warning that the elemental formula of something (likely a monosaccharide or mass shift) contains the string "any", which the formula parser doesn't understand. That said, I don't see where it is coming from. None of your mass shift formulae contain it, nor does any glycan returned by the query that built your database. Even so, the error message aborts the operation safely, so it isn't the direct cause of the segfault.

bobaoai commented 5 years ago

Thanks a lot! Just a follow-up: I used the updated parameters today and now get the segfault you mentioned (even with a 5-10 ppm tolerance). The mzML file looks as good as expected, and I can manually spot sodium, formate, potassium and proton adducts for a single glycan. But it is very hard to annotate them with the correct glycan using search-glycan.

Neutral Mass | Total Signal | Charge States | Start Time | Apex Time | End Time
1236.4962 | 2.85E+04 | 2 | 11.89 | 11.89 | 12.01
2097.7955 | 2.34E+05 | 2 | 18.98 | 19.13 | 19.18
2097.8299 | 5.72E+05 | 2 | 18.98 | 19.15 | 19.22
2097.8641 | 2.18E+05 | 2 | 18.98 | 19.15 | 19.24
2097.9024 | 3.34E+05 | 2 | 18.98 | 19.07 | 19.23
2097.9515 | 4.96E+05 | 2 | 18.97 | 19.06 | 19.24
2098.7789 | 1.89E+05 | 2 | 19.04 | 19.06 | 19.17
2098.8163 | 3.80E+05 | 2 | 19.02 | 19.07 | 19.17
2098.8548 | 4.17E+05 | 2 | 18.99 | 19.08 | 19.16
2098.9071 | 2.37E+05 | 2 | 19.06 | 19.14 | 19.18
2098.9486 | 2.24E+05 | 2 | 19.08 | 19.1 | 19.16
2098.9874 | 1.60E+05 | 2 | 19.03 | 19.13 | 19.16
2099.7359 | 1.00E+05 | 2 | 19.13 | 19.13 | 19.15
2099.807 | 1.25E+05 | 2 | 19.09 | 19.09 | 19.1
2099.8883 | 1.20E+05 | 2 | 19.07 | 19.07 | 19.12
2099.9657 | 1.59E+05 | 2 | 19.04 | 19.13 | 19.14
2119.6939 | 3.93E+04 | 2 | 19.07 | 19.07 | 19.13 \potassium
2119.8604 | 4.14E+04 | 2 | 19.06 | 19.11 | 19.14
2119.9408 | 3.36E+04 | 2 | 19.08 | 19.08 | 19.13
2142.8224 | 3.49E+05 | 2 | 19 | 19.07 | 19.18 \
2142.8731 | 3.16E+05 | 2 | 19.01 | 19.07 | 19.18
2142.9469 | 3.31E+05 | 2 | 19.04 | 19.07 | 19.19
2142.9832 | 1.40E+05 | 2 | 19.02 | 19.1 | 19.14 \formate
2143.7779 | 7.61E+04 | 2 | 19.04 | 19.1 | 19.16
2388.8856 | 4.54E+05 | 2 | 21.11 | 21.39 | 21.45
2388.9457 | 7.20E+05 | 2 | 21.11 | 21.36 | 21.48
2389.0067 | 6.21E+05 | 2 | 21.1 | 21.4 | 21.46
2389.0562 | 4.44E+05 | 2 | 21.27 | 21.38 | 21.5
2389.9725 | 1.10E+05 | 2 | 21.37 | 21.39 | 21.45
2433.8644 | 3.51E+04 | 2 | 21.34 | 21.34 | 21.43
2433.9474 | 5.54E+04 | 2 | 21.35 | 21.36 | 21.4
2433.9969 | 1.46E+05 | 2 | 21.33 | 21.39 | 21.43
2462.8602 | 3.35E+04 | 2 | 23.35 | 23.46 | 23.46
2462.9242 | 4.61E+04 | 2 | 23.38 | 23.38 | 23.45
2462.9735 | 5.62E+04 | 2 | 23.36 | 23.38 | 23.42
2463.016 | 8.71E+04 | 2 | 23.35 | 23.4 | 23.44
2679.9693 | 2.10E+05 | 2 | 23.48 | 23.55 | 23.61
2680.0199 | 3.96E+05 | 2 | 23.48 | 23.57 | 23.63
2680.0624 | 3.17E+04 | 2 | 23.62 | 23.62 | 23.64
2680.1056 | 3.52E+05 | 2 | 23.51 | 23.57 | 23.63
2680.1939 | 1.11E+05 | 2 | 23.52 | 23.58 | 23.6

glycresoft analyze search-glycan \
-a "C17H21O1N5" 1 \
-a Formate 1 \
-a Sodium 1 \
-a Potassium 1 \
-a 'H1' 3 \
-o "test_nooff_xx2.db" -m 5e-6 combinatorial-database test_nooff.mzML 1 --export csv

Preparing analysis of 20190205_TK-EPO_Expres2ion_04 by Combinatorial CHO N-Glycans
18:51:06 - glycresoft:task         :22   - INFO - glycresoft: version 0.3.12
18:51:06 - glycresoft:task         :264  - INFO - Begin MzML Glycan Chromatogram Analyzer
{'analysis': None,
 'analysis_name': u'20190205_TK-EPO_Expres2ion_04 @ Combinatorial CHO N-Glycans',
 'database_connection': u'combinatorial-database',
 'delta_rt': 0.5,
 'grouping_error_tolerance': 1.5e-05,
 'hypothesis_id': 1,
 'mass_error_tolerance': 5e-06,
 'mass_shifts': [MassShift(Sodium, Composition({'Na': 1})),
                 MassShift(H1, Composition({'H': 1})),
                 MassShift(H1 + Sodium, Composition({'H': 1, 'Na': 1})),
                 MassShift(H1 * 2, Composition({'H': 2})),
                 MassShift(H1 * 2 + Sodium, Composition({'H': 2, 'Na': 1})),
                 MassShift(H1 * 3, Composition({'H': 3})),
                 MassShift(H1 * 3 + Sodium, Composition({'H': 3, 'Na': 1})),
                 MassShift(C17H21O1N5, Composition({'H': 21, 'C': 17, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + Sodium, Composition({'C': 17, 'H': 21, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1, Composition({'H': 22, 'C': 17, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 + Sodium, Composition({'C': 17, 'H': 22, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1 * 2, Composition({'H': 23, 'C': 17, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 * 2 + Sodium, Composition({'C': 17, 'H': 23, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1 * 3, Composition({'H': 24, 'C': 17, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 * 3 + Sodium, Composition({'C': 17, 'H': 24, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(Potassium, Composition({'K': 1})),
                 MassShift(Potassium + Sodium, Composition({'Na': 1, 'K': 1})),
                 MassShift(H1 + Potassium, Composition({'H': 1, 'K': 1})),
                 MassShift(H1 + Potassium + Sodium, Composition({'H': 1, 'K': 1, 'Na': 1})),
                 MassShift(H1 * 2 + Potassium, Composition({'H': 2, 'K': 1})),
                 MassShift(H1 * 2 + Potassium + Sodium, Composition({'H': 2, 'K': 1, 'Na': 1})),
                 MassShift(H1 * 3 + Potassium, Composition({'H': 3, 'K': 1})),
                 MassShift(H1 * 3 + Potassium + Sodium, Composition({'H': 3, 'K': 1, 'Na': 1})),
                 MassShift(C17H21O1N5 + Potassium, Composition({'C': 17, 'H': 21, 'K': 1, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + Potassium + Sodium, Composition({'C': 17, 'H': 21, 'K': 1, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1 + Potassium, Composition({'C': 17, 'H': 22, 'K': 1, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 + Potassium + Sodium, Composition({'C': 17, 'H': 22, 'K': 1, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1 * 2 + Potassium, Composition({'C': 17, 'H': 23, 'K': 1, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 * 2 + Potassium + Sodium, Composition({'C': 17, 'H': 23, 'K': 1, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + H1 * 3 + Potassium, Composition({'C': 17, 'H': 24, 'K': 1, 'O': 1, 'N': 5})),
                 MassShift(C17H21O1N5 + H1 * 3 + Potassium + Sodium, Composition({'C': 17, 'H': 24, 'K': 1, 'O': 1, 'N': 5, 'Na': 1})),
                 MassShift(Formate, Composition({'H': 2, 'C': 1, 'O': 2})),
                 MassShift(Formate + Sodium, Composition({'H': 2, 'C': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1, Composition({'H': 3, 'C': 1, 'O': 2})),
                 MassShift(Formate + H1 + Sodium, Composition({'H': 3, 'C': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1 * 2, Composition({'H': 4, 'C': 1, 'O': 2})),
                 MassShift(Formate + H1 * 2 + Sodium, Composition({'H': 4, 'C': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1 * 3, Composition({'H': 5, 'C': 1, 'O': 2})),
                 MassShift(Formate + H1 * 3 + Sodium, Composition({'H': 5, 'C': 1, 'O': 2, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate, Composition({'H': 23, 'C': 18, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + Sodium, Composition({'C': 18, 'H': 23, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1, Composition({'H': 24, 'C': 18, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 + Sodium, Composition({'C': 18, 'H': 24, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1 * 2, Composition({'H': 25, 'C': 18, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 * 2 + Sodium, Composition({'C': 18, 'H': 25, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1 * 3, Composition({'H': 26, 'C': 18, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 * 3 + Sodium, Composition({'C': 18, 'H': 26, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(Formate + Potassium, Composition({'H': 2, 'C': 1, 'K': 1, 'O': 2})),
                 MassShift(Formate + Potassium + Sodium, Composition({'C': 1, 'H': 2, 'K': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1 + Potassium, Composition({'H': 3, 'C': 1, 'K': 1, 'O': 2})),
                 MassShift(Formate + H1 + Potassium + Sodium, Composition({'C': 1, 'H': 3, 'K': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1 * 2 + Potassium, Composition({'H': 4, 'C': 1, 'K': 1, 'O': 2})),
                 MassShift(Formate + H1 * 2 + Potassium + Sodium, Composition({'C': 1, 'H': 4, 'K': 1, 'O': 2, 'Na': 1})),
                 MassShift(Formate + H1 * 3 + Potassium, Composition({'H': 5, 'C': 1, 'K': 1, 'O': 2})),
                 MassShift(Formate + H1 * 3 + Potassium + Sodium, Composition({'C': 1, 'H': 5, 'K': 1, 'O': 2, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + Potassium, Composition({'C': 18, 'H': 23, 'K': 1, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + Potassium + Sodium, Composition({'C': 18, 'H': 23, 'K': 1, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1 + Potassium, Composition({'C': 18, 'H': 24, 'K': 1, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 + Potassium + Sodium, Composition({'C': 18, 'H': 24, 'K': 1, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1 * 2 + Potassium, Composition({'C': 18, 'H': 25, 'K': 1, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 * 2 + Potassium + Sodium, Composition({'C': 18, 'H': 25, 'K': 1, 'O': 3, 'N': 5, 'Na': 1})),
                 MassShift(C17H21O1N5 + Formate + H1 * 3 + Potassium, Composition({'C': 18, 'H': 26, 'K': 1, 'O': 3, 'N': 5})),
                 MassShift(C17H21O1N5 + Formate + H1 * 3 + Potassium + Sodium, Composition({'C': 18, 'H': 26, 'K': 1, 'O': 3, 'N': 5, 'Na': 1}))],
 'minimum_mass': 500.0,
 'msn_mass_error_tolerance': 2e-05,
 'n_processes': 4,
 'network': None,
 'output_path': u'test_nooff_xx2.db',
 'regularization_model': None,
 'regularize': None,
 'require_msms_signature': 0.0,
 'sample_path': 'test_nooff.mzML',
 'sample_run_id': -1,
 'scoring_model': ChromatogramScorer('line_score': <glycan_profiling.scoring.shape_fitter.ChromatogramShapeModel object at 0x7f6cb5f4c5d0>, 'isotopic_fit': <glycan_profiling.scoring.isotopic_fit.IsotopicPatternConsistencyModel object at 0x7f6cb5f4c610>, 'spacing_fit': <glycan_profiling.scoring.spacing_fitter.ChromatogramSpacingModel object at 0x7f6cb5f4c650>, 'charge_count': <glycan_profiling.scoring.base.CompositionDispatchingModel object at 0x7f6cb628c190>),
 'start_time': datetime.datetime(2019, 4, 13, 18, 51, 6, 978049),
 'status': 'started'}
18:51:10 - glycresoft:profiler     :458  - INFO - The smallest possible database mass is 910.327780, raising the minimum mass to extract.
18:51:10 - glycresoft:process      :53   - INFO - Begin Matching Chromatograms
18:51:10 - glycresoft:extract      :75   - INFO - ... Begin Extracting Chromatograms
18:52:00 - glycresoft:extract      :77   - INFO - ...... Aggregating Chromatograms
18:52:00 - glycresoft:extract      :56   - INFO - ... 289 Chromatograms Extracted.
18:52:01 - glycresoft:match        :286  - INFO - Matching Chromatograms
18:52:01 - glycresoft:match        :305  - INFO - Handling mass_shifts
18:52:01 - glycresoft:match        :180  - INFO - Begin Forward Search
18:52:02 - glycresoft:match        :123  - INFO - Begin Reverse Search
18:52:02 - glycresoft:match        :233  - INFO - Building Connected Components
18:52:02 - glycresoft:match        :239  - INFO - Validating 102 Components
18:52:02 - glycresoft:process      :55   - INFO - End Matching Chromatograms
18:52:02 - glycresoft:process      :56   - INFO - 129 Chromatogram Candidates Found
18:52:02 - glycresoft:process      :60   - INFO - Begin Evaluating Chromatograms
18:52:55 - glycresoft:evaluate     :138  - INFO - Collapsing Duplicates
18:52:55 - glycresoft:evaluate     :91   - INFO - Pruning mass shift branches
18:52:55 - glycresoft:evaluate     :93   - INFO - Re-evaluating after mass shift pruning
18:52:56 - glycresoft:evaluate     :138  - INFO - Collapsing Duplicates
18:52:56 - glycresoft:process      :73   - INFO - End Evaluating Chromatograms
18:52:56 - glycresoft:profiler     :506  - INFO - Saving solutions
18:52:57 - glycresoft:analysis_migr:231  - INFO - Migrating Hypothesis
18:52:58 - glycresoft:analysis_migr:233  - INFO - Migrating Sample Run
18:52:59 - glycresoft:analysis_migr:212  - INFO - ... Migrating Scans
18:52:59 - glycresoft:analysis_migr:217  - INFO - ... Migrating Peaks
18:52:59 - glycresoft:analysis_migr:235  - INFO - Creating Analysis Record
18:53:02 - glycresoft:task         :270  - INFO - End MzML Glycan Chromatogram Analyzer
18:53:02 - glycresoft:task         :271  - INFO - Started at 2019-04-13 18:51:06.978049.
Ended at 2019-04-13 18:53:02.040953.
Total time elapsed: 0:01:55.062904
MzMLGlycanChromatogramAnalyzer completed successfully.

2019-04-13 18:53:02.041770 Handling Export: csv
Segmentation fault (core dumped)
mobiusklein commented 5 years ago

The search completed and the result file (test_nooff_xx2.db) was created. Did it segfault while writing the CSV export, or was the CSV file written before the program segfaulted?

Are your data in negative mode?

When you specify that many adducts, you're allowing the algorithm a lot of freedom to match the mass of any chromatographic feature with a glycan identity. Do you have MS2 spectra to support them?

mobiusklein commented 5 years ago

What do you define a proton adduct as?

All mass calculations are done in the neutral mass domain. If a proton adduct is the gain of a hydrogen that does not change the charge state of the ion, independent of another chemical modification, then it is valid.

bobaoai commented 5 years ago

Hi Joshua,

Thanks for the prompt reply!

The segfault error occurs when exporting the CSV file, so I get a 0 byte csv file.

I was told by the colleague who ran the experiment that our glycans might have these adducts (potassium, sodium, formate) and that most of our glycans are 2+ charged. I assumed the 2+ charge could be caused by 2 H+. We also have a Flora tag 'C17H21O1N5' that is attached to every glycan. It would be much easier if I could set the mass_shifts in search_glycan directly instead of letting the code generate every combination. Can I do this by modifying the Python code? If adding adducts like that doesn't make sense to you, I will check with my colleague.

I am sorry for the trouble; I am still learning to annotate mass spectra. I will try to make everything clear on my end first. Thanks for your generous help!

mobiusklein commented 5 years ago

Thank you for the clarification. Mass spectrometry can be complicated, and that requires the software be complicated, though perhaps I could have made the interface easier to use. That you're getting a segfault there is strange. Would you be able to share the output file and inputs with me?

From the sign of the charge you're mentioning, the data you have were acquired in positive mode. I don't think you need to specify a proton adduct (see below for a long-winded explanation), but Sodium and Potassium were created for negative mode data, so you'll need to specify the formulae Na1H-1 and K1H-1 to reflect that they replace a proton when calculating neutral mass.

If you want to manipulate the logic yourself directly, search-glycan creates a glycan_profiling.profiler.MzMLGlycanChromatogramAnalyzer object which in turn instantiates a glycan_profiling.trace.process.LogitSumChromatogramProcessor, which finally creates an instance of glycan_profiling.trace.match.GlycanChromatogramMatcher which does the actual matching of chromatograms against a database and handles the matching of adducts.

During the setup code of search-glycan the program generates a Cartesian product of your adduct specification using MzMLGlycanChromatogramAnalyzer.expand_mass_shifts. You would instead want to create your mass shifts directly using arithmetic operations on glycan_profiling.chromatogram_tree.MassShift objects.
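
The combinatorial growth can be sketched in plain Python. This is an illustration with `itertools.product`, not GlycReSoft's implementation; `expand_adducts` and the variable names are hypothetical, and each adduct is represented only by its name and maximum multiplicity.

```python
from itertools import product


def expand_adducts(spec):
    """Enumerate every non-empty combination of adduct multiplicities.

    `spec` is a list of (name, max_multiplicity) pairs, mirroring the
    repeated `-a NAME COUNT` options on the command line.
    """
    names = [name for name, _ in spec]
    ranges = [range(maximum + 1) for _, maximum in spec]
    combinations = []
    for counts in product(*ranges):
        if any(counts):  # skip the all-zero "no adduct" case
            combinations.append(
                {name: n for name, n in zip(names, counts) if n})
    return combinations


# Adduct specifications from the two search-glycan commands in the thread.
first_run = [("C-17H-21O-1N-5", 1), ("H-2C-1O-2", 2), ("H-1", 2), ("Na-1", 2)]
second_run = [("C17H21O1N5", 1), ("Formate", 1), ("Sodium", 1),
              ("Potassium", 1), ("H1", 3)]

print(len(expand_adducts(first_run)))
print(len(expand_adducts(second_run)))

# Keep only combinations that always carry the tag exactly once.
pinned = [combo for combo in expand_adducts(second_run)
          if combo.get("C17H21O1N5") == 1]
print(len(pinned))
```

The two counts agree with the number of `MassShift` entries listed in the logs above (53 and 63). Requiring the tag in every combination, as in `pinned`, cuts the second list roughly in half, which is the kind of restriction a hand-built list of mass shifts would express.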


When you preprocessed the mzML file, the deconvoluter identified isotopic patterns, identified the monoisotopic peak and charge state, and then replaced each pattern with a single peak whose m/z was the monoisotopic peak's m/z, with intensity equal to the sum of all the isotopic peaks' intensities, and a known charge. This information was then written out into a new mzML file. When calculating the neutral mass of an ion (which was used to re-calibrate the true monoisotopic m/z), it used the formula (mz * |z|) - (z * c) where c is the mass of the charge carrier. The charge carrier may be many things (see the Fiehn Lab's Table for examples), but for simplicity's sake, the deconvoluter always uses the mass of a proton (1.0072) for c.
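
The neutral mass formula can be written out as a short sketch. `neutral_mass` and `mz_from_neutral` are hypothetical helper names for this example, and the proton mass is given to more digits than the rounded 1.0072 quoted above.

```python
PROTON = 1.00727646688  # proton mass in Da (quoted above as 1.0072)


def neutral_mass(mz, z, charge_carrier=PROTON):
    """Apply (mz * |z|) - (z * c) for a signed charge z."""
    return mz * abs(z) - z * charge_carrier


def mz_from_neutral(mass, z, charge_carrier=PROTON):
    """Invert neutral_mass: the m/z at which a neutral mass is observed."""
    return (mass + z * charge_carrier) / abs(z)


mass = 2098.8163  # one of the neutral masses listed earlier in the thread

# Positive mode, 2+: two protons were added, so they are subtracted back out.
mz_positive = mz_from_neutral(mass, 2)
print("z=+2  m/z %.4f -> neutral %.4f" % (
    mz_positive, neutral_mass(mz_positive, 2)))

# Negative mode, 2-: two protons were lost, so the formula adds them back.
mz_negative = mz_from_neutral(mass, -2)
print("z=-2  m/z %.4f -> neutral %.4f" % (
    mz_negative, neutral_mass(mz_negative, -2)))
```

Because the sign of z flips the correction, the same formula covers both polarities as long as the charge carrier really is a proton.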

This means a proton adduct isn't necessary. It also means that when in positive mode, when you have an ion which has a charge carrier that isn't a proton, you need to deduct the proton when you add the other charge carrier. This means that if you have a sodium adduct, you would specify it as Na1H-1. The named adducts other than Ammonium were created when I was dealing entirely with negative mode data, and there wasn't a good mechanism for expressing the polarity dependence.
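
A small sketch of the resulting mass shifts, assuming standard monoisotopic element masses and ignoring the electron mass. `MONOISOTOPIC` and `shift_mass` are hypothetical names used only here.

```python
# Monoisotopic element masses in Da (standard values, rounded).
MONOISOTOPIC = {
    "H": 1.00782503,
    "Na": 22.98976928,
    "K": 38.96370649,
}


def shift_mass(composition):
    """Mass change for an adduct written as element -> signed count."""
    return sum(MONOISOTOPIC[element] * count
               for element, count in composition.items())


# Positive-mode adducts: the metal replaces the proton that the
# deconvoluter already accounted for as the charge carrier.
sodium_positive = {"Na": 1, "H": -1}    # Na1H-1
potassium_positive = {"K": 1, "H": -1}  # K1H-1

print("Na1H-1: %.4f Da" % shift_mass(sodium_positive))
print("K1H-1:  %.4f Da" % shift_mass(potassium_positive))
print("Na1 alone: %.4f Da" % shift_mass({"Na": 1}))
```

Using the bare `Na1` formula in positive mode would overshoot by about one hydrogen mass, since the proton would then be counted twice.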

bobaoai commented 5 years ago

Thanks for every great suggestion you mentioned!

I tested the reformatted adducts as you suggested on the Windows version of glycresoft. The result looks better. However, I ran into some bugs with the Linux command line version:

  1. For the adducts, if I use
    glycresoft analyze search-glycan -a "C17H21O1N5" 1 -a 'H1' 1 -a 'NaH-1' 1 -a Formate 1 -a 'KH-1' 1 -o test_04_14_0.db -m 1e-5 combinatorial-database test_nooff_0413_averaging2.mzML 1 --export csv

    it gives hundreds of these errors:

    Could not resolve element_symbol key
    Could not find element key

    If I change it to the following (I manually deducted one hydrogen from the FloraTag C17H21O1N5):

    glycresoft analyze search-glycan -a "C17H20O1N5" 1 -a 'H1' 1 -a Sodium 1 -a Formate 1 -a Potassium 1 -o test_04_14_0.db -m 1e-5 combinatorial-database test_nooff_0413_averaging2.mzML 1 --export csv

    it returns an error when exporting the CSV file. I checked explicitly with the following command:

glycresoft export glycan-identification -r -o aa.html test_04_14_0.db 1
Segmentation fault (core dumped)

I will send you the mzML file and the parameters I used through Google Drive. After solving these, I am looking forward to trying customized mass shifts on the Linux version. You are very, very helpful. Much appreciated!

mobiusklein commented 5 years ago

I think I found the problem related to memory corruption from brainpy. Can you tell me which version you're using?

bobaoai commented 5 years ago

I am using the python from miniconda2

Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 19:04:19)

Should I use another version of python?

I upgraded to 1.5.3 it works now

brain-isotopic-distribution in /home/bokan/miniconda2/lib/python2.7/site-packages (1.5.3)

Thanks for the help!

mobiusklein commented 5 years ago

Could you try installing brainpy from master at https://github.com/mobiusklein/brainpy and run again? I was experimenting with string interning. It didn't seem to cause an issue during testing, but it allowed memory corruption to leak in when the appropriate preconditions weren't met.

mobiusklein commented 5 years ago

Also, this is ion trap data you're looking at. The mass accuracy on these instruments is on the order of 5e-4, and the resolution may be insufficient to separate isotopic peaks. GlycReSoft was written for high resolution instruments, so I don't know how well it will behave on this.

bobaoai commented 5 years ago

Hi Joshua,

Last time, I updated brainpy and the problem was solved.

Yesterday, I was playing with the function search_glycan in cli.analyze and tried to modify the 'mass_shifts' and 'expand' variables to fix the FloraTag.

I have two questions and have found one bug.

The first question is about the variable hypothesis_identifier. It looks like it should be an int according to

try:
    hypothesis = get_by_name_or_id(
        database_connection, GlycanHypothesis, hypothesis_identifier)
except Exception:
    click.secho("Could not locate a Glycan Hypothesis with identifier %r" %
                hypothesis_identifier, fg='yellow')
    raise click.Abort()

I can only find that it might be generated here, but I am not sure. Is it a fixed variable with a certain value?

@click.pass_context
@database_connection()
@sample_path
@hypothesis_identifier("glycan")

The second question came up when I was trying to import cli.profiler:

sys.path.insert(0, '/home/bokan/glycresoft/')
sys.path.insert(0, '/home/bokan/glycresoft/glycan_profiling/')
from profiler import *

It said these items could not be imported when I used the original code, which is marked with "#". In order to import it in a notebook, I had to modify the code as shown below:

ms_peak_picker/__init__.py

#from . import peak_statistics
import peak_statistics
#from . import search
import search

#from . import fticr_denoising
import fticr_denoising

#from . import fft_patterson_charge_state
import fft_patterson_charge_state

#from . import scan_filter

One bug I found is below. I just recompiled glycresoft and hit this issue. It looks like you just updated ms_deisotope and the bug popped up.

 glycresoft analyze search-glycan -a "C17H21O1N5" 1 -a 'H1' 1 -a 'NaH-1' 1 -a Formate 1 -a 'KH-1' 1 -o test_04_14_5.db -m 1e-5 combinatorial-database test_nooff_0413_averaging10.mzML 1 --export csv

Traceback (most recent call last):
  File "/home/bokan/miniconda2/bin/glycresoft", line 11, in <module>
    load_entry_point('glycan-profiling==0.3.12', 'console_scripts', 'glycresoft')()
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/glycan_profiling-0.3.12-py2.7-linux-x86_64.egg/glycan_profiling/cli/__main__.py", line 39, in main
    base.cli.main(standalone_mode=True)
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/glycan_profiling-0.3.12-py2.7-linux-x86_64.egg/glycan_profiling/cli/analyze.py", line 533, in search_glycan
    sample_run = ms_data.sample_run
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/ms_deisotope-0.0.8-py2.7-linux-x86_64.egg/ms_deisotope/output/mzml.py", line 1156, in sample_run
    self._sample_run = self._make_sample_run()
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/ms_deisotope-0.0.8-py2.7-linux-x86_64.egg/ms_deisotope/output/mzml.py", line 1120, in _make_sample_run
    samples = self.samples()
  File "/home/bokan/miniconda2/lib/python2.7/site-packages/ms_deisotope-0.0.8-py2.7-linux-x86_64.egg/ms_deisotope/data_source/mzml.py", line 552, in samples
    name = sample_.pop("sampleName", None)
AttributeError: 'str' object has no attribute 'pop'

mobiusklein commented 5 years ago

Thank you for reporting the issue about sample_run

I was working on an idle chore of wrapping the sampleList component of mzML in a proper API, but obviously it was not covered by any tests within ms_deisotope. I've fixed this in ms_deisotope/b96b1a0.
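For anyone hitting the same traceback: the error occurs when code that expects a dict-like sample entry receives a plain string instead. Below is a minimal sketch of that failure mode and one defensive way to handle it. The helper names `normalize_sample` and `sample_name`, the `"id"` key, and the choice to treat a bare string as the sample name are all assumptions made for illustration; this is not the change made in ms_deisotope.

```python
def normalize_sample(sample):
    """Return a dict for a sample entry that may be a dict or a bare string.

    Hypothetical helper for illustration, not part of ms_deisotope.
    """
    if isinstance(sample, dict):
        return dict(sample)  # copy so pop() does not mutate the caller's data
    # Assumption: a bare string is taken to be the sample name itself.
    return {"sampleName": str(sample)}


def sample_name(sample):
    """Extract the sample name without assuming the entry is a dict."""
    entry = normalize_sample(sample)
    return entry.pop("sampleName", None)


# Without the guard, calling .pop() on the second input would raise
# AttributeError: 'str' object has no attribute 'pop'.
print(sample_name({"sampleName": "EPO_04", "id": "sample_1"}))
print(sample_name("EPO_04"))
```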

mobiusklein commented 5 years ago

As for the hypothesis identifier, it should be one of

  1. The integer primary key value of the hypothesis. Usually 1 unless you intentionally store multiple distinct hypotheses of the same type in the same database file.
  2. The human readable name of the hypothesis, usually specified with the -n option when building the hypothesis. If a name wasn't provided, a UUID-based name is generated. If there are spaces in the name, you'll need to quote it.

The hypothesis_identifier decorator is a function that returns an appropriate click.Argument object with the right documentation.
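To make the two accepted forms concrete, here is a minimal sketch of resolving an identifier that may be either a primary key or a name. It is not glycresoft's `get_by_name_or_id`; the `Hypothesis` class, the `resolve_hypothesis` function, and the choice to try the integer key before the name are assumptions for illustration only.

```python
class Hypothesis(object):
    """Hypothetical stand-in for a stored hypothesis record."""

    def __init__(self, id, name):
        self.id = id
        self.name = name


def resolve_hypothesis(records, identifier):
    """Find a record by integer primary key first, then by name."""
    # Command line arguments arrive as strings, so "1" must be tried as an int.
    try:
        key = int(identifier)
    except (TypeError, ValueError):
        key = None
    if key is not None:
        for record in records:
            if record.id == key:
                return record
    for record in records:
        if record.name == identifier:
            return record
    raise LookupError("No hypothesis with identifier %r" % (identifier,))


records = [Hypothesis(1, "CHO N-Linked Glycans")]
print(resolve_hypothesis(records, "1").name)
print(resolve_hypothesis(records, "CHO N-Linked Glycans").id)
```

Under these assumptions, passing `1` and passing the quoted name on the command line select the same hypothesis.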

The import issue is difficult to follow. Why aren't you using an editable installation of glycresoft, and then just running from glycan_profiling import profiler? The imports you're commenting out are in ms_peak_picker, a totally different module. You're using Python 2.7, correct? By appending glycan_profiling to your path, you're making each submodule a top-level namespace, potentially masking many imports.
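The path problem can be reproduced with a throwaway package. The sketch below uses only the standard library; `demo_pkg` and `demo_search` are made-up names, not modules from glycresoft or ms_peak_picker. It shows that putting a package's own directory on `sys.path` exposes its submodules a second time as top-level modules.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package named demo_pkg containing one submodule.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
    fh.write("")
with open(os.path.join(pkg_dir, "demo_search.py"), "w") as fh:
    fh.write("WHO = __name__\n")

# Putting the package's parent directory on the path keeps the namespace.
sys.path.insert(0, root)
inside = importlib.import_module("demo_pkg.demo_search")

# Putting the package directory itself on the path exposes the same file
# again, this time as a separate top-level module.
sys.path.insert(0, pkg_dir)
top_level = importlib.import_module("demo_search")

print(inside.WHO)            # demo_pkg.demo_search
print(top_level.WHO)         # demo_search
print(inside is top_level)   # False
```

A module loaded as top-level this way has no parent package, so a `from . import ...` line inside it has nothing to resolve against. That may be why the relative imports appeared to fail, and a top-level name such as `search` can also shadow an unrelated module of the same name.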

bobaoai commented 5 years ago

Perfect! I just duplicated search_glycan in cli.analyze and modified the copy into a new search_glycan_with_tag that has a new parameter --tag, which is like a fixed-size mass_shifts. Then I ran glycresoft analyze search-glycan-with-tag on the command line and it works perfectly.

bobaoai commented 5 years ago

Hi Joshua, I found an issue with how glycresoft picks the monoisotopic peak. The raw data looks like this: image

For example, in this case, I was told that the base should be the leftmost peak, which is around 1050.0. However, when I pull out the processed mzML file, I find that the peak glycresoft annotated is around 1050.42, the highest peak. Since it is a peak with 2 charges, this causes a shift of 1 in mass on the profile, so the processed profile looks like this: image

Thus, the correct monoisotopic peak is found in one part of the retention time range, but a peak shifted by one or two mass units is assigned in the other part, because the base peak doesn't have the highest abundance. This is especially visible around (19.08 min; m/z 1049.9): image

What do you think about this issue? Is it easy to solve in glycresoft? If it is very hard to solve, I can just add one proton to mass_shift to handle this.
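The arithmetic behind the shift of 1 can be sketched as follows. For a doubly charged ion, neighbouring isotopic peaks sit about 0.5 m/z apart, so picking the next peak over moves the neutral mass by about 1 Da. The sketch assumes positive-mode protonated ions, and the m/z of 1050.0 is an illustrative value, not a measurement from this data.

```python
PROTON = 1.00727646688       # mass of a proton in Da
ISOTOPE_SPACING = 1.0033548  # 13C minus 12C mass difference in Da


def neutral_mass(mz, charge):
    """Neutral mass of a positively charged, protonated ion (assumed)."""
    return mz * charge - charge * PROTON


charge = 2
mono_mz = 1050.0                              # illustrative, not measured
next_mz = mono_mz + ISOTOPE_SPACING / charge  # first isotopic peak

shift = neutral_mass(next_mz, charge) - neutral_mass(mono_mz, charge)
print(round(next_mz - mono_mz, 4))  # 0.5017, spacing in m/z
print(round(shift, 4))              # 1.0034, shift in neutral mass (Da)
```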

mobiusklein commented 5 years ago

Your data are from an ion trap, which means the mass error range is much larger than what GlycReSoft was designed for. The deconvolution algorithm assumes up to 20 ppm mass error within an isotopic pattern, but your instrument's range is between 50 and 100 ppm. This makes it much more likely that the algorithm will fail to consider a peak that appears to belong to the isotopic pattern, simply because the m/z of its centroid is a little too far to the left or right.

You could add "error_tolerance": 1e-4 to the ms1_deconvolution_args, overriding the 2e-5 default for the preprocessing tool, but the peaks in this data are still going to be low resolution, which means that just finding the centroid may introduce more error.
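To give a feel for what the two tolerance values mean at this m/z, here is a minimal sketch of a relative-error check. It is not the deconvolution algorithm; it assumes the tolerance is applied as a plain relative error, and the two m/z values are illustrative rather than taken from the data. `mz_window` and `within_tolerance` are hypothetical helper names.

```python
def mz_window(mz, tolerance):
    """Half-width in m/z of a relative (parts-per-one) tolerance."""
    return mz * tolerance


def within_tolerance(observed, expected, tolerance):
    """True if the relative error of observed vs expected is within tolerance."""
    return abs(observed - expected) / expected <= tolerance


expected = 1050.50  # illustrative m/z of an expected isotopic peak
observed = 1050.56  # illustrative centroid, roughly 57 ppm away

# Compare the 2e-5 default mentioned above with the suggested 1e-4.
for tol in (2e-5, 1e-4):
    print(tol, mz_window(expected, tol), within_tolerance(observed, expected, tol))
```

With these numbers the centroid falls outside the 2e-5 window (about 0.02 m/z) but inside the 1e-4 window (about 0.1 m/z), which is the kind of peak the wider setting would let the algorithm consider.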

Are you using scan averaging (-g)? That may help smooth over the problem, especially if you don't have MS2 scans and you can use a value larger than 1.
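As a rough picture of why averaging helps, the sketch below averages each scan with its neighbours. It assumes all scans share the same m/z grid and that the window counts neighbouring scans on each side. Both are simplifications for illustration; `average_scans` is a hypothetical function, not the preprocessing tool's implementation.

```python
def average_scans(scans, g):
    """Average each scan with up to g neighbours on each side.

    scans: list of equal-length intensity lists on a shared m/z grid.
    """
    averaged = []
    for i in range(len(scans)):
        lo = max(0, i - g)
        hi = min(len(scans), i + g + 1)
        window = scans[lo:hi]
        # zip(*window) walks the scans bin by bin.
        averaged.append([sum(col) / float(len(window)) for col in zip(*window)])
    return averaged


# Three consecutive scans over the same three m/z bins. The first bin
# (the monoisotopic peak) is missing from the middle scan.
scans = [
    [40.0, 100.0, 60.0],
    [0.0, 90.0, 50.0],
    [50.0, 110.0, 70.0],
]
print(average_scans(scans, 1)[1])  # [30.0, 100.0, 60.0]
```

After averaging, the middle scan regains a monoisotopic signal, so the pattern is less likely to be assigned starting from the wrong peak.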

bobaoai commented 5 years ago

Sounds great! I will try "error_tolerance": 1e-4. Yes, I tried using scan averaging and it is helpful.

So far, GlycReSoft looks promising in our preliminary annotation. We are looking forward to the final result.