sirius-ms / sirius

SIRIUS is a software for discovering a landscape of de-novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS Software (GUI and CLI)
GNU Affero General Public License v3.0

Formula decomposition of higher masses #27

Closed nirshahaf closed 2 months ago

nirshahaf commented 3 years ago

Sirius team,

Apart from the blocking issues, which seem (independently) to be due to connection problems with the CSI server, I see very slow convergence of the algorithm when running on an .ms file of a larger compound with m/z = 1388, corresponding to [M+NH4]+. The run parameters were optimized for this situation by reducing the PPM threshold and reserving sufficient resources. Nevertheless, the Sirius command line took roughly a day(!) to finish, see:

sirius -i 015.ms -o ./Sirius/015 --cores=12 --ignore-formula formula --ppm-max=5 --ppm-max-ms2=15 -e='CHNOPSCl[3]Br[2]Na' --candidates=10 --ilp-solver='GUROBI'

The input spectra are below. Do you have any idea what is going on? Could it be related to the Java runtime parameters, or to the large number of putative fragments (originating from the DIA method)?

compound NP-008069 formula C61H94O34 parentmass 1388.59904067667 ionization [M+NH4]+

ms1 1371.57494392328 1025.9755859375 1372.57849610784 663.23046875 1373.56091308594 133.44140625 1373.58071729741 333.5439453125 1374.58309329047 97.9529418945312 1375.5654840212 39.0334777832031 1375.61694335938 30.9311218261719 1388.59904067667 1393.7451171875 1389.60537001044 919.021484375 1390.61263799646 394.663330078125 1391.02517184432 17.2550659179688 1391.61883311232 93.4871826171875 1391.64147949219 74.007080078125 1392.65447228716 23.4032440185547

ms2 83.0464075469008 245.248413085938 84.0492461711597 12.1519012451172 105.068158640417 32.4050598144531 111.042515340518 1430.7685546875 111.407310723258 8.61387634277344 111.443288110774 10.126579284668 111.49466964541 6.84843826293945 111.56425560776 8.10126495361328 111.654724121821 8.10126495361328 111.794356094135 6.07595062255859 111.880957345395 11.1485443115234 111.954912147972 15.1898727416992 112.047352787194 123.935729980469 112.483057235946 8.10126495361328 113.04924080486 18.2278442382812 133.101032172488 41.5189819335938 147.11757193408 53.6708984375 153.053653994456 2304.45703125 153.388392130322 9.11392211914062 153.516252004507 12.1139068603516 153.545457032096 8.71035003662109 153.593770181811 15.3144912719727 153.64470464468 19.5127868652344 153.704257084699 13.1645584106445 153.728306154381 7.65116882324219 153.769238996538 11.1392440795898 153.832549266069 10.6513290405273 153.871194687431 15.4863815307617 153.93441710211 12.716682434082 153.967132805929 9.68875122070312 154.059091006073 249.80126953125 154.119863921841 12.1519012451172 154.167065513166 8.40129089355469 154.204067041253 10.1253128051758 154.238483539865 8.60770416259766 154.287281503401 8.63239288330078 154.354295312583 9.11392211914062 154.442366571632 12.802001953125 154.458579268121 13.6596527099609 154.516618876556 7.53665161132812 154.597967058099 9.56630706787109 154.741877511563 7.60279083251953 154.867932118902 8.80154418945312 155.057186257394 33.417724609375 161.132132245247 39.7950744628906 171.066157321442 184.842407226562 172.068898882286 15.1898727416992 203.179434771839 439.214111328125 204.185131411275 77.8417358398438 205.195457121799 71.0940551757812 206.199091397854 13.1645584106445 213.076358931849 413.51806640625 214.081604128092 61.6261596679688 221.19191080542 56.7674255371094 225.150462624481 28.5277709960938 253.145769250348 30.3797454833984 271.017300561015 8.10126495361328 273.097346889741 1529.84765625 273.915130615234 16.1784820556641 
274.104743725598 241.557006835938 274.313159180022 8.10126495361328 274.532276354022 8.10126495361328 274.801666259482 13.1645584106445 275.104747385923 55.3226013183594 275.667171499961 8.93072509765625 276.108852648378 7.22447967529297 327.01580132435 10.126579284668 335.223267414442 40.77490234375 375.12562966097 16.0544128417969 387.044061882768 31.8549346923828 388.041061401409 7.08860778808594 389.049219385066 13.3026580810547 390.044051096872 11.0856628417969 391.037421516486 13.8551330566406 429.045277913892 19.2405090332031 435.149311018644 180.959106445312 436.155138304358 42.5316467285156 437.156342045052 20.2531585693359 447.062735513126 41.5189819335938 448.059813189891 15.1898727416992 449.059782854491 33.7029724121094 451.053728262932 10.126579284668 459.063733139018 31.2515411376953 461.053361060316 10.3263168334961 489.076601895619 22.2784881591797 507.073951707729 14.4237823486328 513.172417463455 14.4203948974609 567.197453937355 62.7848205566406 568.195965237054 16.2025299072266 585.196779097862 16.6740417480469 637.323992485113 67.8480834960938 638.326947745576 33.417724609375 639.126516966879 73.3114013671875 641.128830943476 37.8892211914062 641.152369988391 32.8050537109375 647.219895535169 44.4820861816406 647.724347795395 13.1645584106445 648.213351767236 14.8246917724609 655.333107352198 220.65283203125 656.334681209509 95.7450561523438 677.236541863835 16.2025299072266 705.256645257799 22.1676177978516 707.238495741311 552.73828125 708.244755768619 203.0302734375 709.247389400702 114.836059570312 710.254508101981 26.5077514648438 712.293399741442 24.42236328125 713.244267236952 65.3881225585938 713.739835267102 55.22412109375 714.247355511125 82.9747314453125 714.746865060184 60.5936584472656 715.241382631612 49.7239685058594 715.734835319859 46.4164428710938 716.232969030739 10.9199066162109 716.753662479995 12.1519012451172 717.251519725418 38.7474060058594 727.313944260083 11.7802963256836 769.381899986102 25.3446044921875 
787.373217106504 70.6663208007812 788.387105843535 32.4050598144531 805.408736998906 13.9015274047852 833.280428532967 24.7292022705078 839.285756703998 37.0708618164062 840.282628786684 15.9805755615234 841.288116921971 42.3271179199219 842.293898423669 16.7073364257812 843.318412584578 28.8607635498047 869.296843485573 453.271728515625 870.298953955147 186.107055664062 871.310912499797 67.4124145507812 872.321418289423 24.3453369140625 887.298386765163 27.2509918212891 909.414671463409 25.3164520263672 919.424468858016 16.1927032470703 927.423522949219 35.0611572265625 927.437877308387 40.5063171386719 928.426100877904 23.3436737060547 1001.33264226971 40.2743530273438 1020.35634155276 16.2025299072266 1071.47793216169 15.6688461303711 1072.48083496094 20.0883941650391 1089.48340305908 153.455444335938 1090.479354639 93.2432861328125 1091.4862340367 40.8655395507812 1106.44499180906 12.8054504394531 1107.48572357247 76.3225708007812 1108.4899059212 55.7180480957031 1124.52558032729 13.0770568847656 1160.38052034597 39.3897094726562 1160.39770507812 37.4783630371094 1161.39329244002 26.2115325927734 1221.51934349623 61.9869384765625 1222.29489839963 15.1898727416992 1222.49035644531 30.8860778808594 1222.53234863281 32.4050598144531 1223.54123535173 14.1772155761719 1239.5293695344 90.3782958984375 1240.53419873005 66.0807495117188 1241.52443343886 29.3670959472656 1256.55279828367 24.3453826904297 1292.43559755503 54.3342895507812 1293.45704348615 59.1096496582031 1294.4746067796 28.8607635498047 1332.48965731506 29.3670959472656 1332.53833007812 21.2658233642578 1333.01085486806 21.2658233642578 1333.50175548727 22.2784881591797 1333.9893951729 14.3023986816406 1353.57559841698 19.1224975585938 1354.55534323493 14.1149215698242 1371.57709521761 82.7237548828125 1372.58424410134 50.5753784179688 1373.57772084761 20.9331665039062
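As a quick sanity check on the input above, the theoretical m/z of the [M+NH4]+ ion of C61H94O34 can be compared with the stated parent mass. This is a minimal sketch using standard monoisotopic masses; the helper names (`neutral_mass`, `ppm_error`) are made up for illustration and are not part of Sirius.

```python
# Monoisotopic masses (u) of the most abundant isotopes, plus the electron mass.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.9949146196}
ELECTRON = 0.00054857990946


def neutral_mass(counts):
    """Monoisotopic mass of a neutral formula given as {element: count}."""
    return sum(MONO[element] * n for element, n in counts.items())


def ppm_error(observed, theoretical):
    """Signed mass deviation in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# C61H94O34, the formula given in the .ms header.
m_neutral = neutral_mass({"C": 61, "H": 94, "O": 34})

# [M+NH4]+ : add N and 4 H, remove one electron for the positive charge.
mz_nh4 = m_neutral + MONO["N"] + 4 * MONO["H"] - ELECTRON

observed_parentmass = 1388.59904067667  # parentmass from the .ms header
error_ppm = ppm_error(observed_parentmass, mz_nh4)

print(f"theoretical [M+NH4]+ m/z: {mz_nh4:.5f}")
print(f"deviation: {error_ppm:.2f} ppm")
```

The deviation comes out below 2 ppm, so the precursor itself sits comfortably inside the `--ppm-max=5` window used in the command. The slow run is therefore unlikely to be caused by the precursor mass being at the edge of the tolerance.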

kaibioinfo commented 3 years ago

Hi,

I think compounds above 1000 Da (and with so many high mass fragments) are too much for the ILP. I would suggest that we compute high mass compounds with the heuristic. Yes, computing the exact solution is nice, but it is ridiculous that 99% of the running time of one analysis is spent on a few high mass compounds :/
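A rough illustration of why high masses get expensive: the number of element combinations that add up to a given mass grows very quickly with that mass, and each candidate explanation of a peak makes the optimization problem larger. The sketch below counts decompositions with a simple coin-change dynamic program. It assumes integer nominal masses for CHNOPS only and applies no mass tolerance and no chemical plausibility filters, so it strongly overstates what a real decomposer would consider. It is not the Sirius algorithm, only an indication of the trend.

```python
# Nominal (integer) masses; an assumption made to keep the example simple.
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16, "P": 31, "S": 32}


def count_decompositions(max_mass, element_masses=tuple(NOMINAL.values())):
    """Return a list where ways[m] is the number of multisets of elements
    whose nominal masses sum to exactly m (classic coin-change counting)."""
    ways = [0] * (max_mass + 1)
    ways[0] = 1  # the empty formula
    for weight in element_masses:
        # Processing one element at a time counts each formula once,
        # regardless of the order in which its atoms are written.
        for mass in range(weight, max_mass + 1):
            ways[mass] += ways[mass - weight]
    return ways


ways = count_decompositions(1400)
for mass in (200, 500, 1000, 1388):
    print(mass, ways[mass])
```

Printing the counts for 200, 500, 1000 and 1388 shows how much faster the candidate space grows than the mass itself, which is consistent with a few high-mass compounds dominating the total running time.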

nirshahaf commented 3 years ago

Hi Kai,

I'm not sure about the mentioned heuristic, but I found a surprising outcome when testing in the NI mode (full spectra below). The NI mode has fewer fragments in this case and I expected Sirius to finish much faster. However, it did not: after roughly an hour of calculating I stopped it and looked at the generated spectra and trees, where I did not find the correct formula (N=10). I then truncated the original .ms file into two versions: one with just the two highest mass fragments (+isotopes) and another with the remaining two lower mass fragments (one without detected isotopes). I then ran Sirius using the same parameters on each truncated input and was glad to notice that in both cases it completed normally within a couple of minutes AND with the correct formula identified either in rank 1 (the lower mass fragments) or rank 2 (the higher mass fragments). I therefore think that there is something funky going on, and that you might want to add some heuristic which, for the >1000 or >900 Da compounds, would truncate the MS2 peaks in a more rational way than I did and leave just enough data for the algorithm to converge efficiently.
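One possible shape for such a truncation, as a minimal sketch: drop peaks below a relative intensity cutoff, then keep only the most intense peaks in each m/z window. The function names, the 1% cutoff, the 100 Da window and the per-window limit are all arbitrary assumptions for illustration, not anything Sirius does internally.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def parse_peaks(flat: str) -> List[Peak]:
    """Turn whitespace-separated 'mz intensity mz intensity ...' text into pairs."""
    values = [float(token) for token in flat.split()]
    if len(values) % 2:
        raise ValueError("expected an even number of values (m/z, intensity pairs)")
    return list(zip(values[0::2], values[1::2]))


def truncate_peaks(
    peaks: List[Peak],
    min_rel_intensity: float = 0.01,  # assumed cutoff: 1% of the base peak
    window: float = 100.0,            # assumed m/z window width
    max_per_window: int = 6,          # assumed number of peaks kept per window
) -> List[Peak]:
    """Keep the strongest peaks per m/z window above a relative intensity cutoff."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    buckets: Dict[int, List[Peak]] = {}
    for mz, intensity in peaks:
        if intensity < min_rel_intensity * base:
            continue  # too weak relative to the base peak
        buckets.setdefault(int(mz // window), []).append((mz, intensity))
    kept: List[Peak] = []
    for bucket in buckets.values():
        bucket.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(bucket[:max_per_window])
    return sorted(kept)  # back to ascending m/z


# A few peaks copied from the positive-mode MS2 list earlier in this issue.
sample = parse_peaks(
    "153.053653994456 2304.45703125 153.388392130322 9.11392211914062 "
    "153.516252004507 12.1139068603516 154.059091006073 249.80126953125 "
    "273.097346889741 1529.84765625"
)
filtered = truncate_peaks(sample)
print(filtered)
```

On this sample the two weak peaks between 153.3 and 153.6 fall under the cutoff and are removed, while the three intense peaks survive. Whether such a filter keeps enough information for correct ranking would need to be checked against spectra with known formulas.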

Here is the full input .ms file (same compound as in the first thread):

compound NP-008069 formula C61H94O34 parentmass 1415.56745241152 ionization [M+FA-H]-

ms1 1369.5620664683 2609.126953125 1370.10416027068 18.6966857910156 1370.56446981945 1801.8505859375 1371.56618519547 844.20751953125 1372.08129882812 21.1581573486328 1372.57047210081 294.842041015625 1373.27502441406 19.445068359375 1373.5701012032 100.971496582031 1373.60245745325 42.7364807128906 1374.27172851562 22.0960540771484 1374.55733493081 31.4545288085938 1374.89514160156 22.106689453125 1375.24062502631 16.9260559082031 1375.55367487289 14.6835479736328 1405.54140239552 70.5264282226562 1406.53388919922 42.8983764648438 1407.52470046546 41.3442077636719 1408.56120447965 28.3544311523438 1415.56745241152 2769.6484375 1416.56816971123 1913.974609375 1417.57561788284 903.10400390625 1418.23605588669 21.3117828369141 1418.57918878837 339.12109375 1418.92863923066 24.0499725341797 1419.56505776432 50.0323791503906 1419.58984662733 113.657287597656 1419.95555610225 24.2431182861328

ms2 723.178848771134 44.1402587890625 911.499147031881 588.01171875 912.502626093085 265.8583984375 913.513163759151 99.551025390625 914.502358631551 29.5118255615234 1140.4976410677 11.6455688476562 1311.52404785156 84.174560546875 1313.58103594598 16.0462646484375 1328.54187990961 113.484313964844 1329.55023359777 59.5953063964844 1330.58057369567 21.9123382568359 1371.18811035156 81.0126342773438 1372.12414550781 40.5063171386719 1372.56727934847 850.6328125 1372.87884447927 81.0126342773438 1373.29724121094 27.4158782958984 1373.57458496094 171.71630859375 1373.89599609375 31.0443572998047 1374.11962890625 27.3267669677734 1374.58142089844 49.1774597167969 1375.13537597656 18.7820587158203 1375.31176757812 17.7598571777344 1376.44311523438 18.0995178222656 1377.12356556204 14.66064453125 1391.51037597656 23.0181121826172

nirshahaf commented 3 years ago

BTW,

If you need to retrain some model on higher mass compounds, I can generate 275 spectra of chemical standards with a mass range between 1001 and 1964 Da. Following Sebastian's comment from a few years ago, I have increased the sensitivity of the peak extraction method, i.e. there are more low-mass putative fragments now.