statisticalbiotechnology / maracluster

Matthew The's implementation of MaRaCluster
Apache License 2.0

usi supported as id #26

Closed ypriverol closed 1 year ago

ypriverol commented 1 year ago

@percolator @MatthewThe

I'm trying to use MaRaCluster to cluster billions of spectra. One problem I found is that we have multiple files, so we would like to use the USI (https://www.nature.com/articles/s41592-021-01184-6) as the identifier of each spectrum in the MGF, and then get the report back from MaRaCluster with the USI instead of the index.

This is what a USI looks like in an MGF:

BEGIN IONS
TITLE=id=mzspec:PXD001924:20140106_52_mlplus_tm3:index:10371,sequence=KWDLGDIVAAR/2
PEPMASS=622.344787597656
CHARGE=2.0+
   595.570  2949.085
   645.291  527.688
   369.346  276.108
   276.277  94.888
  1059.633  35.399
   525.239  212.923
   621.109  8.745
   185.405  132.694
   609.847  2439.161
   800.585  620.666
  1104.713  49.088
   924.684  24.219
   388.354  269.499
   668.601  448.725
   451.616  441.908
   191.218  12.017
   824.108  1919.122
  1083.648  38.183
   444.346  111.252
   744.399  1488.798
  1028.557  160.071
   782.752  429.679
   458.326  405.505
   775.522  269.681
   760.675  420.993
   347.277  213.037
   190.150  55.043
   499.230  1344.165
   509.365  933.466
  1045.479  50.976
   405.413  178.342
   891.637  108.553
   642.355  436.559
   518.248  287.477
   837.390  624.936
   447.333  761.800
   474.377  386.830
   702.498  1712.772
   520.684  696.569
   783.498  811.340
   311.261  85.109
   911.617  517.818
   588.286  3341.009

The id for this spectrum would be mzspec:PXD001924:20140106_52_mlplus_tm3:index:10371.

Do you think you can support this in MaRaCluster?
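To make the requested mapping concrete, here is a minimal sketch of pulling the USI out of a TITLE value like the one in the example. It assumes the `id=<USI>,sequence=<PEPTIDE>/<CHARGE>` layout shown in this MGF and that the USI itself contains no comma; `parse_title` and `TITLE_PATTERN` are hypothetical names used for illustration and are not part of MaRaCluster.

```python
import re

# Assumed TITLE layout, taken from the MGF example in this issue:
#   id=<USI>,sequence=<PEPTIDE>/<CHARGE>
# The ",sequence=..." part is treated as optional.
TITLE_PATTERN = re.compile(
    r"^id=(?P<usi>mzspec:[^,]+)"
    r"(?:,sequence=(?P<sequence>[^/]+)/(?P<charge>\d+))?$"
)


def parse_title(title):
    """Return the USI, peptide sequence and charge found in a TITLE value.

    Raises ValueError if the title does not follow the assumed layout.
    """
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        raise ValueError(f"Unexpected TITLE layout: {title!r}")
    return match.groupdict()


if __name__ == "__main__":
    title = (
        "id=mzspec:PXD001924:20140106_52_mlplus_tm3:index:10371,"
        "sequence=KWDLGDIVAAR/2"
    )
    print(parse_title(title)["usi"])
```

Titles written by other tools may use a different layout, so the pattern would need adjusting for them.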

MatthewThe commented 1 year ago

Thanks for the suggestion!

It would indeed be a nice addition to propagate spectrum identifiers to the output format. I think it shouldn't be too difficult; I will have a look.

MatthewThe commented 1 year ago

Hi Yasset,

I have started working on this. Would it be fine for you if we just return the entire title, e.g. id=mzspec:PXD001924:20140106_52_mlplus_tm3:index:10371,sequence=KWDLGDIVAAR/2? Implementation-wise, this is a bit easier.

ypriverol commented 1 year ago

Go for it.

MatthewThe commented 1 year ago

I added a command line argument --addSpecIds, which now adds the spectrum id/title as a fourth column to the clustering output. I will create a new release once all the builds have passed.
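For anyone consuming the new column, here is a rough sketch of grouping the spectrum ids by cluster. It assumes the clustering output is tab-separated, that the first three columns are file path, scan number and cluster index, and that clusters may be separated by blank lines; only the fourth column (the spectrum id/title added by --addSpecIds) is confirmed in this thread. `read_clusters` is a hypothetical helper, not part of MaRaCluster.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group spectrum ids/titles by cluster index.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumed columns: file path, scan number, cluster index, spectrum id.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            # Skip blank separator lines and rows without a spectrum id.
            continue
        cluster_index = int(fields[2])
        spectrum_id = fields[3]
        clusters[cluster_index].append(spectrum_id)
    return dict(clusters)
```

With a real output file this could be called as `read_clusters(open(path))`, where `path` points to the clustering output written by a run with --addSpecIds.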

ypriverol commented 1 year ago

@MatthewThe @percolator let me know when the release is done.

MatthewThe commented 1 year ago

Sorry for the delay. I have now released version 1.04, which includes the new feature.

ypriverol commented 1 year ago

Thanks, I will give it a try!