MSnbase data processing problem

FaKeTaoT commented 1 year ago

After reading our MGF files with readMgfData() and inspecting the feature data with head(fData(mgf)), we found that the parsed values do not match the contents of the files.

The head(fData(mgf)) gives:

                                                                                                            COMMENT NUM_PEAKS
X1                           Annotation level-1; PlaSMA ID-2558; ID title-Withanone; Max plant tissue-Standard only        10
X10                     Annotation level-1; PlaSMA ID-3304; ID title-Ginsenoside F3; Max plant tissue-Standard only        61
X100  Annotation level-1; PlaSMA ID-2997; ID title-Eriodictyol-7-O-neohesperidoside; Max plant tissue-Standard only        63
X1000                                                                       Unknown (carbon number 6); PlaSMA ID-73  127.0391
X1001                                                                       Unknown (carbon number 6); PlaSMA ID-76  127.0394
X1002                                                                       Unknown (carbon number 6); PlaSMA ID-77  127.0401

However, the actual MGF file contains the following record for PlaSMA ID-73:

BEGIN IONS
FORMULA=C6H6O3
CCS=-1
IONMODE=positive
COMMENT=Annotation level-4; PlaSMA ID-73; ID title-GM_LeafStem_Pos-30; Max plant tissue-MT_Root_Pos
NUM_PEAKS=4
COMPOUND_NAME=Unknown (carbon number 6); PlaSMA ID-73
PRECURSOR_MZ=127.0391
ADDUCT=[M+H]+
COMPOUND_CLASS=Unknown
RETENTION_TIME=3.17
96.95898 19.0
109.97094 20.0
127.03698 42.0
127.04676 26.0
END IONS

This mismatch occurs in several of our MGF files. We think it is caused by how MSnbase parses the data. Thanks.

lgatto commented 1 year ago

Could you please describe what the problem is? You show the first lines of the feature data and the content of an mgf file that isn't part of the first few elements above.

Also, I would suggest you change your pipelines to make use of Spectra and MsBackendMgf. No further development will be done in MSnbase.

FaKeTaoT commented 1 year ago

Dear @lgatto, I would like to provide further details on the issue I mentioned previously, which concerns how readMgfData() handles MGF files whose spectra do not all have the same header fields. For example, consider the following excerpt from an MGF file, in which the first spectrum includes an additional "NAME" field that is not present in subsequent spectra:

BEGIN IONS
NAME=Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N
IONMODE=positive
FORMULA=C17H19NO3
SMILES=C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
INCHIKEY=MXXWOMGUGJBKIW-YPCIICBESA-N

In contrast, the following spectra do not contain the "NAME" field:

BEGIN IONS
IONMODE=positive
FORMULA=C17H19NO3
SMILES=C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
INCHIKEY=MXXWOMGUGJBKIW-YPCIICBESA-N

After processing this file with readMgfData("KI-GIAR_zic-HILIC_Pos_v0.90.mgf") and then calling head(fData(KI_data)) in R, the columns are clearly misaligned: from the second spectrum onwards, values appear under the wrong headers. Here is the output:

> KI_data <- readMgfData("KI-GIAR_zic-HILIC_Pos_v0.90.mgf")
> head(fData(KI_data))
                                          X.NAME   IONMODE                FORMULA                                  SMILES
X1    Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N  positive              C17H19NO3 C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
X10                                     positive   C6H6N2O         CC(=O)c1cnccn1             DBZAKQWXICEWNW-UHFFFAOYSA-N
X100                                    positive  C7H8N4O2 Cn1cnc2c1c(nc(=O)n2C)O             YAPQBXQYLJRXSA-UHFFFAOYSA-N
X1000                                   positive C11H14N2S CN1CCCN=C1/C=C/c1cccs1             YSAUAVHXTIETRK-AATRIKPKSA-N
X1001                                   positive   C5H7NO3 C1CC(=N[C@@H]1C(=O)O)O             ODHCTXKNWHHXJC-VKHMYHEASA-N
X1002                                   positive   C5H7NO3 C1CC(=N[C@@H]1C(=O)O)O             ODHCTXKNWHHXJC-VKHMYHEASA-N
                                                                                    INCHIKEY
X1                                                               MXXWOMGUGJBKIW-YPCIICBESA-N
X10                                          InChI=1S/C6H6N2O/c1-5(9)6-4-7-2-3-8-6/h2-4H,1H3
X100             InChI=1S/C7H8N4O2/c1-10-3-8-5-4(10)6(12)9-7(13)11(5)2/h3H,1-2H3,(H,9,12,13)
X1000 InChI=1S/C11H14N2S/c1-13-8-3-7-12-11(13)6-5-10-4-2-9-14-10/h2,4-6,9H,3,7-8H2,1H3/b6-5+
X1001               InChI=1S/C5H7NO3/c7-4-2-1-3(6-4)5(8)9/h3H,1-2H2,(H,6,7)(H,8,9)/t3-/m0/s1
X1002               InChI=1S/C5H7NO3/c7-4-2-1-3(6-4)5(8)9/h3H,1-2H2,(H,6,7)(H,8,9)/t3-/m0/s1

This pattern of misalignment due to variable column presence recurs throughout the file and seems to affect multiple records. Would you be able to suggest a solution or a workaround to ensure that all data is parsed into the correct columns, regardless of such inconsistencies within the MGF file? Any insight you can provide would be greatly appreciated.

Best, Tony

mgf_sample.zip

lgatto commented 1 year ago

Thank you for the clarification. There is indeed an issue: readMgfData() assumes that all headers are identical, which isn't the case in your example.
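To check a file for this condition before reading it, something along the following lines could be used. This is only a base R sketch, not MSnbase's internal code: the helper names mgf_header_keys() and headers_consistent() are hypothetical, and the two toy records (including their peak values) are invented for illustration. It assumes that every header line contains "=" and that peak lines do not.

```r
# Hypothetical helper (not part of MSnbase): return the header keys of each
# BEGIN IONS ... END IONS block as a list of character vectors.
mgf_header_keys <- function(lines) {
  lines <- trimws(lines)
  begin <- which(lines == "BEGIN IONS")
  end <- which(lines == "END IONS")
  stopifnot(length(begin) == length(end), all(begin < end))
  lapply(seq_along(begin), function(i) {
    block <- lines[begin[i]:end[i]]
    # Assumption: header lines contain "=", peak lines ("mz intensity") do not.
    hdr <- block[grepl("=", block, fixed = TRUE)]
    # Keep only the text before the first "=".
    sub("=.*$", "", hdr)
  })
}

# TRUE if every block has exactly the same keys in the same order.
headers_consistent <- function(keys) {
  if (length(keys) < 2L) return(TRUE)
  all(vapply(keys, identical, logical(1), keys[[1]]))
}

# Toy input: the first record has a NAME field, the second does not.
mgf <- c("BEGIN IONS", "NAME=Piperine", "IONMODE=positive",
         "FORMULA=C17H19NO3", "100.1 5", "END IONS",
         "BEGIN IONS", "IONMODE=positive",
         "FORMULA=C6H6N2O", "100.1 5", "END IONS")
keys <- mgf_header_keys(mgf)
headers_consistent(keys)  # FALSE for these two records
```

For a real file, the lines could be obtained with readLines() on the file path.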

The data seem to be parsed correctly with Spectra and MsBackendMgf, referenced above:

> library(Spectra)
> library(MsBackendMgf)
> f <- "~/Downloads/mgf_sample/KI-GIAR_zic-HILIC_Pos_v0.90.mgf"
> sps <- Spectra(f, source = MsBackendMgf())
Start data import from 1 files ... done
> spectraData(sps)[, c("X.NAME", "IONMODE", "FORMULA")]
DataFrame with 814 rows and 3 columns
                    X.NAME     IONMODE     FORMULA
               <character> <character> <character>
1   Piperine; CE0; MXXWO..    positive   C17H19NO3
2                       NA    positive   C17H19NO3
3                       NA    positive   C17H19NO3
4                       NA    positive   C17H19NO3
5                       NA    positive   C17H19NO3
...                    ...         ...         ...
810                     NA    positive  C14H18N2O5
811                     NA    positive  C14H18N2O5
812                     NA    positive  C14H18N2O5
813                     NA    positive  C14H18N2O5
814                     NA    positive  C14H18N2O5

I'll see what I can do to address this in MSnbase, either fix it, or throw an error, but my advice is to use the more recent infrastructure.
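For illustration, the alignment shown in the Spectra output above (missing fields filled with NA) can be reproduced by matching header fields by name rather than by position. The following is a minimal base R sketch under stated assumptions, not the implementation used by MsBackendMgf or MSnbase: mgf_headers_to_df() is a hypothetical helper, it ignores peak lines entirely, and it assumes each header line has the form KEY=value.

```r
# Hypothetical helper: collect the header fields of every MGF record into a
# data.frame with one column per key; fields absent from a record become NA.
mgf_headers_to_df <- function(lines) {
  lines <- trimws(lines)
  begin <- which(lines == "BEGIN IONS")
  end <- which(lines == "END IONS")
  stopifnot(length(begin) == length(end), all(begin < end))
  records <- lapply(seq_along(begin), function(i) {
    block <- lines[begin[i]:end[i]]
    hdr <- block[grepl("=", block, fixed = TRUE)]
    # Split on the first "=" only: values such as SMILES strings can
    # themselves contain "=".
    pos <- regexpr("=", hdr, fixed = TRUE)
    setNames(substring(hdr, pos + 1L), substring(hdr, 1L, pos - 1L))
  })
  all_keys <- unique(unlist(lapply(records, names)))
  # Indexing by name yields NA for keys a record does not have.
  rows <- lapply(records, function(r) unname(r[all_keys]))
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- all_keys
  out
}

# Toy input based on the excerpt in this thread (peak lines omitted).
mgf <- c("BEGIN IONS",
         "NAME=Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N",
         "IONMODE=positive",
         "SMILES=C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2",
         "END IONS",
         "BEGIN IONS",
         "IONMODE=positive",
         "SMILES=CC(=O)c1cnccn1",
         "END IONS")
df <- mgf_headers_to_df(mgf)
df$NAME  # second element is NA because that record has no NAME field
```

Matching by name keeps each value under its own header even when records differ in which fields they carry, which is the property the positional approach lacks.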

lgatto commented 1 year ago

An error is now thrown when headers aren't consistent (versions 2.29.1 and 2.28.1):

> readMgfData(f)
Error in readMgfData(f) : "Ion headers identical." is not TRUE

lgatto / MSnbase

MSnbase data processing problem #597