Closed FaKeTaoT closed 1 year ago
Could you please describe what the problem is? You show the first lines of the feature data and the content of an mgf file entry that isn't part of the first few elements above.
Also, I would suggest you change your pipelines to make use of Spectra and MsBackendMgf. No further development will be done in MSnbase.
Dear @lgatto, I would like to provide further details on the issue I previously mentioned concerning how the readMgfData function handles MGF files whose spectra do not all share the same header fields. For example, consider the following excerpt from an MGF file, where the first spectrum includes an additional "NAME" column not present in subsequent spectra:
BEGIN IONS
NAME=Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N
IONMODE=positive
FORMULA=C17H19NO3
SMILES=C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
INCHIKEY=MXXWOMGUGJBKIW-YPCIICBESA-N
In contrast, the following spectra do not contain the "NAME" column:
BEGIN IONS
IONMODE=positive
FORMULA=C17H19NO3
SMILES=C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
INCHIKEY=MXXWOMGUGJBKIW-YPCIICBESA-N
After processing this file with readMgfData("KI-GIAR_zic-HILIC_Pos_v0.90.mgf") and subsequently invoking head(fData(KI_data)) in R, it becomes evident that the data alignment is compromised: values end up in the wrong columns. Here is the output:
> KI_data <- readMgfData("KI-GIAR_zic-HILIC_Pos_v0.90.mgf")
> head(fData(KI_data))
X.NAME IONMODE FORMULA SMILES
X1 Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N positive C17H19NO3 C1CCN(CC1)C(=O)/C=C/C=C/c1ccc2c(c1)OCO2
X10 positive C6H6N2O CC(=O)c1cnccn1 DBZAKQWXICEWNW-UHFFFAOYSA-N
X100 positive C7H8N4O2 Cn1cnc2c1c(nc(=O)n2C)O YAPQBXQYLJRXSA-UHFFFAOYSA-N
X1000 positive C11H14N2S CN1CCCN=C1/C=C/c1cccs1 YSAUAVHXTIETRK-AATRIKPKSA-N
X1001 positive C5H7NO3 C1CC(=N[C@@H]1C(=O)O)O ODHCTXKNWHHXJC-VKHMYHEASA-N
X1002 positive C5H7NO3 C1CC(=N[C@@H]1C(=O)O)O ODHCTXKNWHHXJC-VKHMYHEASA-N
INCHIKEY
X1 MXXWOMGUGJBKIW-YPCIICBESA-N
X10 InChI=1S/C6H6N2O/c1-5(9)6-4-7-2-3-8-6/h2-4H,1H3
X100 InChI=1S/C7H8N4O2/c1-10-3-8-5-4(10)6(12)9-7(13)11(5)2/h3H,1-2H3,(H,9,12,13)
X1000 InChI=1S/C11H14N2S/c1-13-8-3-7-12-11(13)6-5-10-4-2-9-14-10/h2,4-6,9H,3,7-8H2,1H3/b6-5+
X1001 InChI=1S/C5H7NO3/c7-4-2-1-3(6-4)5(8)9/h3H,1-2H2,(H,6,7)(H,8,9)/t3-/m0/s1
X1002 InChI=1S/C5H7NO3/c7-4-2-1-3(6-4)5(8)9/h3H,1-2H2,(H,6,7)(H,8,9)/t3-/m0/s1
This pattern of misalignment due to variable column presence recurs throughout the file and seems to affect multiple records. Would you be able to suggest a solution or a workaround to ensure that all data is parsed into the correct columns, regardless of such inconsistencies within the MGF file? Any insight you can provide would be greatly appreciated.
Best, Tony
Thank you for the clarification. There is indeed an issue: readMgfData() assumes that all headers are identical, which isn't the case in your example.
The data seem to be parsed correctly with Spectra and MsBackendMgf, referenced above:
> library(Spectra)
> library(MsBackendMgf)
> f <- "~/Downloads/mgf_sample/KI-GIAR_zic-HILIC_Pos_v0.90.mgf"
> sps <- Spectra(f, source = MsBackendMgf())
Start data import from 1 files ... done
> spectraData(sps)[, c("X.NAME", "IONMODE", "FORMULA")]
DataFrame with 814 rows and 3 columns
X.NAME IONMODE FORMULA
<character> <character> <character>
1 Piperine; CE0; MXXWO.. positive C17H19NO3
2 NA positive C17H19NO3
3 NA positive C17H19NO3
4 NA positive C17H19NO3
5 NA positive C17H19NO3
... ... ... ...
810 NA positive C14H18N2O5
811 NA positive C14H18N2O5
812 NA positive C14H18N2O5
813 NA positive C14H18N2O5
814 NA positive C14H18N2O5
I'll see what I can do to address this in MSnbase, either fix it or throw an error, but my advice is to use the more recent infrastructure.
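To illustrate why matching header fields by key rather than by position avoids the misalignment, here is a minimal base R sketch. It is not how MSnbase or MsBackendMgf are implemented; parse_mgf_headers is a hypothetical helper, and the sketch assumes each spectrum is delimited by "BEGIN IONS" and "END IONS" lines and that header lines have the form KEY=VALUE.

```r
# Hypothetical helper (not part of MSnbase or MsBackendMgf): collect the
# KEY=VALUE header lines of each spectrum and align them by key, so that a
# field missing from a spectrum becomes NA instead of shifting other values.
parse_mgf_headers <- function(lines) {
  lines <- trimws(lines)
  begin <- which(lines == "BEGIN IONS")
  end <- which(lines == "END IONS")
  stopifnot(length(begin) == length(end), all(begin < end))
  # One named character vector of header fields per spectrum
  headers <- Map(function(b, e) {
    block <- lines[seq(b + 1, length.out = e - b - 1)]
    # Header lines start with a letter and contain "=";
    # peak lines start with a digit and are skipped here
    kv <- block[grepl("^[A-Za-z][^=]*=", block)]
    # Split at the first "=" only, since values such as SMILES may contain "="
    setNames(sub("^[^=]*=", "", kv), sub("=.*$", "", kv))
  }, begin, end)
  # Union of all keys seen in the file, in order of first appearance
  all_keys <- unique(unlist(lapply(headers, names)))
  # Indexing by name returns NA for keys a spectrum does not have
  rows <- lapply(headers, function(h) unname(h[all_keys]))
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- all_keys
  out
}

# Shortened version of the example above; peak lines are omitted and the
# "END IONS" lines are assumed. For a real file, use readLines() instead.
mgf <- c(
  "BEGIN IONS",
  "NAME=Piperine; CE0; MXXWOMGUGJBKIW-YPCIICBESA-N",
  "IONMODE=positive",
  "FORMULA=C17H19NO3",
  "END IONS",
  "BEGIN IONS",
  "IONMODE=positive",
  "FORMULA=C17H19NO3",
  "END IONS"
)
hdr <- parse_mgf_headers(mgf)
```

In this sketch, the second spectrum gets NA in the NAME column while IONMODE and FORMULA stay in their own columns, which is the same behaviour as the Spectra output shown above.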
After we used the readMgfData function to read our mgf files and looked at the head of the data with fData(mgf), we found a mismatch between the parsed feature data and the actual data in the files.
The head(fData(mgf)) gives:
However, the actual mgf file has the correct data for PlaSMA ID-73 as:
This mismatch occurs when processing several of our mgf files. We think it's a problem with how MSnbase processes the data. Thanks.
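One way to check whether a given file is affected is to compare the header keys of its spectra before reading it. The following is a rough base R sketch; mgf_header_keys is a hypothetical helper, not an MSnbase function, and it assumes header lines have the form KEY=VALUE and that every spectrum starts with a "BEGIN IONS" line.

```r
# Hypothetical helper: list the header keys found in each spectrum.
mgf_header_keys <- function(lines) {
  lines <- trimws(lines)
  # Spectrum index of each line: increases by one at every "BEGIN IONS".
  # Lines before the first spectrum get index 0 and are ignored.
  spectrum <- cumsum(lines == "BEGIN IONS")
  is_header <- spectrum > 0 & grepl("^[A-Za-z][^=]*=", lines)
  keys <- sub("=.*$", "", lines[is_header])
  # Keep one list element per spectrum, even if it has no header lines
  split(keys, factor(spectrum[is_header], levels = seq_len(max(spectrum))))
}

# Toy input for illustration; for a real file, use readLines() instead.
mgf <- c(
  "BEGIN IONS", "NAME=Piperine", "IONMODE=positive", "END IONS",
  "BEGIN IONS", "IONMODE=positive", "END IONS"
)
keys <- mgf_header_keys(mgf)
# TRUE only if every spectrum has exactly the same set of header keys
uniform <- length(unique(lapply(keys, sort))) == 1
```

If uniform is FALSE, the spectra do not all share the same header fields, so a parser that assigns values to columns by position can misalign them.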