Bug in metadata reading and spectral matching

hechth commented 8 months ago

When reading spectra from the files in the attached archive, multiple things go wrong.

Firstly, some metadata is not read correctly (values are missing and inserted as NA), and individual entries end up in the wrong places: the InChIKey from spectrum 2 is assigned to spectrum 1, and the InChIKey of spectrum 2 is then NA.

new("Spectra", backend = new("MsBackendMsp", spectraData = new("DFrame", 
    rownames = NULL, nrows = 2L, elementType = "ANY", elementMetadata = NULL, 
    metadata = list(), listData = list(name = c("2,2',3,4',5,5'-Hexachloro-4-methoxybiphenyl", 
    "Pendimethalin"), RETENTION_TIME = c("None", "None"), RETENTION_INDEX = c("2554.1", 
    "2044.6"), PRECURSOR_MZ = c("387.85245", "281.13574"), ADDUCT = c("[M]+", 
    "[M]+"), COLLISION_ENERGY = c("70eV", "70eV"), INSTRUMENT_TYPE = c("GC-EI-Orbitrap", 
    "GC-EI-Orbitrap"), NUM.PEAKS = c("164", "86"), SCANNUMBER = c("-1", 
    NA), SPECTRUMTYPE = c("Centroid", NA), formula = c("C13H19N3O4", 
    NA), inchikey = c("CHIFOSRWCNZCFN-UHFFFAOYSA-N", NA), smiles = c("CCC(CC)NC1=C(C=C(C(=C1[N+](=O)[O-])C)C)[N+](=O)[O-]", 
    NA), AUTHORS = c("Price et al., RECETOX, Masaryk University (CZ)", 
    NA), instrument = c("Q Exactive GC Orbitrap GC-MS/MS", NA
    ), IONIZATION = c("EI+", NA), LICENSE = c("CC BY-NC", NA), 
        mz = new("SimpleNumericList", elementType = "numeric", 
            elementMetadata = NULL, metadata = list(), listData = list(

Also, during matching, not all scores are calculated or they are listed as NA.

The code used for matching is the following:

# reference_file, simulated_file and ppm are assumed to be defined beforehand
library(Spectra)

data_reference <- Spectra(reference_file, source = MsBackendMsp::MsBackendMsp())
data_simulated <- Spectra(simulated_file, source = MsBackendMsp::MsBackendMsp())

# Define match parameters
match_param <- MetaboAnnotation::MatchForwardReverseParam(
  requirePrecursor = FALSE,
  ppm = ppm, 
  FUN = MsCoreUtils::ndotproduct, 
  THRESHFUN = function(x) which(x >= 0.0), 
  THRESHFUN_REVERSE = function(x) which(x >= 0.0)
)

# Perform matching
matched_spectra <- MetaboAnnotation::matchSpectra(data_simulated, data_reference, match_param)

# Convert matched spectra to data frame
matched_spectra_df <- spectraData(matched_spectra, columns = c("name", "target_name", "reverse_score", "score", "presence_ratio", "matched_peaks_count"))
matched_spectra_df <- as.data.frame(matched_spectra_df)

Also, to actually get the 0 scores, the threshold functions have to be extended with | TRUE, because 0 scores seem to be represented as NA.
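To illustrate why appending | TRUE changes the result, here is a minimal base R sketch. The score vector is made up, and whether MetaboAnnotation really stores 0 scores as NA is only the assumption stated above, not something this snippet confirms.

```r
# Hypothetical score vector: one real score, one zero, one missing value.
scores <- c(0.8, 0, NA)

# Threshold as used above: NA >= 0 is NA, and which() drops NA entries.
thresh <- function(x) which(x >= 0.0)
kept <- thresh(scores)

# Workaround: NA | TRUE evaluates to TRUE, so every index is kept.
thresh_keep_all <- function(x) which(x >= 0.0 | TRUE)
kept_all <- thresh_keep_all(scores)
```

With this input, the first function returns only the indices of the non-missing scores, while the second returns all indices, including the one holding NA.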

problematic.zip

jorainer commented 8 months ago

OK, so it seems there are several problems. I will look into it, thanks for reporting.

hechth commented 8 months ago

The bug with the metadata reading and missing scores is very bizarre and I also have no idea. Are spectra somehow read in batches or so?

jorainer commented 8 months ago

Looks like we have problems with the msp files you provided. Are these in "standard" format? I have trouble finding a proper definition of the file format. NIST, however, defines that each spectrum has to start with NAME: - in your case the NAME field is not the first line per spectrum, so all elements before that line get assigned to the previous spectrum. I could add a fix for that by splitting on empty lines instead of NAME elements.
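The difference between the two splitting strategies can be sketched in base R. This is a simplified illustration with made-up MSP lines and hypothetical helper names (split_by_name, split_by_blank); it is not the actual MsBackendMsp code.

```r
# Hypothetical MSP content in which NAME is not the first field of a record.
msp_lines <- c(
  "SCANNUMBER: -1",
  "NAME: Spectrum A",
  "INCHIKEY: AAA",
  "",
  "SCANNUMBER: 7",
  "NAME: Spectrum B",
  "INCHIKEY: BBB"
)

# Strategy 1: a new record starts at every NAME line.
split_by_name <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]
  record_id <- cumsum(grepl("^NAME:", lines, ignore.case = TRUE))
  unname(split(lines, record_id))
}

# Strategy 2: a new record starts after every empty line.
split_by_blank <- function(lines) {
  is_blank <- !nzchar(trimws(lines))
  record_id <- cumsum(is_blank)
  unname(split(lines[!is_blank], record_id[!is_blank]))
}

by_name <- split_by_name(msp_lines)
by_blank <- split_by_blank(msp_lines)
```

With NAME-based splitting, the "SCANNUMBER: 7" line of the second spectrum ends up in the record of the first one, which matches the shifted metadata described in the report. Splitting on empty lines yields two records with three fields each, independent of field order.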

The other problem is the peak list - you have in addition to the 2 elements per row (m/z and intensity) also sometimes a third element with annotation. That is at present not properly handled. I could add support for that, but would be nice to have some reference/format definition.
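One possible way to handle an optional third element per peak row is sketched below in base R. The peak lines, the quoting of the annotation and the helper name parse_peaks are assumptions made for illustration; the files in the archive may look different, and this is not how MsBackendMsp parses peaks.

```r
# Hypothetical peak rows: m/z, intensity and an optional annotation.
peak_lines <- c(
  "50.0151 1200",
  "77.0386 35000 \"C6H5\"",
  "105.0335 9800"
)

parse_peaks <- function(lines) {
  # Capture the first two whitespace-separated fields and keep the rest
  # of the line (possibly empty) as the annotation.
  pattern <- "^[[:space:]]*([^[:space:]]+)[[:space:]]+([^[:space:]]+)[[:space:]]*(.*)$"
  fields <- do.call(rbind, regmatches(lines, regexec(pattern, lines)))
  # Strip surrounding double quotes and mark missing annotations as NA.
  annotation <- gsub("^\"|\"$", "", trimws(fields[, 4]))
  annotation[!nzchar(annotation)] <- NA_character_
  data.frame(
    mz = as.numeric(fields[, 2]),
    intensity = as.numeric(fields[, 3]),
    annotation = annotation,
    stringsAsFactors = FALSE
  )
}

peaks <- parse_peaks(peak_lines)
```

Keeping the annotation as a separate column leaves the m/z and intensity values numeric, so similarity calculations are not affected by rows that carry extra text.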

It could well be that the problem you see later with the scores is related to the peak values not being handled correctly.

hechth commented 8 months ago

There is no proper definition of the MSP file format :D

I would advise against trying to fix it because you will run into the same issues as we do with matchms where you have to support a million flavours of MSP - maybe I can just convert the spectra to NIST format and then force NAME to be the first row on NIST - how does spectra deal with it if there is no NAME present?

jorainer commented 8 months ago

if there is no NAME it will consider the full content of a msp file as being one single spectrum... we're essentially splitting by NAME. but the thing is we could split by whitespace instead. which would then not require any specific order of elements.

A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)

hechth commented 8 months ago

> if there is no NAME it will consider the full content of a msp file as being one single spectrum... we're essentially splitting by NAME. but the thing is we could split by whitespace instead. which would then not require any specific order of elements.
>
> A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)

I'm not sure if this makes sense. I'd rather see an R implementation of mzSpecLib and abandon MSP files altogether - nothing is standardized etc. - we can remove the comments with matchms, that is already implemented - so overall, I will try to minimize the msp, remove comments and switch to NIST format.

jorainer commented 8 months ago

maybe mgf would be more standardized as an alternative?

hechth commented 8 months ago

Yeah this is also something to try

jorainer commented 8 months ago

Anyway. We need to at least throw an error or similar if we encounter an unexpected MSP format.
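A rough sketch of what such a check could look like in base R follows. The function name check_msp_record and the rules it enforces (NAME first, one "Num Peaks" field, one peak per row with exactly two columns) are assumptions for illustration only, not the behaviour of MsBackendMsp.

```r
# Hypothetical validator: stop early when a record deviates from the
# layout assumed here (NAME first, two-column peak rows after Num Peaks).
check_msp_record <- function(record) {
  if (!grepl("^NAME:", record[1], ignore.case = TRUE))
    stop("Unexpected MSP format: record does not start with NAME")
  num_peaks_idx <- grep("^NUM ?PEAKS:", record, ignore.case = TRUE)
  if (length(num_peaks_idx) != 1L)
    stop("Unexpected MSP format: expected exactly one 'Num Peaks' field")
  # Everything after the Num Peaks line is treated as peak rows.
  peak_rows <- record[seq_along(record) > num_peaks_idx]
  n_fields <- lengths(strsplit(trimws(peak_rows), "[[:space:]]+"))
  if (any(n_fields != 2L))
    stop("Unexpected MSP format: peak rows must have two columns")
  invisible(TRUE)
}

# Usage: a well-formed record passes, a record with NAME in the wrong
# position raises an error that can be caught and reported.
ok <- check_msp_record(c("NAME: A", "Num Peaks: 2", "50 10", "60 20"))
msg <- tryCatch(
  check_msp_record(c("SCANNUMBER: 1", "NAME: A", "Num Peaks: 1", "50 10")),
  error = function(e) conditionMessage(e)
)
```

Failing loudly at import time would make problems like the shifted metadata visible immediately, instead of surfacing later as NA values or missing scores.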

jorainer commented 8 months ago

I have a PR in MsBackendMsp that fixes the issues we have with your MSP files. With that new version it would be possible to properly handle and read your files. PR: https://github.com/rformassspectrometry/MsBackendMsp/pull/15

rformassspectrometry / MetaboAnnotation

Bug in metadata reading and spectral matching #109