meowcat / MSnio


Common function to map from input fields to common fields #6

Open jorainer opened 5 years ago

jorainer commented 5 years ago

Please correct me if I got this wrong: the idea is to map data from different input sources to a commonly agreed set of fields and to an object that can hold this data. So the workflow would be: 1) read the input file, 2) map the names from the input file to the commonly accepted names, 3) put the result into a result object.

So, 1) would be an input-type-specific function, and its result should be a named list of the file's elements. 2) uses the schema for the mapping, hence this could be a single function for all parsers, right? 3) would also be a single function, as I see it.
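The three steps above can be sketched roughly as follows. This is only an illustration in Python, not MSnio code: the schema contents, the `Record` class, and all function names are hypothetical, and the parser assumes a simple `Key: value` text layout.

```python
from dataclasses import dataclass, field

# Hypothetical schema: input field name -> common field name.
MSP_SCHEMA = {
    "Name": "name",
    "PrecursorMZ": "precursor_mz",
    "Instrument": "instrument",
}


@dataclass
class Record:
    """Hypothetical result object holding fields under their common names."""
    fields: dict = field(default_factory=dict)


def read_msp_like(text):
    """Step 1 (format specific): parse 'Key: value' lines into a named list."""
    raw = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            raw[key.strip()] = value.strip()
    return raw


def map_fields(raw, schema):
    """Step 2 (shared by all parsers): rename input fields to common names.

    Fields that the schema does not know are dropped here.
    """
    return {schema[key]: value for key, value in raw.items() if key in schema}


def to_record(common):
    """Step 3 (shared): wrap the mapped fields in the result object."""
    return Record(fields=dict(common))


text = "Name: Caffeine\nPrecursorMZ: 195.0877\nFoo: bar"
record = to_record(map_fields(read_msp_like(text), MSP_SCHEMA))
```

Only `read_msp_like` would need to be written per input format; `map_fields` and `to_record` stay the same and differ only in the schema they are given.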

meowcat commented 5 years ago

Input and output, importantly. Otherwise I think you are correct.

I actually envisioned it slightly differently: 1) read the input file, 2) map it into a result (Spectrum/Spectra/...) object using the corresponding format's nomenclature/system/hierarchy, 3) map the Spectrum/Spectra/result object with custom names to a Spectrum/Spectra/result object with common names. Your workflow is more consistent, because mine requires processing the actual peaks separately from, and before, all other information. It removes an intermediate that I think of as useful, but maybe I can figure out how to work without it.

jorainer commented 5 years ago

Do you already have a function that converts the names provided by the input file to the common names using the schema?

I think that function will be a key one that we need - it should also be fast, if possible.

meowcat commented 5 years ago

We are not progressing as quickly here, unfortunately, since I have to fit this work into my regular work somehow. Also, my first implementation will certainly not be a fast one.

jorainer commented 5 years ago

No prob. Was not sure if I just overlooked that one.

Treutler commented 5 years ago

Please keep in mind that there are multiple field names for the same value in the case of (at least) NIST .msp and Bruker .library. E.g. the instrument in the NIST*.msp format can be

I encoded this in the table as Instrument / Synon: $:07 / Comments: instrument. Accordingly, we have to (i) support these different flavors for the import and (ii) decide which flavor to export.
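For point (i), one possible approach is a reverse lookup built from the table, so that every flavor resolves to the same common field on import. The sketch below is a Python illustration only; the `FLAVORS` table layout and the function names are assumptions, and the flavor strings are copied from the table entry quoted above.

```python
# Hypothetical table: common field -> every input spelling that carries it.
FLAVORS = {
    "instrument": ["Instrument", "Synon: $:07", "Comments: instrument"],
}


def build_import_lookup(flavors):
    """Invert the table so each input spelling points to its common field."""
    lookup = {}
    for common, names in flavors.items():
        for name in names:
            # The same spelling must not map to two different common fields.
            if lookup.get(name, common) != common:
                raise ValueError(f"ambiguous input field: {name!r}")
            lookup[name] = common
    return lookup


IMPORT_LOOKUP = build_import_lookup(FLAVORS)


def common_name(input_name):
    """Return the common field for an input spelling, or None if unknown."""
    return IMPORT_LOOKUP.get(input_name)
```

Building the lookup once and reusing it keeps the per-field cost at a single dictionary access.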

meowcat commented 5 years ago

Accordingly, we have to (i) support these different flavors for the import and (ii) decide which flavor to export.

(i) could be feasible by doing something like this:

- field: Synon
  node:
    - field: $:07
      map_read: instrument

or map: instrument, type: readonly. There will also be cases of nested mapping, where a sub-entry in one record format is a top-level entry in the common schema (e.g. possibly INCHIKEY, depending on how we define it).
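To make the nested, read-only idea more concrete, here is a rough Python sketch. It is not an existing implementation: the schema is written as plain Python data mirroring the `field` / `node` / `map_read` keys above, `map` is assumed to mean a two-way mapping, the `$:07` code is taken from the table entry quoted earlier, and the parsed input is assumed to arrive as a nested dictionary.

```python
# Hypothetical schema: "map" is read and written, "map_read" is import only,
# "node" holds sub-entries of a field.
SCHEMA = [
    {"field": "Instrument", "map": "instrument"},
    {"field": "Synon", "node": [{"field": "$:07", "map_read": "instrument"}]},
]


def read_common(raw, schema):
    """Import: collect common fields, descending into nested nodes."""
    common = {}
    for entry in schema:
        if entry["field"] not in raw:
            continue
        value = raw[entry["field"]]
        if "node" in entry:
            # Sub-entries of this format become top-level common fields.
            for key, val in read_common(value, entry["node"]).items():
                common.setdefault(key, val)
        else:
            target = entry.get("map") or entry.get("map_read")
            if target:
                # First matching schema entry wins if several flavors occur.
                common.setdefault(target, value)
    return common


def write_raw(common, schema):
    """Export: only two-way "map" entries are written, "map_read" is skipped."""
    raw = {}
    for entry in schema:
        if "node" in entry:
            nested = write_raw(common, entry["node"])
            if nested:
                raw[entry["field"]] = nested
        elif entry.get("map") in common:
            raw[entry["field"]] = common[entry["map"]]
    return raw
```

With this layout, a record that only carries the instrument under `Synon` is still read correctly, while export always produces the two-way field.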

(ii) I guess every schema needs to choose a canonical export format.

Treutler commented 5 years ago

(ii) I guess every schema needs to choose a canonical export format.

Agreed. I adjusted the fields in the table so that the first field is meant to be the canonical export format.
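The "first field is canonical" convention could be used on export roughly like this. Again, this is a Python sketch under assumptions: the `FLAVORS` table layout and the function names are hypothetical, with the flavor order taken from the instrument example earlier in the thread.

```python
# Hypothetical table: common field -> input spellings, canonical one first.
FLAVORS = {
    "instrument": ["Instrument", "Synon: $:07", "Comments: instrument"],
}


def export_name(common, flavors):
    """Return the canonical export spelling, i.e. the first listed flavor."""
    return flavors[common][0]


def export_record(common_fields, flavors):
    """Rename common fields to canonical export names; skip unknown fields."""
    return {
        export_name(key, flavors): value
        for key, value in common_fields.items()
        if key in flavors
    }
```

Because the order of the table carries the meaning, reordering a row changes what gets exported, so the convention would need to be documented next to the table.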