meowcat / MSnio

Common schema format #7

Closed jorainer closed 5 years ago

jorainer commented 5 years ago

Hi all,

I just wondered whether it would make sense to define the schemas differently. Currently, schema_massbank_auto.yaml is defined as:

metadata:
- field: ACCESSION
  map: accession
- field: RECORD_TITLE
  map: title
- field: DATE
  map: record_date
- field: AUTHORS
  map: authors
- field: LICENSE
  map: license
- field: COPYRIGHT
  map: copyright
- field: PUBLICATION
  map: publication

And fields.yaml as:

- field: accession
  format: string
  cardinality: '1'
- field: title
  format: string
  cardinality: '1'
  ontology: MS:1000796
- field: record_date
  format: date
  cardinality: '1'

Wouldn't it be more intuitive (and easier later for the mapping) to have it in the form:

metadata:
- field: accession
  original: ACCESSION
- field: title
  original: RECORD_TITLE

It's not a big thing, but I guess that for new users (or new schema definitions) it might be easier to just copy fields.yaml and add an original value to each entry.
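For the flat entries, the two layouts carry the same information, so one can be derived from the other. A minimal sketch in Python, assuming the YAML has already been loaded into plain lists and dicts; the helper name `invert_entries` is hypothetical and not part of MSnio:

```python
# Parsed form of a few flat entries from schema_massbank_auto.yaml,
# written as the lists/dicts a YAML loader would typically return.
schema = {
    "metadata": [
        {"field": "ACCESSION", "map": "accession"},
        {"field": "RECORD_TITLE", "map": "title"},
        {"field": "DATE", "map": "record_date"},
    ]
}


def invert_entries(entries):
    """Hypothetical helper: rewrite {field: ORIGINAL, map: internal}
    entries as {field: internal, original: ORIGINAL}."""
    return [{"field": e["map"], "original": e["field"]} for e in entries]


inverted = {"metadata": invert_entries(schema["metadata"])}

for entry in inverted["metadata"]:
    print(entry)
```

This only covers entries that have a `map` key and no nesting.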

meowcat commented 5 years ago

Currently, schema_massbank_auto.yaml is organized as a blueprint for the parser: its structure follows the structure of the MassBank record. See the node entries, e.g.:

- field: AC$MASS_SPECTROMETRY
  rule: block
  node:
  - field: MS_TYPE
    map: ms_level
  - field: ION_MODE
    map: ion_mode
  - field: COLLISION_ENERGY
    map: collision_energy

This is more intuitive for writing the record specification, at least for text-format records. We will have to see whether such a specification also works for, e.g., Agilent CEF, which is XML-based. I think it should, but I'm not sure.
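To illustrate how a parser could follow such a blueprint, here is a rough Python sketch. It assumes that `rule: block` means "lines starting with this tag carry a subtag followed by a value", which is my reading of the MassBank text format and not something the schema syntax has fixed yet. The function name `parse_record` and the accession value are made up for the example:

```python
# Schema as it would look after loading the YAML (plain lists and dicts).
schema = [
    {"field": "ACCESSION", "map": "accession"},
    {
        "field": "AC$MASS_SPECTROMETRY",
        "rule": "block",
        "node": [
            {"field": "MS_TYPE", "map": "ms_level"},
            {"field": "ION_MODE", "map": "ion_mode"},
        ],
    },
]

# Tiny made-up record fragment in the MassBank text layout.
record = """ACCESSION: XX000001
AC$MASS_SPECTROMETRY: MS_TYPE MS2
AC$MASS_SPECTROMETRY: ION_MODE POSITIVE
"""


def parse_record(text, entries):
    """Hypothetical parser: map record lines to internal field names."""
    flat = {e["field"]: e["map"] for e in entries if "map" in e}
    blocks = {
        e["field"]: {c["field"]: c["map"] for c in e["node"]}
        for e in entries
        if e.get("rule") == "block"
    }
    out = {}
    for line in text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "TAG: value" line
        if tag in flat:
            out[flat[tag]] = value
        elif tag in blocks:
            # Assumed block rule: first token is the subtag, rest is the value.
            subtag, _, subvalue = value.partition(" ")
            if subtag in blocks[tag]:
                out[blocks[tag][subtag]] = subvalue
    return out


parsed = parse_record(record, schema)
print(parsed)
```

The parser itself only knows what "block" means; which tags exist and where they map to comes entirely from the schema.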

If we turn it the other way round, we would instead specify:

metadata:
  - field: ms_level
    original: MS_TYPE
    parent: AC$MASS_SPECTROMETRY

But then we would have to specify a dummy AC$MASS_SPECTROMETRY entry somewhere, since this is just a node that isn't really mapped to a field, and we would have to figure out how the record has to be ordered. Because of that, I think the current way of specifying it has some advantages.
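A small Python sketch of that conversion shows what gets lost. It assumes the nested schema is already loaded into lists and dicts, and `flatten` is a hypothetical helper name:

```python
nested = [
    {
        "field": "AC$MASS_SPECTROMETRY",
        "rule": "block",
        "node": [
            {"field": "MS_TYPE", "map": "ms_level"},
            {"field": "ION_MODE", "map": "ion_mode"},
            {"field": "COLLISION_ENERGY", "map": "collision_energy"},
        ],
    }
]


def flatten(entries, parent=None):
    """Hypothetical helper: turn the nested blueprint into flat
    field/original/parent rows."""
    rows = []
    for e in entries:
        if "node" in e:
            # The block node has no "map" of its own; only its children survive.
            rows.extend(flatten(e["node"], parent=e["field"]))
        elif "map" in e:
            row = {"field": e["map"], "original": e["field"]}
            if parent is not None:
                row["parent"] = parent
            rows.append(row)
    return rows


rows = flatten(nested)
for row in rows:
    print(row)
```

The flat rows keep the parent name, but the `rule: block` attribute of AC$MASS_SPECTROMETRY and the position of the block within the record are no longer stated anywhere, so they would have to be specified separately.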

However, schema_massbank_auto.yaml is still a first sketch (not just the MassBank specification, but also the syntax), and fields.yaml is still completely unused.

jorainer commented 5 years ago

OK, totally fine.

meowcat commented 5 years ago

But you are highlighting a possible issue. schema_massbank_auto.yaml currently does two things: 1) it defines the structure of the record, which is used by your "function 1" (#6), and 2) it defines the mapping, which is used by your "function 2". Do you think this is good, or should these be separated?

My goal was to define most of the record structure in the schema, so that the parser would be as free as possible from "business logic". Still, any schema (containing "rules") will only work with a parser that understands these rules. It would be cool if it works out well, but maybe it won't.
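If the two roles were separated, one option would be to derive both parts from the current file. A rough Python sketch, assuming the schema is loaded into lists and dicts and that original field names are unique across the record (which may not hold in general); `split_schema` is a hypothetical name:

```python
schema = [
    {"field": "ACCESSION", "map": "accession"},
    {
        "field": "AC$MASS_SPECTROMETRY",
        "rule": "block",
        "node": [
            {"field": "MS_TYPE", "map": "ms_level"},
            {"field": "ION_MODE", "map": "ion_mode"},
        ],
    },
]


def split_schema(entries):
    """Hypothetical helper: return (structure, mapping), where structure
    keeps field/rule/node and mapping is {original: internal}."""
    structure, mapping = [], {}
    for e in entries:
        # Keep everything except the mapping target and the children.
        item = {k: v for k, v in e.items() if k not in ("map", "node")}
        if "map" in e:
            mapping[e["field"]] = e["map"]
        if "node" in e:
            child_structure, child_mapping = split_schema(e["node"])
            item["node"] = child_structure
            mapping.update(child_mapping)  # assumes unique original names
        structure.append(item)
    return structure, mapping


structure, mapping = split_schema(schema)
print(structure)
print(mapping)
```

Here `structure` would be what "function 1" needs and `mapping` what "function 2" needs.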

jorainer commented 5 years ago

I just realized that myself. That's why I need to implement an importer myself, to understand what's going on and how best to achieve what I want.