reflectivity / file_format


MAINT: fix validation against test_example.ort #10

Closed andyfaff closed 2 years ago

andyfaff commented 2 years ago

@bmaranville this updated schema is able to validate the test_example.ort at https://www.reflectometry.org/projects/file_formats/tasks/ws_2021-06_text/.

The YAML schema has not been updated.

Note that there are several outstanding aspects that are missing from the schema:

andyfaff commented 2 years ago

I've added a bit to validate the columns, but it's a simple array of name, unit, dimension. I don't know how to specify 4 or more columns with the first 4 columns being Qz/R/[sR/[sQz]]

bmaranville commented 2 years ago

> I've added a bit to validate the columns, but it's a simple array of name, unit, dimension. I don't know how to specify 4 or more columns with the first 4 columns being Qz/R/[sR/[sQz]]

This gets pretty complicated with Python typing: I think you can specify the 2, 3, or 4 column case but it's hard (not possible?) to specify that there can be 5 or more columns with the first four fixed and any additional ones free. (see https://github.com/python/typing/issues/692)

It's an interesting case where JSON Schema supports this just fine and it is straightforward to write: you can give a subschema for each item in an array, and you can add an "additionalItems" specification requiring that all extra items after those be "string" (or meet any other constraint). https://json-schema.org/understanding-json-schema/reference/array.html#additional-items
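As a rough sketch of that JSON Schema approach, assuming draft-07 style keywords and column names given as plain strings: the schema is written as a Python dict, and the names `COLUMN_NAMES_SCHEMA` and `check_column_names` are made up for illustration. The hand-rolled checker only mirrors the few keywords used here; a real validator library would normally do this job.

```python
# Draft-07 style "tuple validation": one subschema per leading position,
# and "additionalItems" constraining everything after them.
COLUMN_NAMES_SCHEMA = {
    "type": "array",
    "minItems": 2,
    "items": [
        {"const": "Qz"},
        {"const": "R"},
        {"const": "sR"},
        {"const": "sQz"},
    ],
    "additionalItems": {"type": "string"},
}


def check_column_names(names, schema=COLUMN_NAMES_SCHEMA):
    """Illustrative check that mirrors only the keywords used in the schema above.

    This is not a JSON Schema implementation.
    """
    if not isinstance(names, list) or len(names) < schema["minItems"]:
        return False
    fixed = schema["items"]
    for position, name in enumerate(names):
        if position < len(fixed):
            # Leading positions must match the fixed name for that slot.
            if name != fixed[position]["const"]:
                return False
        elif not isinstance(name, str):
            # Extra columns are free, as long as they are strings.
            return False
    return True
```

With this shape, the 2-, 3-, and 4-column cases pass, and so does any longer list whose first four names are fixed and whose remaining names are arbitrary strings.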

Python type for 2, 3 or 4 columns:

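from typing import Literal, Tuple, Union

# Column names for the 2-, 3-, and 4-column cases (the alias name is illustrative).
ColumnNames = Union[
    Tuple[Literal["Qz"], Literal["R"]],
    Tuple[Literal["Qz"], Literal["R"], Literal["sR"]],
    Tuple[Literal["Qz"], Literal["R"], Literal["sR"], Literal["sQz"]],
]

andyfaff commented 2 years ago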

Gah, complicated. I like the idea of generating the schema from the class structure, it's just that much more robust to start with. I've gone with:

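from dataclasses import dataclass
from typing import Literal, Optional, Tuple

# column, Creator, DataSource and Reduction are assumed to be defined elsewhere.

@dataclass
class qz_column(column):
    name: Literal['Qz']

@dataclass
class R_column(column):
    name: Literal['R']

@dataclass
class ORSOHeader:
    creator: Creator
    data_source: DataSource
    # Exactly four columns: Qz, R, then two unconstrained columns.
    columns: Tuple[qz_column, R_column, column, column]
    reduction: Optional[Reduction] = None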

If this isn't going to work in the long run then we can always patch the schema at a later date. We'll probably need subschemas anyway because there can be multiple datasets within a single file, so we'll need a schema solely for dataset+data_source.
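To make concrete what a fixed-length `Tuple[qz_column, R_column, column, column]` annotation pins down, here is a minimal sketch. The `column` fields are guessed from "name, unit, dimension"; `Header` is a cut-down, hypothetical stand-in for `ORSOHeader`; and `fits_annotation` is an invented helper, not part of any library.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple, get_args, get_type_hints


@dataclass
class column:  # hypothetical stand-in for the real column class
    name: str
    unit: Optional[str] = None
    dimension: Optional[str] = None


@dataclass
class qz_column(column):
    name: Literal["Qz"]


@dataclass
class R_column(column):
    name: Literal["R"]


@dataclass
class Header:  # hypothetical: ORSOHeader reduced to its columns field
    columns: Tuple[qz_column, R_column, column, column]


def fits_annotation(columns) -> bool:
    """Compare column objects with the Header.columns annotation, slot by slot.

    Only the class of each column is checked; dataclasses do not enforce
    the Literal annotation on `name` at runtime.
    """
    expected = get_args(get_type_hints(Header)["columns"])
    if len(columns) != len(expected):
        return False  # a fixed-length Tuple leaves no room for extra columns
    return all(isinstance(col, col_type) for col, col_type in zip(columns, expected))


four = (qz_column("Qz"), R_column("R"), column("sR"), column("sQz"))
five = four + (column("extra"),)
```

Here `fits_annotation(four)` holds while `fits_annotation(five)` does not, which is the "5 or more columns" case the annotation cannot describe and where a later patch to the schema would come in.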