sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Reformat schema JSON #123

Closed jeromekelleher closed 7 months ago

jeromekelleher commented 7 months ago

Currently the schema JSON looks like this:

{
    "format_version": "0.2",
    "samples_chunk_size": 1000,
    "variants_chunk_size": 10000,
    "dimensions": [
        "variants",
        "samples",
        "ploidy",
        "alleles",
        "filters"
    ],
    "sample_id": [
        "NA00001",
        "NA00002",
        "NA00003"
    ],
    "contig_id": [
        "19",
        "20",
        "X"
    ],
    "contig_length": null,
    "filter_id": [
        "PASS",
        "s50",
        "q10"
    ],

However, this loses information such as filter descriptions, and is in any case an odd way to structure a JSON document. Move to:

"format_version": "0.2",
"samples_chunk_size": 1000,
"variants_chunk_size": 10000,
"dimensions": [
    "variants",
    "samples",
    "ploidy",
    "alleles",
    "filters"
],
"samples": [
    {"id": "NA00001"},
    {"id": "NA00002"},
    {"id": "NA00003"}
],
"contigs": [
    {"id": "19"},
    {"id": "20"},
    {"id": "X"}
],
"filters": [
    {"id": "PASS", "description": ""},
    {"id": "s50", "description": "something"},
    {"id": "q10", "description": "something else"}
],

The ICF metadata format will probably need a corresponding update.

No major hurry with this, but we should do it before an initial public release.
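For illustration, here is a minimal sketch of how a schema dictionary in the current layout could be rewritten into the proposed one. The function name `reformat_schema`, the `filter_descriptions` argument, and the `length` key on contig records are assumptions made for this example, not part of bio2zarr. The current schema does not store filter descriptions, so they have to be supplied from elsewhere (for instance the VCF header).

```python
def reformat_schema(old, filter_descriptions=None):
    """Return a copy of ``old`` using lists of records instead of parallel ID lists.

    ``filter_descriptions`` is a hypothetical mapping of filter ID to description;
    filters without an entry get an empty description.
    """
    descriptions = filter_descriptions or {}
    replaced_keys = {"sample_id", "contig_id", "contig_length", "filter_id"}
    # Carry over everything that is not being restructured (chunk sizes, etc.).
    new = {key: value for key, value in old.items() if key not in replaced_keys}

    new["samples"] = [{"id": sample_id} for sample_id in old["sample_id"]]

    lengths = old.get("contig_length")
    contigs = []
    for index, contig_id in enumerate(old["contig_id"]):
        record = {"id": contig_id}
        if lengths is not None:
            # Assumed key name; the issue text only shows "id" for contigs.
            record["length"] = lengths[index]
        contigs.append(record)
    new["contigs"] = contigs

    new["filters"] = [
        {"id": filter_id, "description": descriptions.get(filter_id, "")}
        for filter_id in old["filter_id"]
    ]
    return new
```

Keeping each entity as a record means new per-sample or per-contig attributes can be added later without introducing another parallel list.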

jeromekelleher commented 7 months ago

Perhaps we could also make these lists dictionaries keyed by the ID, with a dictionary as the value. That would be more in keeping with the current columns element.

Perhaps the columns element should be renamed to fields or arrays, as that would be less confusing.
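As a rough sketch of the keyed-by-ID alternative, the helper below turns a list of `{"id": ...}` records into a dictionary keyed by ID. The name `key_by_id` is hypothetical and only serves this example.

```python
def key_by_id(records):
    """Convert [{"id": "x", ...}, ...] into {"x": {...}, ...}, rejecting duplicates."""
    keyed = {}
    for record in records:
        attributes = dict(record)  # copy so the input record is not modified
        record_id = attributes.pop("id")
        if record_id in keyed:
            raise ValueError(f"Duplicate id: {record_id}")
        keyed[record_id] = attributes
    return keyed


filters = [
    {"id": "PASS", "description": ""},
    {"id": "s50", "description": "something"},
    {"id": "q10", "description": "something else"},
]
filters_by_id = key_by_id(filters)
```

One trade-off to weigh: JSON objects are formally unordered, whereas the order of samples and contigs matters when it maps to array indexes. Python's `json` module preserves insertion order when reading and writing, but other consumers of the schema may not, so the list form is the safer choice if order must survive a round trip.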

jeromekelleher commented 7 months ago

TODO added about IcfMetadata:

# TODO refactor this to have embedded Contig dataclass, Filters
# and Samples dataclasses to allow for more information to be
# retained and forward compatibility.


@dataclasses.dataclass
class IcfMetadata:
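A sketch of what the refactoring described in the TODO might look like is given below. This is not the bio2zarr implementation: the class `MetadataSketch`, its `asdict`/`fromdict` methods, and the `length` field on `Contig` are assumptions for illustration, and the real `IcfMetadata` class has other fields that are not shown here.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class Sample:
    id: str


@dataclasses.dataclass
class Contig:
    id: str
    length: Optional[int] = None  # assumed field; None when the length is unknown


@dataclasses.dataclass
class Filter:
    id: str
    description: str = ""


@dataclasses.dataclass
class MetadataSketch:
    """Hypothetical stand-in showing only the embedded record lists."""

    samples: List[Sample]
    contigs: List[Contig]
    filters: List[Filter]

    def asdict(self):
        # dataclasses.asdict recurses into the nested dataclasses and lists.
        return dataclasses.asdict(self)

    @staticmethod
    def fromdict(d):
        return MetadataSketch(
            samples=[Sample(**s) for s in d["samples"]],
            contigs=[Contig(**c) for c in d["contigs"]],
            filters=[Filter(**f) for f in d["filters"]],
        )
```

Because every record is its own dataclass, a new attribute with a default value can be added later and older JSON documents that lack it will still load through `fromdict`.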