sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Char fields added as Unicode not string #268

Closed: jeromekelleher closed this issue 2 weeks ago

jeromekelleher commented 2 weeks ago

Per the spec, Char fields should have dtype |S1, but we are currently outputting "<U1", e.g.

cat field_type_combos.vcf.vcz/variant_IC1/.zarray 
{
    "chunks": [
        10000
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 7,
        "cname": "zstd",
        "id": "blosc",
        "shuffle": 0
    },
    "dimension_separator": "/",
    "dtype": "<U1",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        208
    ],
    "zarr_format": 2
}
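
For readers unfamiliar with the dtype strings, here is a minimal sketch of how "<U1" differs from "|S1". It is not bio2zarr code: the `describe_dtype` helper and the abridged `ZARRAY_TEXT` are hypothetical names introduced for illustration, and the helper only handles the `S` and `U` kinds discussed in this issue. It relies on the NumPy type-string convention that Zarr v2 metadata uses (byte-order character, kind character, length).

```python
import json

# Abridged copy of the .zarray metadata quoted above (only the fields used here).
ZARRAY_TEXT = '{"dtype": "<U1", "shape": [208], "zarr_format": 2}'

BYTE_ORDERS = {"<": "little-endian", ">": "big-endian", "|": "not applicable"}
KINDS = {"S": "fixed-width bytes", "U": "fixed-width Unicode"}


def describe_dtype(typestr):
    """Split a NumPy-style type string such as '<U1' into its parts.

    Hypothetical helper: only the 'S' and 'U' kinds are supported.
    """
    byte_order, kind, count = typestr[0], typestr[1], int(typestr[2:])
    # NumPy stores each 'U' character as a 4-byte code point,
    # while 'S' uses one byte per element.
    item_bytes = count * 4 if kind == "U" else count
    return {
        "byte_order": BYTE_ORDERS[byte_order],
        "kind": KINDS[kind],
        "length": count,
        "item_bytes": item_bytes,
    }


# What is currently written versus what the spec asks for.
actual = describe_dtype(json.loads(ZARRAY_TEXT)["dtype"])
expected = describe_dtype("|S1")
print("actual:  ", actual)
print("expected:", expected)
```

So the array is being stored as one-character Unicode strings (4 bytes per item in NumPy's in-memory layout) rather than single bytes.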
jeromekelleher commented 2 weeks ago

Ah yes - it's not actually clear we want to do this: https://github.com/sgkit-dev/vcf-zarr-spec/issues/14

I think it would be simpler if we standardised on UTF-8 going forward, so I'm going to close this as a "wontfix".