sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

read_plink returns bytes for variant_alleles not unicode #1209

Open jeromekelleher opened 8 months ago

jeromekelleher commented 8 months ago

I don't think there's any good reason to return bytes rather than UTF-8 unicode strings here. It can only lead to bugs in user code and inconsistencies in string handling (anyone remember Python 2???).

This is based on the "example" plink dataset in the test suite:

    sg_ds = sgkit.io.plink.read_plink(path=path)
    print(sg_ds.variant_allele.values)
    print(sg_ds.variant_allele)

Gives

[[b'A' b'G']
 [b'T' b'C']]
<xarray.DataArray 'variant_allele' (variants: 2, alleles: 2)>
dask.array<astype, shape=(2, 2), dtype=|S1, chunksize=(2, 1), chunktype=numpy.ndarray>
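For anyone hitting this in the meantime, a possible user-side workaround is to decode the array after loading. The sketch below uses only NumPy, with a hand-built `S1` array standing in for `sg_ds.variant_allele.values` (the array contents are copied from the output above; the variable names are just for illustration). It is not a proposal for how `read_plink` itself should be fixed.

```python
import numpy as np

# Stand-in for sg_ds.variant_allele.values from the report above
# (assumption: a small fixed-width bytes array, dtype |S1).
alleles_bytes = np.array([[b"A", b"G"], [b"T", b"C"]], dtype="S1")

# Decode every element from bytes to a Python/NumPy unicode string.
alleles_str = np.char.decode(alleles_bytes, "utf-8")

print(alleles_bytes.dtype.kind)  # bytes dtype kind
print(alleles_str.dtype.kind)    # unicode dtype kind
print(alleles_str.tolist())
```

After decoding, comparisons such as `alleles_str == "A"` behave as expected, whereas comparing the original bytes array against a `str` will not match any element.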