sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
231 stars 32 forks source link

Intermediate variables are not returned if merge=False #405

Open tomwhite opened 3 years ago

tomwhite commented 3 years ago
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2)
>>> ds
<xarray.Dataset>
Dimensions:             (alleles: 2, ploidy: 2, samples: 2, variants: 10)
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_contig      (variants) int64 0 0 0 0 0 0 0 0 0 0
    variant_position    (variants) int64 0 1 2 3 4 5 6 7 8 9
    variant_allele      (variants, alleles) |S1 b'C' b'G' b'A' ... b'C' b'A'
    sample_id           (samples) <U2 'S0' 'S1'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 1 0 1 1
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]

>>> sg.count_variant_alleles(ds)
<xarray.Dataset>
Dimensions:               (alleles: 2, ploidy: 2, samples: 2, variants: 10)
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele_count  (variants, alleles) uint64 dask.array<chunksize=(10, 2), meta=np.ndarray>
    call_allele_count     (variants, samples, alleles) uint8 dask.array<chunksize=(10, 2, 2), meta=np.ndarray>
    variant_contig        (variants) int64 0 0 0 0 0 0 0 0 0 0
    variant_position      (variants) int64 0 1 2 3 4 5 6 7 8 9
    variant_allele        (variants, alleles) |S1 b'C' b'G' b'A' ... b'C' b'A'
    sample_id             (samples) <U2 'S0' 'S1'
    call_genotype         (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 1
    call_genotype_mask    (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]

>>> sg.count_variant_alleles(ds, merge=False)
<xarray.Dataset>
Dimensions:               (alleles: 2, variants: 10)
Dimensions without coordinates: alleles, variants
Data variables:
    variant_allele_count  (variants, alleles) uint64 dask.array<chunksize=(10, 2), meta=np.ndarray>

Notice that for the call with merge=False the intermediate variable call_allele_count is not returned (and is effectively lost). I'm not sure if this behaviour is a problem in practice, but it might be something we want to fix.

jeromekelleher commented 3 years ago

Tricky one...

tomwhite commented 3 years ago

I think it would be nice to maintain this invariant:

merge(ds, method(ds, merge=False)) == method(ds)