sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
235 stars 32 forks source link

How do I... additions #1165

Open hammer opened 10 months ago

hammer commented 10 months ago

Operations I find myself doing regularly

We have a TODO in the docs for Adding custom data to a Dataset so the first 2 solutions should probably go there as well.

hammer commented 10 months ago

Ah I filed the sample metadata part of this already https://github.com/pystatgen/sgkit/issues/1151

hammer commented 10 months ago

Just noting that merging separate contigs could be easy with ds = xr.concat([ds1, ds2], dim='variants') but we also need to update ds.variant_contig, and I think contig_id needs to have its contigs dimension updated, and I think we need to update the contigs entry in the Dataset Attributes. I need to read more about what we do with the contigs dimension...

(Looks like the contigs attribute was deprecated in https://github.com/pystatgen/sgkit/issues/1035, so I'll just update contig_id.)

Eh on second thought that concat was far too optimistic, only want to do variant data variables.

hammer commented 10 months ago

Okay I think this should do it for merging 2 datasets with no overlapping contigs, which I think will be the most common case. If we wanted to do this safely we'd need to build an index over samples and merge on sample_id, which may be worth doing someday...

(This was almost there: somehow contig_id and variant_contig become regular arrays instead of dask arrays, and I need to rechunk everything to be able to write it as Zarr)

def concat_chrs(chr1, chr2):
    new_ds_dict = {}

    # Concatenate contig_id and increment chr2 variant_contig indexes by chr1.contigs.size
    new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
    new_ds_dict['variant_contig'] = xr.concat([chr1.variant_contig, chr2.variant_contig + chr1.contigs.size], dim='variants')

    # Concatenate remaining variant data variables
    data_vars_variants = [
        'call_genotype',
        'call_genotype_mask',
        'variant_allele',
        'variant_id',
        'variant_position',
        ]
    for dv in data_vars_variants:
        new_ds_dict[dv] = xr.concat([chr1[dv], chr2[dv]], dim='variants')

    # Copy over sample data variables from chr1
    data_vars_samples = [
        'sample_family_id',
        'sample_id',
        'sample_maternal_id',
        'sample_member_id',
        'sample_paternal_id',
        'sample_phenotype',
        'sample_sex',
    ]
    for dv in data_vars_samples:
        new_ds_dict[dv] = chr1[dv]

    return xr.Dataset(data_vars=new_ds_dict)