Open hammer opened 10 months ago
Ah I filed the sample metadata part of this already https://github.com/pystatgen/sgkit/issues/1151
Just noting that merging separate contigs could be easy with
ds = xr.concat([ds1, ds2], dim='variants')
but we also need to update ds.variant_contig
, and I think contig_id
needs to have its contigs
dimension updated, and I think we need to update the contigs
entry in the Dataset Attributes. I need to read more about what we do with the contigs dimension...
(Looks like the contigs
attribute was deprecated in https://github.com/pystatgen/sgkit/issues/1035, so I'll just update contig_id
.)
Eh on second thought that concat
was far too optimistic, only want to do variant data variables.
Okay I think this should do it for merging 2 datasets with no overlapping contigs, which I think will be the most common case. If we wanted to do this safely we'd need to build an index over samples and merge on sample_id
, which may be worth doing someday...
(This was almost there: somehow contig_id
and variant_contig
become regular arrays instead of dask arrays, and I need to rechunk everything to be able to write it as Zarr)
def concat_chrs(chr1, chr2):
new_ds_dict = {}
# Concatenate contig_id and increment chr2 variant_contig indexes by chr1.contigs.size
new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
new_ds_dict['variant_contig'] = xr.concat([chr1.variant_contig, chr2.variant_contig + chr1.contigs.size], dim='variants')
# Concatenate remaining variant data variables
data_vars_variants = [
'call_genotype',
'call_genotype_mask',
'variant_allele',
'variant_id',
'variant_position',
]
for dv in data_vars_variants:
new_ds_dict[dv] = xr.concat([chr1[dv], chr2[dv]], dim='variants')
# Copy over sample data variables from chr1
data_vars_samples = [
'sample_family_id',
'sample_id',
'sample_maternal_id',
'sample_member_id',
'sample_paternal_id',
'sample_phenotype',
'sample_sex',
]
for dv in data_vars_samples:
new_ds_dict[dv] = chr1[dv]
return xr.Dataset(data_vars=new_ds_dict)
Operations I find myself doing regularly
We have a TODO in the docs for Adding custom data to a Dataset so the first 2 solutions should probably go there as well.