malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Simplify handling of empty datasets #89

Open alimanfoo opened 2 years ago

alimanfoo commented 2 years ago

When access haplotypes data, there are cases where there are no data for a given sample set and analysis, e.g., for the "arab" analysis where a sample set has no arabiensis samples. Currently the approach is to return None in cases like this.

Similarly, when accessing CNV discordant read calls data, there are no data for some contigs. Again the current approach is to return None.

However, it would probably be simpler, internally at least, if in these cases a dataset was returned, but with a 0 length dimension. E.g., if there are no samples for a given haplotypes dataset, return a dataset with 0 length samples dimension. E.g., if there are no variants for a CNV dataset for a given contig, return a dataset with a 0 length variants dimension.

leehart commented 1 year ago

Noting here

haplotypes() has:

for r in region:
    ly = []

    for s in sample_sets:
        y = self._haplotypes_dataset(
            contig=r.contig,
            sample_set=s,
            analysis=analysis,
            inline_array=inline_array,
            chunks=chunks,
        )
        if y is not None:
            ly.append(y)

    if len(ly) == 0:
        debug("early out, no data for given sample sets and analysis")
        return None

_haplotypes_dataset() has:

root = self.open_haplotypes(sample_set=sample_set, analysis=analysis)
sites = self.open_haplotype_sites(analysis=analysis)
if root is None:
    return None

In contrast, cnv_discordant_read_calls() has:

lx = []
for c in contig:

    ly = []
    for s in sample_sets:
        y = self._cnv_discordant_read_calls_dataset(
            contig=c,
            sample_set=s,
            inline_array=inline_array,
            chunks=chunks,
        )
        ly.append(y)

    x = xarray_concat(ly, dim=DIM_SAMPLE)
    lx.append(x)

and _cnv_discordant_read_calls_dataset() has:

root = self.open_cnv_discordant_read_calls(sample_set=sample_set)
if contig not in root:
    raise ValueError(f"no CNVs available for contig {contig!r}")
leehart commented 1 month ago

Since previous comments, the code for haplotypes() appears to have changed to raise a ValueError instead of returning None.

                if len(ly) == 0:
                    # Bail out, no data for given sample sets and analysis.
                    raise ValueError(
                        f"No samples found for phasing analysis {analysis!r}"
                    )
leehart commented 1 month ago

After attempting to create empty xarray datasets with the correct dimensions, coordinates, variables and attributes (which doesn't appear to be as simple as I first imagined) I'm starting to question if we still actually want to be doing this?

Do we have any clear cases where we want to get an empty dataset rather than a descriptive ValueError (where previously None was returned)? E.g. downstream concatenation.