malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
14 stars 24 forks source link

Simplify handling of empty datasets #89

Open alimanfoo opened 3 years ago

alimanfoo commented 3 years ago

When access haplotypes data, there are cases where there are no data for a given sample set and analysis, e.g., for the "arab" analysis where a sample set has no arabiensis samples. Currently the approach is to return None in cases like this.

Similarly, when accessing CNV discordant read calls data, there are no data for some contigs. Again the current approach is to return None.

However, it would probably be simpler, internally at least, if in these cases a dataset was returned, but with a 0 length dimension. E.g., if there are no samples for a given haplotypes dataset, return a dataset with 0 length samples dimension. E.g., if there are no variants for a CNV dataset for a given contig, return a dataset with a 0 length variants dimension.

leehart commented 1 year ago

Noting here

haplotypes() has:

for r in region:
    ly = []

    for s in sample_sets:
        y = self._haplotypes_dataset(
            contig=r.contig,
            sample_set=s,
            analysis=analysis,
            inline_array=inline_array,
            chunks=chunks,
        )
        if y is not None:
            ly.append(y)

    if len(ly) == 0:
        debug("early out, no data for given sample sets and analysis")
        return None

_haplotypes_dataset() has:

root = self.open_haplotypes(sample_set=sample_set, analysis=analysis)
sites = self.open_haplotype_sites(analysis=analysis)
if root is None:
    return None

In contrast, cnv_discordant_read_calls() has:

lx = []
for c in contig:

    ly = []
    for s in sample_sets:
        y = self._cnv_discordant_read_calls_dataset(
            contig=c,
            sample_set=s,
            inline_array=inline_array,
            chunks=chunks,
        )
        ly.append(y)

    x = xarray_concat(ly, dim=DIM_SAMPLE)
    lx.append(x)

and _cnv_discordant_read_calls_dataset() has:

root = self.open_cnv_discordant_read_calls(sample_set=sample_set)
if contig not in root:
    raise ValueError(f"no CNVs available for contig {contig!r}")
leehart commented 4 months ago

Since previous comments, the code for haplotypes() appears to have changed to raise a ValueError instead of returning None.

                if len(ly) == 0:
                    # Bail out, no data for given sample sets and analysis.
                    raise ValueError(
                        f"No samples found for phasing analysis {analysis!r}"
                    )
leehart commented 3 months ago

After attempting to create empty xarray datasets with the correct dimensions, coordinates, variables and attributes (which doesn't appear to be as simple as I first imagined) I'm starting to question if we still actually want to be doing this?

Do we have any clear cases where we want to get an empty dataset rather than a descriptive ValueError (where previously None was returned)? E.g. downstream concatenation.

leehart commented 2 months ago

@alimanfoo I'm tempted to close this issue as "not planned" or "stale". Missing CNV and haplotypes data are still being handled without returning zero-length datasets.