Open alimanfoo opened 3 years ago
Noting here
haplotypes()
has:
for r in region:
ly = []
for s in sample_sets:
y = self._haplotypes_dataset(
contig=r.contig,
sample_set=s,
analysis=analysis,
inline_array=inline_array,
chunks=chunks,
)
if y is not None:
ly.append(y)
if len(ly) == 0:
debug("early out, no data for given sample sets and analysis")
return None
_haplotypes_dataset()
has:
root = self.open_haplotypes(sample_set=sample_set, analysis=analysis)
sites = self.open_haplotype_sites(analysis=analysis)
if root is None:
return None
In contrast, cnv_discordant_read_calls()
has:
lx = []
for c in contig:
ly = []
for s in sample_sets:
y = self._cnv_discordant_read_calls_dataset(
contig=c,
sample_set=s,
inline_array=inline_array,
chunks=chunks,
)
ly.append(y)
x = xarray_concat(ly, dim=DIM_SAMPLE)
lx.append(x)
and _cnv_discordant_read_calls_dataset()
has:
root = self.open_cnv_discordant_read_calls(sample_set=sample_set)
if contig not in root:
raise ValueError(f"no CNVs available for contig {contig!r}")
Since previous comments, the code for haplotypes()
appears to have changed to raise a ValueError
instead of returning None
.
if len(ly) == 0:
# Bail out, no data for given sample sets and analysis.
raise ValueError(
f"No samples found for phasing analysis {analysis!r}"
)
After attempting to create empty xarray datasets with the correct dimensions, coordinates, variables and attributes (which doesn't appear to be as simple as I first imagined) I'm starting to question if we still actually want to be doing this?
Do we have any clear cases where we want to get an empty dataset rather than a descriptive ValueError
(where previously None
was returned)? E.g. downstream concatenation.
@alimanfoo I'm tempted to close this issue as "not planned" or "stale". Missing CNV and haplotypes data are still being handled without returning zero-length datasets.
When access haplotypes data, there are cases where there are no data for a given sample set and analysis, e.g., for the "arab" analysis where a sample set has no arabiensis samples. Currently the approach is to return
None
in cases like this.Similarly, when accessing CNV discordant read calls data, there are no data for some contigs. Again the current approach is to return
None
.However, it would probably be simpler, internally at least, if in these cases a dataset was returned, but with a 0 length dimension. E.g., if there are no samples for a given haplotypes dataset, return a dataset with 0 length samples dimension. E.g., if there are no variants for a CNV dataset for a given contig, return a dataset with a 0 length variants dimension.