Better error message when accessing CNV data and some data is missing #363

Open alimanfoo opened 1 year ago

alimanfoo commented 1 year ago

When trying to call functions like ag3.gene_cnv('AGAP001356'), if there are any releases where CNV data is not yet available, this will generate a cryptic error message:

FileNotFoundError: '.zmetadata'

It could be useful either to give a clearer error message or to just return data from the sample sets that do have CNV data.
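
A minimal sketch of the first option, assuming the missing data surfaces as a bare FileNotFoundError raised while opening the CNV store for one sample set. The names MissingCnvDataError, open_cnv_data and fake_opener are hypothetical and are not part of malariagen_data; fake_opener only stands in for the real data access.

```python
class MissingCnvDataError(FileNotFoundError):
    """Hypothetical error type for this sketch; not part of malariagen_data."""


def open_cnv_data(sample_set, opener):
    """Call opener(sample_set) and translate a bare FileNotFoundError.

    `opener` is an assumption: it stands in for whatever function
    actually opens the CNV data for a single sample set.
    """
    try:
        return opener(sample_set)
    except FileNotFoundError as e:
        # Keep the original exception as the cause so no detail is lost.
        raise MissingCnvDataError(
            f"CNV data not available for sample set {sample_set!r}"
        ) from e


def fake_opener(sample_set):
    # Stand-in for real data access: only one sample set has CNV data.
    if sample_set != "set-with-cnv":
        raise FileNotFoundError(".zmetadata")
    return {"sample_set": sample_set}
```

Because the hypothetical error subclasses FileNotFoundError, any existing code that already catches FileNotFoundError would keep working.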

leehart commented 1 year ago

It looks like this is only happening (in 7.5.0) when the pre=True option is set, e.g. in Colab:

!pip install -q malariagen-data
import malariagen_data

No error:

ag3 = malariagen_data.Ag3()
gene_CNV_freq = ag3.gene_cnv_frequencies(region='2L', cohorts='admin1_year')


Error:

ag3_pre = malariagen_data.Ag3(pre=True)
pre_gene_CNV_freq = ag3_pre.gene_cnv_frequencies(region='2L', cohorts='admin1_year')

alimanfoo commented 1 year ago

Yep this error will come and go, depending on the state of the data. If at any point there is an incomplete data release, where sample metadata and SNP data are present but CNV data are not present yet, you will hit this.

Also IIRC we may have a specific problem with Ag3.7 because some sample sets will never have CNV calls because they are outside the species we run CNV calling on.

leehart commented 4 months ago

@cclarkson @ahernank

leehart commented 2 months ago

@alimanfoo As mentioned in #555, it looks like skipping sample sets that have no CNV HMM data during cnv_hmm(), which is used by _gene_cnv() and indirectly by gene_cnv_frequencies(), would avoid the error. However, it's not clear to me whether this has any unintended implications, e.g. statistically or in plots, which might somehow be misleading:

            lx = []
            for r in regions:
                ly = []
                for s in sample_sets:
                    y = self._cnv_hmm_dataset(...)  # arguments elided in the original comment

                    # If no CNV HMM dataset was found then skip
                    if y is None:
                        continue

                    ly.append(y)

                debug("concatenate data from multiple sample sets")
                x = simple_xarray_concat(ly, dim=DIM_SAMPLE)

                debug("handle region, do this only once - optimisation")
                if r.start is not None or r.end is not None:
                    start = x["variant_position"].values
                    end = x["variant_end"].values
                    index = pd.IntervalIndex.from_arrays(start, end, closed="both")
                    # noinspection PyArgumentList
                    other = pd.Interval(r.start, r.end, closed="both")
                    loc_region = index.overlaps(other)  # type: ignore
                    x = x.isel(variants=loc_region)
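
One way to reduce the risk of misleading results when skipping is to keep a record of what was skipped and to fail clearly if nothing is left to concatenate. The following is only a sketch of that idea in plain Python, not the library's implementation: collect_available and toy_load are hypothetical names, and toy_load stands in for a loader that returns None when a sample set has no CNV HMM data.

```python
def collect_available(sample_sets, load):
    """Return (datasets, skipped) for the given sample sets.

    `load` is assumed to return None when a sample set has no data.
    """
    datasets, skipped = [], []
    for s in sample_sets:
        y = load(s)
        if y is None:
            # Record the skip so callers can report it to the user.
            skipped.append(s)
            continue
        datasets.append(y)
    if not datasets:
        # Nothing to concatenate: fail with a clear message.
        raise ValueError(f"No CNV HMM data found for sample sets: {skipped}")
    return datasets, skipped


def toy_load(sample_set):
    # Stand-in loader: pretend only sample sets "a" and "c" have data.
    available = {"a": [1, 2], "c": [3]}
    return available.get(sample_set)
```

The skipped list could then be surfaced in a warning or in plot annotations, so that users can see which sample sets did not contribute to a frequency estimate.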
