malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Better error message when accessing CNV data and some data is missing #363

Open alimanfoo opened 1 year ago

alimanfoo commented 1 year ago

When trying to call functions like ag3.gene_cnv('AGAP001356'), if there are any releases where CNV data is not yet available, this will generate a cryptic error message:

FileNotFoundError: '.zmetadata'

Could be useful to either give a clearer error message, or just return data from the sample_sets that do have them.

leehart commented 1 year ago

It looks like this is only happening (in 7.5.0) when the pre=True option is set, e.g. in Colab

!pip install -q malariagen-data
import malariagen_data

No error:

ag3 = malariagen_data.Ag3()
gene_CNV_freq = ag3.gene_cnv_frequencies(region='2L', cohorts='admin1_year')

Error:

ag3_pre = malariagen_data.Ag3(pre=True)
pre_gene_CNV_freq = ag3_pre.gene_cnv_frequencies(region='2L', cohorts='admin1_year')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in open_cnv_hmm(self, sample_set)
    764         try:
--> 765             return self._cache_cnv_hmm[sample_set]
    766         except KeyError:

KeyError: '1230-VO-MULTI-AYALA-AYDI-GA-2204'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
25 frames
[/usr/local/lib/python3.9/dist-packages/fsspec/mapping.py](https://localhost:8080/#) in __getitem__(self, key, default)
    142         try:
--> 143             result = self.fs.cat(k)
    144         except self.missing_exceptions:

[/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    114         self = obj or args[0]
--> 115         return sync(self.loop, func, *args, **kwargs)
    116 

[/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py](https://localhost:8080/#) in sync(loop, func, timeout, *args, **kwargs)
     99     elif isinstance(return_result, BaseException):
--> 100         raise return_result
    101     else:

[/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py](https://localhost:8080/#) in _runner(event, coro, result, timeout)
     54     try:
---> 55         result[0] = await coro
     56     except Exception as ex:

[/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py](https://localhost:8080/#) in _cat(self, path, recursive, on_error, batch_size, **kwargs)
    413             if ex:
--> 414                 raise ex
    415         if (

[/usr/lib/python3.9/asyncio/tasks.py](https://localhost:8080/#) in wait_for(fut, timeout, loop)
    441     if timeout is None:
--> 442         return await fut
    443 

[/usr/local/lib/python3.9/dist-packages/gcsfs/core.py](https://localhost:8080/#) in _cat_file(self, path, start, end, **kwargs)
    865             head = {}
--> 866         headers, out = await self._call("GET", u2, headers=head)
    867         return out

[/usr/local/lib/python3.9/dist-packages/gcsfs/core.py](https://localhost:8080/#) in _call(self, method, path, json_out, info_out, *args, **kwargs)
    417         logger.debug(f"{method.upper()}: {path}, {args}, {kwargs.get('headers')}")
--> 418         status, headers, info, contents = await self._request(
    419             method, path, *args, **kwargs

<decorator-gen-119> in _request(self, method, path, headers, json, data, *args, **kwargs)

[/usr/local/lib/python3.9/dist-packages/gcsfs/retry.py](https://localhost:8080/#) in retry_request(func, retries, *args, **kwargs)
    113                 await asyncio.sleep(min(random.random() + 2 ** (retry - 1), 32))
--> 114             return await func(*args, **kwargs)
    115         except (

[/usr/local/lib/python3.9/dist-packages/gcsfs/core.py](https://localhost:8080/#) in _request(self, method, path, headers, json, data, *args, **kwargs)
    410 
--> 411             validate_response(status, contents, path, args)
    412             return status, headers, info, contents

[/usr/local/lib/python3.9/dist-packages/gcsfs/retry.py](https://localhost:8080/#) in validate_response(status, content, path, args)
     82         if status == 404:
---> 83             raise FileNotFoundError(path)
     84 

FileNotFoundError: https://storage.googleapis.com/download/storage/v1/b/vo_agam_release/o/v3.8%2Fcnv%2F1230-VO-MULTI-AYALA-AYDI-GA-2204%2Fhmm%2Fzarr%2F.zmetadata?alt=media

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
[/usr/local/lib/python3.9/dist-packages/malariagen_data/util.py](https://localhost:8080/#) in __getitem__(self, key)
    118             try:
--> 119                 return self._mutable_mapping[key]
    120             except KeyError as e:

[/usr/local/lib/python3.9/dist-packages/fsspec/mapping.py](https://localhost:8080/#) in __getitem__(self, key, default)
    146                 return default
--> 147             raise KeyError(key)
    148         return result

KeyError: '.zmetadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
[<ipython-input-11-1f552c316ddb>](https://localhost:8080/#) in <cell line: 1>()
----> 1 pre_gene_CNV_freq = ag3_pre.gene_cnv_frequencies(region='2L', cohorts='admin1_year')

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in gene_cnv_frequencies(self, region, cohorts, sample_query, min_cohort_size, sample_sets, drop_invariant, max_coverage_variance)
   1495         debug("access and concatenate data from regions")
   1496         df = pd.concat(
-> 1497             [
   1498                 self._gene_cnv_frequencies(
   1499                     region=r,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in <listcomp>(.0)
   1496         df = pd.concat(
   1497             [
-> 1498                 self._gene_cnv_frequencies(
   1499                     region=r,
   1500                     cohorts=cohorts,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in _gene_cnv_frequencies(self, region, cohorts, sample_query, min_cohort_size, sample_sets, drop_invariant, max_coverage_variance)
   1533 
   1534         debug("get gene copy number data")
-> 1535         ds_cnv = self.gene_cnv(
   1536             region=region,
   1537             sample_sets=sample_sets,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in gene_cnv(self, region, sample_sets, sample_query, max_coverage_variance)
   1339 
   1340         ds = xarray_concat(
-> 1341             [
   1342                 self._gene_cnv(
   1343                     region=r,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in <listcomp>(.0)
   1340         ds = xarray_concat(
   1341             [
-> 1342                 self._gene_cnv(
   1343                     region=r,
   1344                     sample_sets=sample_sets,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in _gene_cnv(self, region, sample_sets, sample_query, max_coverage_variance)
   1360 
   1361         debug("access HMM data")
-> 1362         ds_hmm = self.cnv_hmm(
   1363             region=region.contig,
   1364             sample_sets=sample_sets,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in cnv_hmm(self, region, sample_sets, sample_query, max_coverage_variance, inline_array, chunks)
    897             ly = []
    898             for s in sample_sets:
--> 899                 y = self._cnv_hmm_dataset(
    900                     contig=r.contig,
    901                     sample_set=s,

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in _cnv_hmm_dataset(self, contig, sample_set, inline_array, chunks)
    780 
    781         debug("open zarr")
--> 782         root = self.open_cnv_hmm(sample_set=sample_set)
    783 
    784         debug("variant arrays")

[/usr/local/lib/python3.9/dist-packages/malariagen_data/ag3.py](https://localhost:8080/#) in open_cnv_hmm(self, sample_set)
    769             path = f"{self._base_path}/{release_path}/cnv/{sample_set}/hmm/zarr"
    770             store = init_zarr_store(fs=self._fs, path=path)
--> 771             root = zarr.open_consolidated(store=store)
    772             self._cache_cnv_hmm[sample_set] = root
    773         return root

[/usr/local/lib/python3.9/dist-packages/zarr/convenience.py](https://localhost:8080/#) in open_consolidated(store, metadata_key, mode, **kwargs)
   1297 
   1298     # setup metadata store
-> 1299     meta_store = ConsolidatedStoreClass(store, metadata_key=metadata_key)
   1300 
   1301     # pass through

[/usr/local/lib/python3.9/dist-packages/zarr/storage.py](https://localhost:8080/#) in __init__(self, store, metadata_key)
   2884 
   2885         # retrieve consolidated metadata
-> 2886         meta = json_loads(self.store[metadata_key])
   2887 
   2888         # check format of consolidated metadata

[/usr/local/lib/python3.9/dist-packages/malariagen_data/util.py](https://localhost:8080/#) in __getitem__(self, key)
    120             except KeyError as e:
    121                 # raise a different error to ensure zarr propagates the exception, rather than filling
--> 122                 raise FileNotFoundError(e)
    123 
    124         def __contains__(self, key):

FileNotFoundError: '.zmetadata'
alimanfoo commented 1 year ago

Yep this error will come and go, depending on the state of the data. If at any point there is an incomplete data release, where sample metadata and SNP data are present but CNV data are not present yet, you will hit this.

Also IIRC we may have a specific problem with Ag3.7 because some sample sets will never have CNV calls because they are outside the species we run CNV calling on.

leehart commented 4 months ago

@cclarkson @ahernank

leehart commented 2 months ago

@alimanfoo As mentioned in #555, it looks like if we skip samples without CNV HMM during cnv_hmm(), which is used by _gene_cnv() and indirectly by gene_cnv_frequencies(), then this will avoid the error messages, but it's not clear to me if there are any unintended implications, e.g. statistically or in plots, which might somehow be misleading.

            lx = []
            for r in regions:
                ly = []
                for s in sample_sets:
                    y = self._cnv_hmm_dataset(
                        contig=r.contig,
                        sample_set=s,
                        inline_array=inline_array,
                        chunks=chunks,
                    )

                    # If no CNV HMM dataset was found then skip
                    if y is None:
                        continue

                    ly.append(y)

                debug("concatenate data from multiple sample sets")
                x = simple_xarray_concat(ly, dim=DIM_SAMPLE)

                debug("handle region, do this only once - optimisation")
                if r.start is not None or r.end is not None:
                    start = x["variant_position"].values
                    end = x["variant_end"].values
                    index = pd.IntervalIndex.from_arrays(start, end, closed="both")
                    # noinspection PyArgumentList
                    other = pd.Interval(r.start, r.end, closed="both")
                    loc_region = index.overlaps(other)  # type: ignore
                    x = x.isel(variants=loc_region)

                lx.append(x)