sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Debug "Slicing is producing a large chunk" warning #300

Open eric-czech opened 4 years ago

eric-czech commented 4 years ago

I see this warning when running the function mentioned in https://github.com/pystatgen/sgkit/issues/299 on 1KG data:

/home/eczech/miniconda3/envs/sgkit-dev/lib/python3.8/site-packages/xarray/core/indexing.py:1361: PerformanceWarning:
 Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]

We should figure out how this is possible when the functions applied to the dataset do nothing other than filter within chunks. Presumably the chunks should only shrink, contrary to what the warning suggests.

I haven't been able to reproduce this on simulated data yet.
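
One possible explanation, sketched under assumptions: the check behind this warning (paraphrased from memory of `dask.array.slicing.take`, not copied from the dask source) seems to compare the number of selected indices per output chunk against a threshold derived from the configured `array.chunk-size` and the full extent of the other axes, rather than against the size of the input chunks. If that is right, a filter can shrink every chunk and still trip the warning when the non-indexed axes are long. The function name, the factor of 5, and the shape below are assumptions for illustration only.

```python
import math


def large_chunk_warn_threshold(other_axes_extent, itemsize, chunk_size_bytes=128 * 2**20):
    """Approximate count of selected indices per output chunk above which
    dask would warn. This mirrors an assumed reading of dask's logic, not
    its actual code.

    other_axes_extent: full lengths of the axes that are NOT being indexed
    itemsize: bytes per array element
    chunk_size_bytes: assumed default for `array.chunk-size` (128 MiB)
    """
    other_numel = math.prod(other_axes_extent)
    if other_numel == 0:
        return math.inf
    # Largest run of indices that still fits in one configured chunk
    maxsize = math.ceil(chunk_size_bytes / (other_numel * itemsize))
    # Assumed safety factor before a warning is raised
    return maxsize * 5


# Hypothetical genotype array: 10M variants x N samples x ploidy 2, int8,
# indexed along the samples axis.
threshold = large_chunk_warn_threshold(other_axes_extent=(10_000_000, 2), itemsize=1)
print(threshold)  # 35 for these assumed numbers
```

Under these assumptions, selecting more than 35 samples from any one chunk would be enough to raise the warning, which could explain why it appears on 1KG data but not on small simulated datasets.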

tomwhite commented 4 years ago

I noticed that I get the same warning (on MalariaGEN data) for

ds.isel(samples=ds.sample_cohort != -1)

but not for

ds.isel(samples=ds.sample_cohort >= 0)