scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Docker container + Pycharm + scanpy + anndata #1368

Closed LysSanzMoreta closed 6 months ago

LysSanzMoreta commented 6 months ago

Report

Hi! I am trying to coordinate several things to set up a Docker image that runs as the interpreter for Pycharm. Needless to say, anndata 0.10.5.post1 is installed in the Docker image (via scvi-tools & scverse), and it is behaving weirdly.

I provide the Dockerfile and env.yaml information. The anndata version and dependencies are specified below.

Dockerfile:

FROM mambaorg/micromamba:1.5-jammy-cuda-12.2.2 as micromamba39
COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml

USER root

RUN micromamba install -y -n base -f /tmp/env.yaml && \
    micromamba clean --all --yes
RUN micromamba install -y -n base -c conda-forge scvi-tools==1.0.4
# make the environment's pip the one on PATH
ENV PATH /opt/conda/bin:$PATH
RUN pip install gget

env.yaml

name: base
channels:
  - conda-forge
  - pytorch
  - defaults
  - nvidia
  - bioconda
dependencies:
  - pyopenssl=20.0.1
  - python=3.9.1
  - requests=2.25.1
  - pip
  - numpy=1.23.5
  - numba=0.53.1
  - pytorch::pytorch=2.1=py3.9_cuda12.1*
  - torchvision>=0.16
  - torchaudio
  - biopython>=1.83
  - h5py=3.9.0
  - scipy>=1.12
  - scikit-learn>=1.4.0
  - pandas>=2.2
  - matplotlib>=3.8.2
  - hdf5plugin>=4.1.3
  - pyro-ppl>=1.8
  - scanpy>=1.9.8
  - bioconda::gget>=0.28.4
  - dill>=0.3.8
  - umap-learn>=0.5.5
  - seaborn>=0.13.2
  - dataframe_image
  - xmltodict
  - nbconvert==6.4.3
  - bioconda::cellxgene
  - conda-forge::lightning==2.0.1

The "problematic" code ... which is just reading an h5ad file that you can download from here:

OLD (dataset0b): https://drive.google.com/file/d/1NGo4TEu0WizC3xSta-zaumi6n5eUdOMF/view?usp=sharing

NEW (dataset0): https://drive.google.com/file/d/1uSKeuWi7_CK8-jOUIqX9YrFA_9_MzYp0/view?usp=sharing

import scvi
import anndata
import os
import scanpy as sc
scvi.settings.seed = 0
sc_path = ".../single_cell_data.h5ad"

if os.path.exists(sc_path):
    print("Exists")
    print("Read permissions: {}".format(os.access(sc_path, os.R_OK)))
    print("Write permissions: {}".format(os.access(sc_path, os.W_OK)))
    print("Execute permissions: {}".format(os.access(sc_path, os.X_OK)))
else:
    print("Does not exist")

adata = sc.read_h5ad(sc_path)
print(adata)

Using the Docker image either as the interpreter in Pycharm or by running the image (docker run -it --gpus all -v "source:target:Z" bash) generates the following ERROR:

Exists
Read permissions: True
Write permissions: True
Execute permissions: True
python-BaseException
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 261, in read_h5ad
    adata = read_dispatched(f, callback=callback)
  File "/opt/conda/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py", line 48, in read_dispatched
    return reader.read_elem(elem)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/utils.py", line 207, in func_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/specs/registry.py", line 256, in read_elem
    return self.callback(read_func, elem.name, elem, iospec=iospec)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 242, in callback
    **{
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 245, in <dictcomp>
    k: read_dispatched(elem[k], callback)
  File "/opt/conda/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py", line 48, in read_dispatched
    return reader.read_elem(elem)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/utils.py", line 207, in func_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/specs/registry.py", line 256, in read_elem
    return self.callback(read_func, elem.name, elem, iospec=iospec)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 258, in callback
    return read_dataframe(elem)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 318, in read_dataframe
    return read_dataframe_legacy(group)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/utils.py", line 207, in func_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 305, in read_dataframe_legacy
    _decode_structured_array(
  File "/opt/conda/lib/python3.9/site-packages/anndata/compat/__init__.py", line 241, in _decode_structured_array
    for k, (dt, _) in dtype.fields.items():
AttributeError: 'NoneType' object has no attribute 'items'

However when I install the same environment locally in my host machine, I can read the file perfectly fine. The output is:

Exists
Read permissions: True
Write permissions: True
Execute permissions: True
~/micromamba/lib/python3.9/site-packages/anndata/_core/anndata.py:1906: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 91588 × 589
    obs: 'dataset_id', 'assay', 'suspension_type', 'sex', 'tissue_general', 'tissue', 'cell_type', 'is_primary_data', 'assay_ontology_term_id', 'tissue_name'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

At the beginning I thought the error was due to a lack of read/write/execute permissions between the host and the Docker container mount, but that does not seem to be the case. The error is that dtype.fields becomes None inside anndata's _decode_structured_array function:

def _decode_structured_array(
    arr: np.ndarray, dtype: np.dtype | None = None, copy: bool = False
) -> np.ndarray:
    """
    h5py 3.0 now reads all strings as bytes. There is a helper method which can convert these to strings,
    but there isn't anything for fields of structured dtypes.

    Params
    ------
    arr
        An array with structured dtype
    dtype
        dtype of the array. This is checked for h5py string data types.
        Passing this is allowed for cases where array may have been processed by another function before hand.
    """
    if copy:
        arr = arr.copy()
    if dtype is None:
        dtype = arr.dtype  # arr.dtype here would be correct, but since the passed-in dtype is not None this branch is never taken
    # codecs.decode is 2x slower than this lambda, go figure
    decode = np.frompyfunc(lambda x: x.decode("utf-8"), 1, 1)
    for k, (dt, _) in dtype.fields.items():
        check = h5py.check_string_dtype(dt)
        if check is not None and check.encoding == "utf-8":
            decode(arr[k], out=arr[k])
    return arr

I am not sure where the dtype is specified before this function (since dtype is not None by the time it reaches _decode_structured_array), or why the read behavior differs between the host machine and the Docker container. Note that arr.dtype looks correct but is ignored.
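
For reference, a minimal diagnostic sketch (using the same sc_path as in the snippet above) for inspecting the raw obs dataset directly with h5py, to see the dtype that ends up in _decode_structured_array:

import h5py

with h5py.File(sc_path, "r") as f:
    obs = f["obs"]
    if isinstance(obs, h5py.Dataset):   # legacy layout: obs is a single structured dataset
        print(obs.dtype)                # the dtype handed down to _decode_structured_array
        print(obs.dtype.fields)         # None here is exactly what triggers the AttributeError
    else:                               # current layout: obs is a group of columns
        print(list(obs.keys()))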

Perhaps you have an insight on what could be going on?

Thank you in advance. Let me know if you require anything else.

Versions

-----
anndata             0.10.5.post1
scanpy              1.9.8
scvi                1.0.4
session_info        1.0.0
-----
OpenSSL             24.0.0
PIL                 10.2.0
absl                NA
aiohttp             3.9.3
aiosignal           1.3.1
anyio               NA
async_timeout       4.0.3
attr                23.2.0
botocore            1.34.39
brotli              1.1.0
bs4                 4.12.3
certifi             2024.02.02
cffi                1.16.0
chardet             4.0.0
chex                0.1.83
click               8.1.7
colorama            0.4.6
contextlib2         NA
croniter            NA
cryptography        41.0.7
cycler              0.12.1
cython_runtime      NA
dateutil            2.8.2
deepdiff            6.7.1
defusedxml          0.7.1
dill                0.3.8
docrep              0.3.2
exceptiongroup      1.2.0
fastapi             0.88.0
flax                0.6.1
frozenlist          1.4.1
fsspec              2023.12.2
gmpy2               2.1.2
google              NA
h5py                3.9.0
html5lib            1.1
idna                2.10
importlib_metadata  NA
importlib_resources NA
jax                 0.4.13
jaxlib              0.4.12
jmespath            1.0.1
joblib              1.3.2
kiwisolver          1.4.5
lightning           2.0.1
lightning_cloud     0.5.64
lightning_utilities 0.10.1
llvmlite            0.36.0
lxml                5.1.0
matplotlib          3.8.2
ml_collections      NA
ml_dtypes           0.3.2
mpl_toolkits        NA
mpmath              1.3.0
msgpack             1.0.7
mudata              0.2.3
multidict           6.0.5
multipart           0.0.8
multipledispatch    0.6.0
natsort             8.4.0
numba               0.53.1
numpy               1.23.5
numpyro             0.13.2
opt_einsum          v3.3.0
optax               0.1.9
ordered_set         4.1.0
orjson              3.9.10
packaging           23.2
pandas              2.2.0
pkg_resources       NA
psutil              5.9.8
pydantic            1.10.13
pygments            2.17.2
pyparsing           3.1.1
pyro                1.8.6+4be5c2e
pytz                2024.1
requests            2.25.1
rich                NA
s3fs                0.4.2
scipy               1.12.0
sitecustomize       NA
six                 1.16.0
sklearn             1.4.0
sniffio             1.3.0
socks               1.7.1
soupsieve           2.5
sparse              0.15.1
starlette           0.22.0
sympy               1.12
threadpoolctl       3.2.0
tomli               2.0.1
toolz               0.12.1
torch               2.1.0
torchaudio          2.1.0
torchgen            NA
torchmetrics        1.2.1
torchvision         0.16.0
tqdm                4.66.2
typing_extensions   NA
urllib3             1.26.18
uvicorn             0.27.1
webencodings        0.5.1
websocket           1.7.0
websockets          11.0.3
xarray              2023.7.0
yaml                6.0.1
yarl                1.9.4
zipp                NA
zoneinfo            NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0]
Linux-6.5.0-17-generic-x86_64-with-glibc2.35
-----
Session information updated at 2024-02-12 15:12
ilan-gold commented 6 months ago

@LysSanzMoreta Could you clarify whether the Versions you posted are for the local or the Docker environment? I can actually reproduce this locally on my Mac using the main branch now, so I am interested in how this is working for you locally. From what I can see, the dtype that should hold the mapping between the dataframe's columns and their dtypes simply does not have it.

-----
IPython             8.18.1
anndata             0.11.0.dev74+g71ed758.d20240220
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
attr                23.1.0
awkward             2.5.1
awkward_cpp         NA
cloudpickle         3.0.0
cython_runtime      NA
dask                2023.12.0
dateutil            2.8.2
decorator           5.1.1
executing           2.0.1
fasteners           0.19
h5py                3.10.0
importlib_metadata  NA
jedi                0.19.1
jinja2              3.1.2
markupsafe          2.1.3
msgpack             1.0.7
natsort             8.4.0
numcodecs           0.12.1
numpy               1.26.2
packaging           23.2
pandas              2.1.4
parso               0.8.3
pexpect             4.9.0
prompt_toolkit      3.0.43
psutil              5.9.6
ptyprocess          0.7.0
pure_eval           0.2.2
pyarrow             14.0.1
pygments            2.17.2
pytz                2023.3.post1
rich                NA
scipy               1.11.4
setuptools          69.0.3
setuptools_scm      NA
sitecustomize       NA
six                 1.16.0
sphinxcontrib       NA
stack_data          0.6.3
tblib               3.0.0
tlz                 0.12.0
toolz               0.12.0
traitlets           5.14.0
wcwidth             0.2.12
yaml                6.0.1
zarr                2.16.1
zipp                NA
-----
Python 3.11.6 (main, Nov  2 2023, 04:39:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
macOS-13.6.1-arm64-arm-64bit
-----
Session information updated at 2024-02-20 10:30
ilan-gold commented 6 months ago

One thing I notice is that the dataset from the Google Drive link seems to have shape (91588, 1330) and not (91588, 589) as your "working example" shows. Using the dataset from your link:

import h5py
f = h5py.File('single_cell_data.h5ad', 'r')
f['X']
# <HDF5 dataset "X": shape (91588, 1330), type "<f4">

So just some clarification on what is happening locally would be great. I wonder if the file you sent is somehow corrupted.

LysSanzMoreta commented 6 months ago

@ilan-gold Sorry about the dataset difference; I have two toy examples I was playing with, and both had the same issue, but I have now uploaded the one with 91588 × 589 data points as well.

I also thought the file could be corrupted, but I could not figure out how to check for that, especially since it was working on the local host machine. See the installation script below.

The .yaml file that I was using last week was the same for both the Docker image and the host machine. I have no clue why it was only working on the host machine.

I have to say that I am new to Docker, and last week I managed to crash my computer by being unaware of the need to remove "untagged" images as well ... oops, so I had to reinstall everything from scratch. I have also realized that I should not be so strict with package version pinning, and I have relaxed the .yaml and Dockerfiles. The strict version pinning was making it very hard for micromamba to resolve a working environment, and everything was crashing/incompatible.

So, on that note, I have re-installed everything as stated in the new guidelines below. This has led to me being able to read the file labelled NEW (dataset0) both in the Docker container and on the host machine, while still NOT being able to read the OLD (dataset0b) in the Docker container environment or the host machine. Actually, I am more interested in reading OLD (dataset0b) because it contains more genes, despite the name "OLD" (old is just in terms of this GitHub issue).

There could definitely be some dataset corruption, but there might be some library incompatibility somewhere as well? It is interesting, since both files are generated by the same script ... one just has more genes than the other. With the previous environment I could not read either of them in the Docker container.

I have bypassed the issue by instead reading OLD (dataset0b) with the sc.read_hdf command. So this code works in every environment:

adata = sc.read_hdf(sc_path, key="X")

which makes me think the dataset is not corrupted ... or is it?
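
For what it is worth, one way I could imagine eyeballing whether the HDF5 container itself is intact, independently of anndata's reader, is to walk it with h5py (a rough sketch, using the same sc_path as above):

import h5py

def show(name, obj):
    # print every group/dataset with its shape and dtype so structural damage stands out
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    else:
        print(f"{name}/")

with h5py.File(sc_path, "r") as f:
    f.visititems(show)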

NEW INSTALLATION GUIDELINES

.yaml file

name: base
channels:
  - conda-forge
  - pytorch
  - defaults
  - nvidia
  - bioconda
dependencies:
  - pyopenssl
  - python=3.9
  - requests
  - pip:
    - -r requirements.txt
  - numpy
  - pytorch::pytorch=2.1=py3.9_cuda12.1*
  - torchvision>=0.16
  - torchaudio
  - biopython
  - h5py
  - scipy
  - scikit-learn
  - pandas
  - matplotlib
  - hdf5plugin
  - pyro-ppl
  - scanpy
  - dill
  - umap-learn
  - seaborn
  - dataframe_image
  - xmltodict
  - nbconvert
  - bioconda::cellxgene
  - conda-forge::lightning
  - numba
  - conda-forge::scvi-tools==1.0.4
  - conda-forge::jax[cuda12_pip]
  - conda-forge::jaxlib
  - conda-forge::huggingface_hub
  - conda-forge::leidenalg

requirements.txt

gget

DockerFile

FROM mambaorg/micromamba:1.5-jammy-cuda-12.2.2 as micromamba39
COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml
COPY --chown=$MAMBA_USER:$MAMBA_USER requirements.txt /tmp/requirements.txt
USER root
RUN micromamba install -y -n base -f /tmp/env.yaml && \
    micromamba clean --all --yes
CMD ["/usr/bin/which","python3"] -> "/usr/bin/python3"
CMD ["/usr/bin/python3","/opt/.pycharm_helpers/packaging_tool.py","list"]
CMD ["/usr/bin/python3","/opt/.pycharm_helpers/remote_sync.py","--state-file","/tmp/e06641bc-ea0b-4a32-a780-2ed112e01659/.state.json","/tmp/bacecef2-e56a-4f46-9a40-fc180ff6cdd3"]

For the host machine I install it like:

#!/bin/bash
micromamba env remove -n base
micromamba install -f env.yaml

Scanpy version


scanpy                                1.9.8            pyhd8ed1ab_0                 conda-forge
ilan-gold commented 6 months ago

Could you run the following in both settings and tell me what is in f['obs'].dtype.descr?

import h5py
f = h5py.File('single_cell_data.h5ad', 'r')
print(f['obs'].dtype.descr)
LysSanzMoreta commented 6 months ago

@ilan-gold Yes!

File OLD (dataset0b) (which can NOT be read with sc.read_h5ad but CAN be read with sc.read_hdf)

Docker container print output:

[('', '|O')]

Local host machine:

[('', '|O')]


File NEW (dataset0) (which can be read with both sc.read_h5ad and sc.read_hdf)

Docker container:

print(f['obs'].dtype.descr)
AttributeError: 'Group' object has no attribute 'dtype'

Local host machine:

print(f['obs'].dtype.descr)
AttributeError: 'Group' object has no attribute 'dtype'

So on the good side, with the new environment, it seems like the Docker container and the local environment behave similarly.

I hope the print output sparks an inspiration on what is going on :)

ilan-gold commented 6 months ago

@LysSanzMoreta You said you

still [had] NOT been able to read the OLD(dataset0b) in the Docker container environment or the host machine

which means the environments are behaving consistently now. Is that right? So this is different from when you reported the issue, when they were behaving differently, no?

What this little experiment highlights to me is that I do think the file is corrupted, especially since the behavior is now identical in both environments. f['obs'].dtype.descr is supposed to contain the column names for the obs dataframe, but as we see, it has none in either environment. The reason this experiment doesn't work on the new file is that the way we write dataframes has changed since OLD was written. I would guess that OLD actually is old (or it was produced with a very old environment).
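
To illustrate (a toy sketch with made-up column names, not your data): in a healthy legacy file the column names live in a compound dtype, so .descr shows named fields rather than [('', '|O')]:

import h5py
import numpy as np

obs = np.array(
    [(b"AAACCTG", b"lung"), (b"AAACGGG", b"liver")],
    dtype=[("index", "S16"), ("tissue", "S16")],   # the column names are part of the dtype
)
with h5py.File("legacy_toy.h5ad", "w") as f:
    f.create_dataset("obs", data=obs)

with h5py.File("legacy_toy.h5ad", "r") as f:
    print(f["obs"].dtype.descr)   # [('index', '|S16'), ('tissue', '|S16')] -- named fields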

LysSanzMoreta commented 6 months ago

@ilan-gold Yeah, I would say the environments are working consistently now... (though I never put my hand in the fire for that kind of statement, hehe)

Ok ... interesting. I made both dataframes recently (2023-2024) (first NEW (dataset0) and then OLD (dataset0b)), so that would surprise me. However, I guess it is good that we can pinpoint the file being corrupted somehow... I will perhaps review the pipeline and see what I can do... although with sc.read_hdf I have been able to bypass this issue for now.

If you have other ideas about what it could be, let me know ... I will investigate the obs object, since you point to it as the corrupted part.

Thanks for your time!

ivirshup commented 6 months ago

Hey @LysSanzMoreta, I was talking to Ilan about this, and it sounds like OLD (dataset0b) was written by an old version of anndata, where I may be of more help. Looking at this file now, I think this version of anndata may even pre-date me 😅.

Do you have the environment that created this file? I'd be curious to see what version of anndata wrote it.

cc: @flying-sheep

LysSanzMoreta commented 6 months ago

@ivirshup Well, let me see

The way that I am creating the files is by appending a bunch of gget adata files to each other, like this:

max_genes = len(genes_list)
filled_dataframe = False
for genes in genes_sets:
    adata = gget.cellxgene(....)
    X_df = adata.X
    # .... some processing to remove duplicated genes etc. ....
    if not filled_dataframe:  # first time appending to the hdf5 file
        print("First time appending to the hdf5 file")
        all_tissues_storage.create_dataset('X',
                                           data=X_df.to_numpy(),
                                           compression="gzip",
                                           chunks=True,
                                           maxshape=(None, max_genes))
        # .... append the other keys (col_names, row_names, obs, var, ...) similarly ....
        filled_dataframe = True  # mark that the datasets now exist
    else:
        n_x = X_df.shape[0]
        all_tissues_storage['X'].resize((all_tissues_storage['X'].shape[0] + n_x), axis=0)
        all_tissues_storage['X'][-n_x:] = X_df.to_numpy()
        # .... append the other keys similarly ....

An example of the gget adata file is: https://drive.google.com/file/d/1888V6da32RKRhG4vzj0B1eQ7oGfXTZty/view?usp=drive_link

The downloaded gget adata files (I have checked a few of them) respond like this:

i) Cannot be read with sc.read_hdf()

   adata  =  sc.read_hdf(sc_path, key="X")
  File "/home/lys/micromamba/lib/python3.9/site-packages/anndata/_io/read.py", line 130, in read_hdf
    X = f[key][()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/lys/micromamba/lib/python3.9/site-packages/h5py/_hl/group.py", line 359, in __getitem__
    raise TypeError("Accessing a group is done with bytes or str, "
TypeError: Accessing a group is done with bytes or str, not <class 'tuple'>

ii) Can be read with sc.read_h5ad()

iii) Do not have the dtype attribute on f['obs'] after opening with f = h5py.File(sc_path, 'r')

    print(f['obs'].dtype.descr)
AttributeError: 'Group' object has no attribute 'dtype'

My current exact environment is:

Env info ``` name: base channels: - bioconda - conda-forge - nvidia - pytorch dependencies: - _libgcc_mutex=0.1=conda_forge - _openmp_mutex=4.5=2_kmp_llvm - absl-py=2.1.0=pyhd8ed1ab_0 - aiohttp=3.9.3=py39hd1e30aa_0 - aiosignal=1.3.1=pyhd8ed1ab_0 - alsa-lib=1.2.10=hd590300_0 - aniso8601=9.0.1=pyhd8ed1ab_0 - anndata=0.10.5.post1=pyhd8ed1ab_0 - annotated-types=0.6.0=pyhd8ed1ab_0 - anyio=3.7.1=pyhd8ed1ab_0 - aom=3.7.1=h59595ed_0 - arpack=3.8.0=nompi_h0baa96a_101 - array-api-compat=1.4.1=pyhd8ed1ab_0 - arrow=1.3.0=pyhd8ed1ab_0 - async-timeout=4.0.3=pyhd8ed1ab_0 - attr=2.5.1=h166bdaf_1 - attrs=23.2.0=pyh71513ae_0 - azure-core-cpp=1.10.3=h91d86a7_1 - azure-storage-blobs-cpp=12.10.0=h00ab1b0_0 - azure-storage-common-cpp=12.5.0=hb858b4b_2 - backoff=2.2.1=pyhd8ed1ab_0 - beautifulsoup4=4.12.3=pyha770c72_0 - biopython=1.83=py39hd1e30aa_0 - blas=2.116=mkl - blas-devel=3.9.0=16_linux64_mkl - bleach=6.1.0=pyhd8ed1ab_0 - blessed=1.19.1=pyhe4f9e05_2 - blinker=1.7.0=pyhd8ed1ab_0 - blosc=1.21.5=h0f2a231_0 - boto3=1.34.41=pyhd8ed1ab_0 - botocore=1.34.41=pyhd8ed1ab_0 - brotli=1.1.0=hd590300_1 - brotli-bin=1.1.0=hd590300_1 - brotli-python=1.1.0=py39h3d6467e_1 - bzip2=1.0.8=hd590300_5 - c-ares=1.26.0=hd590300_0 - ca-certificates=2024.2.2=hbcca054_0 - cachecontrol=0.13.1=pyhd8ed1ab_0 - cachecontrol-with-filecache=0.13.1=pyhd8ed1ab_0 - cached-property=1.5.2=hd8ed1ab_1 - cached_property=1.5.2=pyha770c72_1 - cairo=1.18.0=h3faef2a_0 - cellxgene=0.16.7=py_0 - certifi=2024.2.2=pyhd8ed1ab_0 - cffi=1.16.0=py39h7a31438_0 - charset-normalizer=3.3.2=pyhd8ed1ab_0 - chex=0.1.85=pyhd8ed1ab_0 - cleo=2.1.0=pyhd8ed1ab_0 - click=8.1.7=unix_pyh707e725_0 - colorama=0.4.6=pyhd8ed1ab_0 - contextlib2=21.6.0=pyhd8ed1ab_0 - contourpy=1.2.0=py39h7633fee_0 - crashtest=0.4.1=pyhd8ed1ab_0 - croniter=1.4.1=pyhd8ed1ab_0 - cryptography=42.0.2=py39he6105cc_0 - cuda-cudart=12.1.105=0 - cuda-cupti=12.1.105=0 - cuda-libraries=12.1.0=0 - cuda-nvrtc=12.1.105=0 - cuda-nvtx=12.1.105=0 - cuda-opencl=12.3.101=0 - cuda-runtime=12.1.0=0 - cuda-version=11.8=h70ddcb2_2 - cudatoolkit=11.8.0=h4ba93d1_13 - cudnn=8.8.0.121=hd5ab71f_4 - cycler=0.12.1=pyhd8ed1ab_0 - dataframe_image=0.1.1=py_0 - dateutils=0.6.12=py_0 - dav1d=1.2.1=hd590300_0 - dbus=1.13.6=h5008d03_3 - deepdiff=6.7.1=pyhd8ed1ab_0 - defusedxml=0.7.1=pyhd8ed1ab_0 - dill=0.3.8=pyhd8ed1ab_0 - distlib=0.3.8=pyhd8ed1ab_0 - docrep=0.3.2=pyh44b312d_0 - dulwich=0.21.7=py39hd1e30aa_0 - entrypoints=0.4=pyhd8ed1ab_0 - et_xmlfile=1.1.0=pyhd8ed1ab_0 - etils=1.6.0=pyhd8ed1ab_0 - exceptiongroup=1.2.0=pyhd8ed1ab_2 - expat=2.5.0=hcb278e6_1 - fastapi=0.109.2=pyhd8ed1ab_0 - fastobo=0.12.3=py39he10ea66_0 - ffmpeg=6.1.1=gpl_hf3b701a_101 - filelock=3.13.1=pyhd8ed1ab_0 - flask=3.0.2=pyhd8ed1ab_0 - flask-compress=1.14=pyhd8ed1ab_0 - flask-cors=4.0.0=pyhd8ed1ab_0 - flask-restful=0.3.10=pyhd8ed1ab_0 - flask-server-timing=0.1.2=pyh9f0ad1d_0 - flask-talisman=1.1.0=pyhd8ed1ab_0 - flatten-dict=0.4.2=pyhd8ed1ab_1 - flax=0.8.1=pyhd8ed1ab_0 - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 - font-ttf-inconsolata=3.000=h77eed37_0 - font-ttf-source-code-pro=2.038=h77eed37_0 - font-ttf-ubuntu=0.83=h77eed37_1 - fontconfig=2.14.2=h14ed4e7_0 - fonts-conda-ecosystem=1=0 - fonts-conda-forge=1=0 - fonttools=4.48.1=py39hd1e30aa_0 - freetype=2.12.1=h267a509_2 - fribidi=1.0.10=h516909a_0 - frozenlist=1.4.1=py39hd1e30aa_0 - fsspec=2024.2.0=pyhca7485f_0 - get-annotations=0.1.2=pyhd8ed1ab_0 - gettext=0.21.1=h27087fc_0 - glib=2.78.3=hfc55251_0 - glib-tools=2.78.3=hfc55251_0 - glpk=5.0=h445213a_0 - gmp=6.3.0=h59595ed_0 - gmpy2=2.1.2=py39h376b7d2_1 - 
gnutls=3.7.9=hb077bed_0 - graphite2=1.3.13=he1b5a44_1001 - greenlet=3.0.3=py39h3d6467e_0 - gst-plugins-base=1.22.9=h8e1006c_0 - gstreamer=1.22.9=h98fc4e7_0 - gunicorn=21.2.0=py39hf3d152e_1 - h11=0.14.0=pyhd8ed1ab_0 - h5py=3.10.0=nompi_py39h2c511df_101 - harfbuzz=8.3.0=h3d44ed6_0 - hdf5=1.14.3=nompi_h4f84152_100 - hdf5plugin=4.4.0=py39hd6052ec_0 - huggingface_hub=0.20.2=pyhd8ed1ab_0 - icu=73.2=h59595ed_0 - idna=3.6=pyhd8ed1ab_0 - igraph=0.10.10=h153f77b_0 - importlib-metadata=7.0.1=pyha770c72_0 - importlib-resources=6.1.1=pyhd8ed1ab_0 - importlib_metadata=7.0.1=hd8ed1ab_0 - importlib_resources=6.1.1=pyhd8ed1ab_0 - inquirer=3.1.4=pyhd8ed1ab_0 - itsdangerous=2.1.2=pyhd8ed1ab_0 - jaraco.classes=3.3.1=pyhd8ed1ab_0 - jax=0.4.23=pyhd8ed1ab_0 - jaxlib=0.4.23=cuda118py39hb35ebbd_200 - jeepney=0.8.0=pyhd8ed1ab_0 - jinja2=3.1.3=pyhd8ed1ab_0 - jmespath=1.0.1=pyhd8ed1ab_0 - joblib=1.3.2=pyhd8ed1ab_0 - jsonschema=4.21.1=pyhd8ed1ab_0 - jsonschema-specifications=2023.12.1=pyhd8ed1ab_0 - jupyter_client=8.6.0=pyhd8ed1ab_0 - jupyter_core=5.7.1=py39hf3d152e_0 - jupyterlab_pygments=0.3.0=pyhd8ed1ab_1 - keyring=24.3.0=py39hf3d152e_0 - keyutils=1.6.1=h166bdaf_0 - kiwisolver=1.4.5=py39h7633fee_1 - krb5=1.21.2=h659d440_0 - lame=3.100=h166bdaf_1003 - lcms2=2.16=hb7c19ff_0 - ld_impl_linux-64=2.40=h41732ed_0 - leidenalg=0.10.2=py39h3d6467e_0 - lerc=4.0.0=h27087fc_0 - libabseil=20230802.1=cxx17_h59595ed_0 - libaec=1.1.2=h59595ed_1 - libass=0.17.1=h8fe9dca_1 - libavif16=1.0.3=hef5bec9_1 - libblas=3.9.0=16_linux64_mkl - libbrotlicommon=1.1.0=hd590300_1 - libbrotlidec=1.1.0=hd590300_1 - libbrotlienc=1.1.0=hd590300_1 - libcap=2.69=h0f662aa_0 - libcblas=3.9.0=16_linux64_mkl - libclang=15.0.7=default_hb11cfb5_4 - libclang13=15.0.7=default_ha2b6cf4_4 - libcrc32c=1.1.2=h9c3ff4c_0 - libcublas=12.1.0.26=0 - libcufft=11.0.2.4=0 - libcufile=1.8.1.2=0 - libcups=2.3.3=h4637d8d_4 - libcurand=10.3.4.107=0 - libcurl=8.5.0=hca28451_0 - libcusolver=11.4.4.55=0 - libcusparse=12.0.2.55=0 - libdeflate=1.19=hd590300_0 - libdrm=2.4.114=h166bdaf_0 - libedit=3.1.20191231=he28a2e2_2 - libev=4.33=hd590300_2 - libevent=2.1.12=hf998b51_1 - libexpat=2.5.0=hcb278e6_1 - libffi=3.4.2=h7f98852_5 - libflac=1.4.3=h59595ed_0 - libgcc-ng=13.2.0=h807b86a_5 - libgcrypt=1.10.3=hd590300_0 - libgfortran-ng=13.2.0=h69a702a_5 - libgfortran5=13.2.0=ha4646dd_5 - libglib=2.78.3=h783c2da_0 - libgoogle-cloud=2.12.0=h19a6dae_3 - libgpg-error=1.47=h71f35ed_0 - libgrpc=1.58.2=he06187c_0 - libhwloc=2.9.3=default_h554bfaf_1009 - libiconv=1.17=hd590300_2 - libidn2=2.3.7=hd590300_0 - libjpeg-turbo=3.0.0=hd590300_1 - liblapack=3.9.0=16_linux64_mkl - liblapacke=3.9.0=16_linux64_mkl - libleidenalg=0.11.1=h00ab1b0_0 - libllvm14=14.0.6=hcd5def8_4 - libllvm15=15.0.7=hb3ce162_4 - libnghttp2=1.58.0=h47da74e_1 - libnpp=12.0.2.50=0 - libnsl=2.0.1=hd590300_0 - libnvjitlink=12.1.105=0 - libnvjpeg=12.1.1.14=0 - libogg=1.3.4=h7f98852_1 - libopenvino=2023.2.0=h2e90f83_2 - libopenvino-auto-batch-plugin=2023.2.0=h59595ed_2 - libopenvino-auto-plugin=2023.2.0=hd5fc58b_2 - libopenvino-hetero-plugin=2023.2.0=h3ecfda7_2 - libopenvino-intel-cpu-plugin=2023.2.0=h2e90f83_2 - libopenvino-intel-gpu-plugin=2023.2.0=h2e90f83_2 - libopenvino-ir-frontend=2023.2.0=h3ecfda7_2 - libopenvino-onnx-frontend=2023.2.0=hab2db56_2 - libopenvino-paddle-frontend=2023.2.0=hab2db56_2 - libopenvino-pytorch-frontend=2023.2.0=h59595ed_2 - libopenvino-tensorflow-frontend=2023.2.0=h8d4807b_2 - libopenvino-tensorflow-lite-frontend=2023.2.0=h59595ed_2 - libopus=1.3.1=h7f98852_1 - libpciaccess=0.17=h166bdaf_0 - 
libpng=1.6.42=h2797004_0 - libpq=16.2=h33b98f1_0 - libprotobuf=4.24.3=hf27288f_1 - libre2-11=2023.06.02=h7a70373_0 - libsndfile=1.2.2=hc60ed4a_1 - libsodium=1.0.18=h516909a_1 - libsqlite=3.45.1=h2797004_0 - libssh2=1.11.0=h0841786_0 - libstdcxx-ng=13.2.0=h7e041cc_5 - libsystemd0=255=h3516f8a_0 - libtasn1=4.19.0=h166bdaf_0 - libtiff=4.6.0=ha9c0a0a_2 - libunistring=0.9.10=h14c3975_0 - libuuid=2.38.1=h0b41bf4_0 - libva=2.20.0=hd590300_0 - libvorbis=1.3.7=he1b5a44_0 - libvpx=1.13.1=h59595ed_0 - libwebp-base=1.3.2=hd590300_0 - libxcb=1.15=h0b41bf4_0 - libxcrypt=4.4.36=hd590300_1 - libxkbcommon=1.6.0=hd429924_1 - libxml2=2.12.5=h232c23b_0 - libzlib=1.2.13=hd590300_5 - lightning=2.0.9.post0=pyhd8ed1ab_0 - lightning-cloud=0.5.64=pyhd8ed1ab_0 - lightning-utilities=0.10.1=pyhd8ed1ab_0 - llvm-openmp=15.0.7=h0cdce71_0 - llvmlite=0.42.0=py39h174d805_1 - lz4-c=1.9.4=hcb278e6_0 - markdown-it-py=3.0.0=pyhd8ed1ab_0 - markupsafe=2.1.5=py39hd1e30aa_0 - matplotlib=3.8.2=py39hf3d152e_0 - matplotlib-base=3.8.2=py39he9076e7_0 - mdurl=0.1.2=pyhd8ed1ab_0 - mistune=3.0.2=pyhd8ed1ab_0 - mkl=2022.1.0=h84fe81f_915 - mkl-devel=2022.1.0=ha770c72_916 - mkl-include=2022.1.0=h84fe81f_915 - ml-collections=0.1.1=pyhd8ed1ab_0 - ml_dtypes=0.3.2=py39hddac248_0 - more-itertools=10.2.0=pyhd8ed1ab_0 - mpc=1.3.1=hfe3b2da_0 - mpfr=4.2.1=h9458935_0 - mpg123=1.32.4=h59595ed_0 - mpmath=1.3.0=pyhd8ed1ab_0 - msgpack-python=1.0.7=py39h7633fee_0 - mudata=0.2.3=pyhd8ed1ab_0 - multidict=6.0.5=py39hd1e30aa_0 - multipledispatch=0.6.0=py_0 - munkres=1.1.4=pyh9f0ad1d_0 - mysql-common=8.0.33=hf1915f5_6 - mysql-libs=8.0.33=hca2cd23_6 - natsort=8.4.0=pyhd8ed1ab_0 - nbclient=0.8.0=pyhd8ed1ab_0 - nbconvert=7.16.0=pyhd8ed1ab_0 - nbconvert-core=7.16.0=pyhd8ed1ab_0 - nbconvert-pandoc=7.16.0=pyhd8ed1ab_0 - nbformat=5.9.2=pyhd8ed1ab_0 - nccl=2.20.3.1=h6103f9b_0 - ncurses=6.4=h59595ed_2 - nest-asyncio=1.6.0=pyhd8ed1ab_0 - nettle=3.9.1=h7ab15ed_0 - networkx=3.2.1=pyhd8ed1ab_0 - nspr=4.35=h27087fc_0 - nss=3.97=h1d7d5a4_0 - numba=0.59.0=py39h615d6bd_1 - numpy=1.26.4=py39h474f0d3_0 - numpyro=0.13.2=pyhd8ed1ab_1 - ocl-icd=2.3.2=hd590300_0 - ocl-icd-system=1.0.0=1 - openh264=2.4.0=h59595ed_0 - openjpeg=2.5.0=h488ebb8_3 - openpyxl=3.1.2=py39hd1e30aa_1 - openssl=3.2.1=hd590300_0 - opt-einsum=3.3.0=hd8ed1ab_2 - opt_einsum=3.3.0=pyhc1e730c_2 - optax=0.1.9=pyhd8ed1ab_0 - orbax-checkpoint=0.4.4=pyhd8ed1ab_0 - ordered-set=4.1.0=pyhd8ed1ab_0 - orjson=3.9.10=py39h10b2342_0 - p11-kit=0.24.1=hc5aa10d_0 - packaging=23.2=pyhd8ed1ab_0 - pandas=2.2.0=py39hddac248_0 - pandoc=3.1.11.1=ha770c72_0 - pandocfilters=1.5.0=pyhd8ed1ab_0 - pathlib2=2.3.7.post1=py39hf3d152e_3 - patsy=0.5.6=pyhd8ed1ab_0 - pcre2=10.42=hcad00b1_0 - pexpect=4.9.0=pyhd8ed1ab_0 - pillow=10.2.0=py39had0adad_0 - pip=24.0=pyhd8ed1ab_0 - pixman=0.43.2=h59595ed_0 - pkginfo=1.9.6=pyhd8ed1ab_0 - pkgutil-resolve-name=1.3.10=pyhd8ed1ab_1 - platformdirs=3.11.0=pyhd8ed1ab_0 - ply=3.11=py_1 - poetry=1.7.1=linux_pyha804496_0 - poetry-core=1.8.1=pyhd8ed1ab_0 - poetry-plugin-export=1.6.0=pyhd8ed1ab_0 - protobuf=4.24.3=py39h60f6b12_1 - psutil=5.9.8=py39hd1e30aa_0 - pthread-stubs=0.4=h36c2ea0_1001 - ptyprocess=0.7.0=pyhd3deb0d_0 - pugixml=1.14=h59595ed_0 - pulseaudio-client=16.1=hb77b528_5 - pybind11-abi=4=hd8ed1ab_3 - pycparser=2.21=pyhd8ed1ab_0 - pydantic=2.1.1=pyhd8ed1ab_0 - pydantic-core=2.4.0=py39h9fdd4d6_0 - pygments=2.17.2=pyhd8ed1ab_0 - pyjwt=2.8.0=pyhd8ed1ab_1 - pynndescent=0.5.11=pyhca7485f_0 - pyopenssl=24.0.0=pyhd8ed1ab_0 - pyparsing=3.1.1=pyhd8ed1ab_0 - pyproject_hooks=1.0.0=pyhd8ed1ab_0 - 
pyqt=5.15.9=py39h52134e7_5 - pyqt5-sip=12.12.2=py39h3d6467e_5 - pyro-api=0.1.2=pyhd8ed1ab_0 - pyro-ppl=1.8.6=pyhd8ed1ab_0 - pysocks=1.7.1=pyha2e5f31_6 - python=3.9.18=h0755675_1_cpython - python-build=1.0.3=pyhd8ed1ab_0 - python-dateutil=2.8.2=pyhd8ed1ab_0 - python-editor=1.0.4=py_0 - python-fastjsonschema=2.19.1=pyhd8ed1ab_0 - python-flatbuffers=23.5.26=pyhd8ed1ab_0 - python-igraph=0.11.3=py39h007bc96_1 - python-installer=0.7.0=pyhd8ed1ab_0 - python-multipart=0.0.9=pyhd8ed1ab_0 - python-tzdata=2024.1=pyhd8ed1ab_0 - python_abi=3.9=4_cp39 - pytorch=2.1.0=py3.9_cuda12.1_cudnn8.9.2_0 - pytorch-cuda=12.1=ha16c6d3_5 - pytorch-lightning=2.1.3=pyhd8ed1ab_0 - pytorch-mutex=1.0=cuda - pytz=2024.1=pyhd8ed1ab_0 - pyyaml=6.0.1=py39hd1e30aa_1 - pyzmq=25.1.2=py39h8c080ef_0 - qt-main=5.15.8=h450f30e_18 - rapidfuzz=3.6.1=py39h3d6467e_0 - rav1e=0.6.6=he8a937b_2 - re2=2023.06.02=h2873b5e_0 - readchar=4.0.5=pyhd8ed1ab_0 - readline=8.2=h8228510_1 - referencing=0.33.0=pyhd8ed1ab_0 - requests=2.31.0=pyhd8ed1ab_0 - requests-toolbelt=1.0.0=pyhd8ed1ab_0 - rich=13.7.0=pyhd8ed1ab_0 - rpds-py=0.17.1=py39h9fdd4d6_0 - s3fs=0.4.2=py_0 - s3transfer=0.10.0=pyhd8ed1ab_0 - scanpy=1.9.8=pyhd8ed1ab_0 - scikit-learn=1.4.0=py39ha22ef79_0 - scipy=1.12.0=py39h474f0d3_2 - scvi-tools=1.0.4=pyhd8ed1ab_0 - seaborn=0.13.2=hd8ed1ab_0 - seaborn-base=0.13.2=pyhd8ed1ab_0 - secretstorage=3.3.3=py39hf3d152e_2 - session-info=1.0.0=pyhd8ed1ab_0 - setuptools=69.0.3=pyhd8ed1ab_0 - shellingham=1.5.4=pyhd8ed1ab_0 - sip=6.7.12=py39h3d6467e_0 - six=1.16.0=pyh6c4a22f_0 - snappy=1.1.10=h9fff704_0 - sniffio=1.3.0=pyhd8ed1ab_0 - soupsieve=2.5=pyhd8ed1ab_1 - sparse=0.15.1=pyhd8ed1ab_1 - sqlalchemy=2.0.27=py39hd1e30aa_0 - starlette=0.36.3=pyhd8ed1ab_0 - starsessions=1.3.0=pyhd8ed1ab_0 - statsmodels=0.14.1=py39h44dd56e_0 - stdlib-list=0.10.0=pyhd8ed1ab_0 - svt-av1=1.8.0=h59595ed_0 - sympy=1.12=pypyh9d50eac_103 - tbb=2021.11.0=h00ab1b0_1 - tensorstore=0.1.44=py39h5f15eca_3 - texttable=1.7.0=pyhd8ed1ab_0 - threadpoolctl=3.2.0=pyha21a80b_0 - tiledb=2.19.1=h4386cac_0 - tiledb-py=0.25.0=py39h6cb668e_1 - tinycss2=1.2.1=pyhd8ed1ab_0 - tk=8.6.13=noxft_h4845f30_101 - toml=0.10.2=pyhd8ed1ab_0 - tomli=2.0.1=pyhd8ed1ab_0 - tomlkit=0.12.3=pyha770c72_0 - toolz=0.12.1=pyhd8ed1ab_0 - torchaudio=2.1.0=py39_cu121 - torchmetrics=1.2.1=pyhd8ed1ab_0 - torchtriton=2.1.0=py39 - torchvision=0.16.0=py39_cu121 - tornado=6.3.3=py39hd1e30aa_1 - tqdm=4.66.2=pyhd8ed1ab_0 - traitlets=5.14.1=pyhd8ed1ab_0 - trove-classifiers=2024.1.31=pyhd8ed1ab_0 - types-python-dateutil=2.8.19.20240106=pyhd8ed1ab_0 - typing-extensions=4.9.0=hd8ed1ab_0 - typing_extensions=4.9.0=pyha770c72_0 - tzdata=2024a=h0c530f3_0 - umap-learn=0.5.5=py39hf3d152e_1 - unicodedata2=15.1.0=py39hd1e30aa_0 - urllib3=1.26.18=pyhd8ed1ab_0 - uvicorn=0.27.1=py39hf3d152e_0 - virtualenv=20.25.0=pyhd8ed1ab_0 - wcwidth=0.2.13=pyhd8ed1ab_0 - webencodings=0.5.1=pyhd8ed1ab_2 - websocket-client=1.7.0=pyhd8ed1ab_0 - websockets=12.0=py39hd1e30aa_0 - werkzeug=3.0.1=pyhd8ed1ab_0 - wheel=0.42.0=pyhd8ed1ab_0 - x264=1!164.3095=h166bdaf_2 - x265=3.5=h924138e_3 - xarray=2024.1.1=pyhd8ed1ab_0 - xcb-util=0.4.0=hd590300_1 - xcb-util-image=0.4.0=h8ee46fc_1 - xcb-util-keysyms=0.4.0=h8ee46fc_1 - xcb-util-renderutil=0.3.9=hd590300_1 - xcb-util-wm=0.4.1=h8ee46fc_1 - xkeyboard-config=2.41=hd590300_0 - xlrd=1.2.0=pyh9f0ad1d_1 - xmltodict=0.13.0=pyhd8ed1ab_0 - xorg-fixesproto=5.0=h14c3975_1002 - xorg-kbproto=1.0.7=h14c3975_1002 - xorg-libice=1.1.1=hd590300_0 - xorg-libsm=1.2.4=h7391055_0 - xorg-libx11=1.8.7=h8ee46fc_0 - xorg-libxau=1.0.11=hd590300_0 - 
xorg-libxdmcp=1.1.3=h516909a_0 - xorg-libxext=1.3.4=h0b41bf4_2 - xorg-libxfixes=5.0.3=h7f98852_1004 - xorg-libxrender=0.9.11=hd590300_0 - xorg-renderproto=0.11.1=h14c3975_1002 - xorg-xextproto=7.3.0=h0b41bf4_1003 - xorg-xf86vidmodeproto=2.3.1=h516909a_1002 - xorg-xproto=7.0.31=h14c3975_1007 - xz=5.2.6=h166bdaf_0 - yaml=0.2.5=h7f98852_2 - yarl=1.9.4=py39hd1e30aa_0 - zeromq=4.3.5=h59595ed_0 - zipp=3.17.0=pyhd8ed1ab_0 - zlib=1.2.13=hd590300_5 - zstd=1.5.5=hfc55251_0 ```

Micromamba decided to install anndata 0.10.3 instead of 0.10.5, and h5py seems up to date. Perhaps the gget file was created with a different anndata version? I have no control over that ....

Since I decompose the dataframes in the original gget adata files in order to concatenate them into a bigger file, I would have thought that would mitigate the gget adata files possibly having been created with an older anndata version. However, there might be some issues within some of the components of the gget adata file.

Thanks again for your interest and time!

ivirshup commented 6 months ago

First off, I think you may want to use anndata.experimental.concat_on_disk for this task.
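
Roughly, it takes a collection of on-disk AnnData files and writes the concatenated result without loading everything into memory (a sketch with placeholder file names):

from anndata.experimental import concat_on_disk

# hypothetical per-chunk .h5ad files, concatenated along obs into one output file
concat_on_disk(
    ["chunk_0.h5ad", "chunk_1.h5ad", "chunk_2.h5ad"],
    "single_cell_data_concatenated.h5ad",
)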

If you are creating a file manually, then you need to make sure you are creating something that adheres to the specification for the on-disk format.
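
For instance, if you build the file with h5py yourself, you can still let anndata write the individual elements so they match the spec (a sketch with made-up columns; in 0.10 the function lives under anndata.experimental):

import h5py
import pandas as pd
from anndata.experimental import write_elem

obs = pd.DataFrame(
    {"tissue": ["lung", "liver"], "cell_type": ["T cell", "B cell"]},
    index=["cell_0", "cell_1"],
)

with h5py.File("manually_built.h5ad", "a") as f:
    # write_elem stores the dataframe (index, columns, encoding metadata)
    # in the layout that read_h5ad expects to find
    write_elem(f, "obs", obs)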


The downloaded gget adata files (I have checked a few of them) respond like this:

i) Cannot be read with sc.read_hdf()
ii) Can be read with sc.read_h5ad()

This seems expected to me, as they are recent anndata files.

I would note that read_hdf and read_h5ad are different functions and have different expectations about the data.
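
Roughly speaking (file name as a placeholder), read_h5ad expects the full anndata on-disk layout, while read_hdf only pulls a single named HDF5 dataset into .X, using row_names/col_names datasets for the index when present, which is why it works on your hand-built file:

import scanpy as sc

# full anndata layout: X plus the encoded obs/var/uns groups
adata_full = sc.read_h5ad("single_cell_data.h5ad")

# a single dataset pulled into .X; row_names/col_names datasets are used
# for obs_names/var_names when present, everything else is ignored
adata_x_only = sc.read_hdf("single_cell_data.h5ad", key="X")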

iii) Do not have the dtype attribute after using the f = h5py.File(sc_path, 'r') command

print(f['obs'].dtype.descr)

I would only expect this for fairly old anndata files, from around when I started contributing (e.g. 5-ish years ago).

What is interesting about the file you shared is that it has the following keys:

$ h5ls -r single_cell_data_dataset0b.h5ad
/                        Group
/X                       Dataset {91588/Inf, 1330/1389}
/col_names               Dataset {1330}
/obs                     Dataset {91588/Inf, 10}
/obs_names               Dataset {10}
/row_names               Dataset {91588/Inf}
/var                     Dataset {4125/Inf, 4}
/var_names               Dataset {4}

I don't really know what to expect here, since I've never seen a file like this (though there is some old code that uses these key names). I'm very curious what you called to write this.

Your code snippet also doesn't seem to add these keys, so I'm wondering where they came from.

LysSanzMoreta commented 6 months ago

@ivirshup Thanks for the suggestions :). Yeah, I definitely need to do more research on handling these large anndata files ... is anndata.experimental.concat_on_disk new? (since it has the experimental tag) I can have a look at it.

I am new to big anndata files (and docker hehhe)

Sorry about the confusion; to keep it short, I did not write out the entire concatenation loop and skipped the lines where I also add the col_names, row_names, etc. keys. I do exactly the same as for the X key :). I have added a comment.

Regarding the naming of the keys, I might have followed some tutorial somewhere heheh

ivirshup commented 6 months ago

is anndata.experimental.concat_on_disk new? (since it has the experimental tag) I can have a look at it.

Kinda new! From 0.10 on.

Dask also works well with dense chunks so that could be used here as well.
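
As a rough sketch (chunk sizes purely illustrative), wrapping the dense on-disk X in a dask array looks like this:

import dask.array as da
import h5py

f = h5py.File("single_cell_data.h5ad", "r")
X = da.from_array(f["X"], chunks=(10_000, -1))   # lazy view over the HDF5 dataset
print(X.shape)                                   # nothing is read yet
print(X[:100].mean().compute())                  # only the needed chunks are loaded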

I might have followed some tutorial somewhere heheh

Do you know where this is from? I'm hoping not us...

LysSanzMoreta commented 6 months ago

Oh, ok thanks!

hahaha, probably some stackoverflow or so, it could have been old though hehhe

ivirshup commented 6 months ago

Alright, I'm going to close this as I don't think there is any required action on our side. Let me know if you don't think that is the case.