read_h5ad and Python 3.9: all strings are now bytes?

iosonofabio commented 3 years ago

Hi all,

Thanks for the amazing package!

I just updated to Python 3.9, since numba fixed their side last month. Most adata strings (e.g. var_names, obs_names, and all column contents in the respective dataframes) are now parsed as bytes:

import anndata
adata = anndata.read_h5ad(path)
print(adata.var_names)

Index([b'AL627309.1', b'AL627309.3', b'AL669831.5',     b'FAM87B',
        b'LINC00115',     b'FAM41C', b'AL645608.7', b'AL645608.3',
       b'AL645608.5', b'AL645608.1',
       ...
           b'MT-CYB', b'BX004987.1', b'AC145212.1',      b'MAFIP',
       b'AC011043.1', b'AL592183.1', b'AC007325.4', b'AL354822.1',
       b'AC004556.1', b'AC240274.1'],
      dtype='object', name='GeneName', length=22896)

I skimmed through anndata's code and found there is already some fiddling with string encoding, so I suspect something needs fixing there (read_series or thereabout).

Of note, the column names of both adata.var and adata.obs are correctly parsed as strings, not bytes. I'm not sure why that would be; one would expect them to undergo the same treatment as the metadata itself.
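For anyone stuck with bytes labels in the meantime, here is a minimal stopgap sketch. It is not anndata's own code: the helper names decode_if_bytes and decode_labels are made up for illustration, and UTF-8 is assumed as the encoding.

```python
def decode_if_bytes(value, encoding="utf-8"):
    """Return value decoded to str if it is bytes, otherwise unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_labels(labels, encoding="utf-8"):
    """Decode every bytes entry in an iterable of labels (hypothetical helper)."""
    return [decode_if_bytes(label, encoding) for label in labels]


if __name__ == "__main__":
    # A few names copied from the output above, mixed with a plain str.
    raw = [b"AL627309.1", b"FAM87B", "MAFIP"]
    print(decode_labels(raw))
```

On a loaded object this could presumably be applied as adata.var_names = decode_labels(adata.var_names), and likewise for obs_names, but it only hides the symptom rather than fixing the reader.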

Thank you in advance, Fabio

edit: that seems related to this change in pandas 1.2:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

encoding : str, optional

    Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.

    Changed in version 1.2: When encoding is None, errors="replace" is passed to open(). Otherwise, errors="strict" is passed to open(). This behavior was previously only the case for engine="python".

ivirshup commented 3 years ago

Could you give us a bit more information about your environment? Something like the output of

from sinfo import sinfo
sinfo(dependencies=True)

This looks a lot like some issues we'd been having with h5py 3.x vs 2.x, but to the best of my knowledge those had been fixed in recent releases of anndata.

On my machine, in an environment created with:

conda create -yn python3.9 python=3.9
conda activate python3.9
pip install anndata sinfo

>>> import anndata
>>> from sinfo import sinfo
>>> adata = anndata.read_h5ad("./tmp.h5ad")
>>> adata.var_names
Index(['Rp1', 'Sox17', 'Lypla1', 'Gm37988', 'Tcea1', 'Rgs20', 'Atp6v1h',
       'Rb1cc1', '4732440D04Rik', 'St18',
       ...
       'Uty', 'Ddx3y', 'Dcc', 'Gm960', 'Slc22a12', 'Ptgdr2', 'Slit1', 'Sec31b',
       'E330013P04Rik', 'cdh5-Tdtomato'],
      dtype='object', name='index', length=17132)
>>> sinfo(dependencies=True)
-----
anndata     0.7.5
sinfo       0.3.1
-----
anndata             0.7.5
cython_runtime      NA
dateutil            2.8.1
h5py                3.1.0
natsort             7.1.1
numpy               1.20.1
packaging           20.9
pandas              1.2.2
pytz                2021.1
scipy               1.6.0
sinfo               0.3.1
six                 1.15.0
-----
Python 3.9.1 (default, Dec 11 2020, 06:28:49) [Clang 10.0.0 ]
macOS-10.15.7-x86_64-i386-64bit
16 logical CPU cores, i386
-----
Session information updated at 2021-02-10 14:26

iosonofabio commented 3 years ago

Thank you:

import anndata
import sinfo
ModuleNotFoundError: No module named 'sinfo'

Not using conda, if that was your question.

Might this be useful?

anndata.__version__
'0.7.4'

iosonofabio commented 3 years ago

Yep, it's fixed in 0.7.5, closing, thank you.
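In case it helps others check whether they are on an affected release, here is a rough sketch of a version comparison. The helper names release_tuple and is_at_least are made up, and the 0.7.5 threshold is taken only from what worked in this thread, not from a changelog.

```python
import re


def release_tuple(version):
    """Turn '0.7.5' into (0, 7, 5); stops at the first non-numeric piece.

    Pre-release suffixes such as 'rc1' are ignored, so this is only a
    rough comparison, not a full version parser.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum):
    """True if the installed version is >= the minimum (hypothetical helper)."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    print(is_at_least("0.7.4", "0.7.5"))
    print(is_at_least("0.7.6", "0.7.5"))
```

With anndata imported, the check would be is_at_least(anndata.__version__, "0.7.5").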

ivirshup commented 3 years ago

Thanks for the update, glad the issue is fixed!

liuzj039 commented 3 years ago

It seems to have reappeared after upgrading h5py to version 3.3.

ivirshup commented 3 years ago

@liuzj039, I'm not seeing this behavior with h5py 3.3. Could you open a new issue with a replicable example of what you're seeing?

liuzj039 commented 3 years ago

> @liuzj039, I'm not seeing this behavior with h5py 3.3. Could you open a new issue with a replicable example of what you're seeing?

Oh, and when I downgraded my h5py to 3.1 and then upgraded to 3.3, it was fixed. It seems to be caused by my terrible environment. Many thx!
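For anyone else suspecting a mixed-up environment, a small sketch like the following lists which version of each package is installed and where it would be imported from. The function names are made up, and it assumes the import name matches the distribution name, which holds for anndata, h5py and pandas but not for every package.

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return (installed version, import location) for a package, or None for each if missing."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    origin = spec.origin if spec is not None else None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return version, origin


def describe_environment(names=("anndata", "h5py", "pandas")):
    """Collect version and location for each package of interest (hypothetical helper)."""
    return {name: describe_package(name) for name in names}


if __name__ == "__main__":
    for name, (version, origin) in describe_environment().items():
        print(f"{name}: version={version} origin={origin}")
```

A location outside the expected site-packages directory, or a version that differs from what pip reports, would point to a stale or duplicate install.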

orrzor commented 3 years ago

As an aside, I was having a similar issue in a different context, but going from AnnData 0.7.4 to 0.7.6 seems to have fixed it. Thank you.

scverse / anndata

read_h5ad and Python 3.9: all strings are now bytes? #505