linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

[python 3.8] UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 45: ordinal not in range(128) #149

Open csgroen opened 3 years ago

csgroen commented 3 years ago

Hello,

I've just made an updated conda environment for python 3.8 and I can't read loom files using anndata.read_loom() anymore. It gives me this error (see full traceback below):

Traceback (most recent call last):

  File "<ipython-input-2-b0b79aae2f29>", line 1, in <module>
    adata = anndata.read_loom('/home/clarice/Documents/SingleCell_PseudoTime/data/CHLA9.loom')

  File "/home/clarice/.local/lib/python3.8/site-packages/anndata/_io/read.py", line 225, in read_loom
    var = dict(lc.row_attrs)

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/attribute_manager.py", line 102, in __getitem__
    return self.__getattr__(thing)

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/attribute_manager.py", line 119, in __getattr__
    vals = loompy.materialize_attr_values(self.ds._file[a][name][:])

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/normalize.py", line 98, in materialize_attr_values
    result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 45: ordinal not in range(128)

Of note: I can read the same file in my python3.7 environment, but it prints a message:

Variable names are not unique. To make them unique, call '.var_names_make_unique'.

It's always been like this. After running .var_names_make_unique, it all works out perfectly.

Any idea why this UnicodeDecodeError is raised? Is there anything I can do?

SergejN commented 3 years ago

I have the same issue. Here is the list of packages that are currently installed.

Package        Version
-------------- -------
click          8.0.0
h5py           3.2.1
llvmlite       0.36.0
loompy         3.0.6
numba          0.53.1
numpy          1.20.3
numpy-groupies 0.9.13
pip            21.1.1
scipy          1.6.3
setuptools     56.2.0
wheel          0.36.2

Thanks for any ideas on how to get it to run.

SergejN commented 3 years ago

Ok, I think I got it. It should also take care of #141. After some hours of debugging I realized that the file gencode.v31.metadata.tab, which I downloaded from https://storage.googleapis.com/linnarsson-lab-www-blobs/human_GRCh38_gencode.v31.tar.gz contains non-ASCII symbols:

[nowoshil@vieccews0302 human_GRCh38_gencode.v31.600]$ grep --color='auto' -P -n "[^\x00-\x7F]" gencode.v31.metadata.tab
33589:ENSG00000175634   ENSG00000175634.15      RPS6KB2 ribosomal protein S6 kinase B2  protein_coding  HGNC:10437      chr11   67428460        67435401   protein-coding gene     gene with protein product       11q13.2 11q13.2 "p70S6Kb|P70-BETA|STK14B|KLS|S6KB|S6Kbeta|S6Kβ" OTTHUMG00000167673uc001old.4       NM_003952       CCDS41677       Q9UBS0  "9878560|9804755"       MGI:1927343     RGD:1305144     RPS6KB2 608939          False
33759:ENSG00000110203   ENSG00000110203.9       FOLR3   folate receptor gamma   protein_coding  HGNC:3795       chr11   72114869        72139892  protein-coding gene      gene with protein product       11q13.4 11q13.4 "FR-G|FRγ"      OTTHUMG00000167870      uc031xur.2      NM_000804       CCDS73344  P41439  8110752                 FOLR3   602469          False
33764:ENSG00000110195   ENSG00000110195.13      FOLR1   folate receptor alpha   protein_coding  HGNC:3791       chr11   72189558        72196323  protein-coding gene      gene with protein product       11q13.4 11q13.4 FRα     OTTHUMG00000167876      uc001osa.3      NM_016725       CCDS8211  P15328   1717147 MGI:95568       RGD:71032       FOLR1   136430          False
33765:ENSG00000165457   ENSG00000165457.14      FOLR2   folate receptor beta    protein_coding  HGNC:3793       chr11   72216601        72221950  protein-coding gene      gene with protein product       11q13.4 11q13.4 FRβ     OTTHUMG00000150394      uc001ose.5      NM_000803       CCDS8212  P14207   "7698003|8110752"       MGI:95569       RGD:1308515     FOLR2   136425          False
44873:ENSG00000166501   ENSG00000166501.14      PRKCB   protein kinase C beta   protein_coding  HGNC:9395       chr16   23835983        24220611  protein-coding gene      gene with protein product       16p12.2-p12.1   16p12.2-p12.1   PKCβ    OTTHUMG00000131615      uc002dmd.4      NM_212535 "CCDS10618|CCDS10619"    P05771  3658678 MGI:97596       RGD:3396        PRKCB   176970          False
49067:ENSG00000154229   ENSG00000154229.12      PRKCA   protein kinase C alpha  protein_coding  HGNC:9393       chr17   66302613        66810743  protein-coding gene      gene with protein product       17q24.2 17q24.2 PKCα    OTTHUMG00000179533      uc002jfp.2      NM_002737       CCDS11664 P17252           MGI:97595       RGD:3395        PRKCA   176960          False
52643:ENSG00000105221   ENSG00000105221.17      AKT2    AKT serine/threonine kinase 2   protein_coding  HGNC:392        chr19   40230317        40285536   protein-coding gene     gene with protein product       19q13.2 19q13.2 PKBβ    OTTHUMG00000137375      uc002onf.3      NM_001626       "CCDS12552|CCDS82350"      P31751  1409633 MGI:104874      RGD:2082        AKT2    164731          False
53513:ENSG00000126583   ENSG00000126583.11      PRKCG   protein kinase C gamma  protein_coding  HGNC:9402       chr19   53879190        53907652  protein-coding gene      gene with protein product       19q13.42        19q13.42        "PKCC|MGC57564|PKCγ"    OTTHUMG00000064846      uc002qcq.2NM_002739        CCDS12867       P05129  "8432525|3755548"       MGI:97597       RGD:3397        PRKCG   176980          False
58592:ENSG00000089289   ENSG00000089289.16      IGBP1   immunoglobulin binding protein 1        protein_coding  HGNC:5461       chrX    70133447  70166324 protein-coding gene     gene with protein product       Xq13.1  Xq13.1  α4      OTTHUMG00000021767      uc004dxv.4      NM_001370192    CCDS14396  P78318  9441740 MGI:1346500     RGD:62011       IGBP1   300139          False
59609:ENSG00000129675   ENSG00000129675.16      ARHGEF6 Rac/Cdc42 guanine nucleotide exchange factor 6  protein_coding  HGNC:685        chrX    136665547  136780932       protein-coding gene     gene with protein product       Xq26.3  Xq26.3  "alphaPIX|Cool-2|KIAA0006|alpha-PIX|Cool2|αPix" OTTHUMG00000022518 uc004fab.5      NM_004840       "CCDS14660|CCDS78509"   Q15052  "7584048|9659915"       MGI:1920591     RGD:1359674     ARHGEF6 300267             False
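
The same check can be sketched in Python, which also shows why the error message names byte 0xce: in UTF-8 the Greek letters in these aliases are two-byte sequences that start with 0xce. The sample rows and the helper names `non_ascii_lines` and `ascii_decode_error` below are illustrative assumptions, not part of loompy or of the metadata file.

```python
# Illustrative rows only; the real file is tab-separated and much larger.
SAMPLE = [
    "FOLR1\tfolate receptor alpha\tFRα",
    "EXAMPLE\tplain ASCII row (made up)\tEX1",
    "PRKCB\tprotein kinase C beta\tPKCβ",
]


def non_ascii_lines(lines):
    """Yield (line_number, line) for each line with a non-ASCII character."""
    for number, line in enumerate(lines, start=1):
        if not line.isascii():
            yield number, line


def ascii_decode_error(raw):
    """Return the UnicodeDecodeError raised by ASCII-decoding raw, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


hits = list(non_ascii_lines(SAMPLE))

# UTF-8 encodes the Greek alpha as the two bytes 0xce 0xb1.
alpha_bytes = "α".encode("utf-8")

# Decoding those bytes as ASCII fails on the 0xce lead byte.
error = ascii_decode_error("FRα".encode("utf-8"))

for number, line in hits:
    print(number, line)
print(alpha_bytes, error)
```

To scan the real file, the lines of `open(path, encoding="utf-8")` could be passed to `non_ascii_lines` instead of `SAMPLE`.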

I played around with the locale settings of my Docker container, but that didn't help much. I ended up patching the file normalize.py as follows:

--- /usr/local/lib/python3.9/site-packages/loompy/normalize.py  2021-05-17 13:00:47.120228000 +0200
+++ /usr/local/lib/python3.9/site-packages/loompy/normalize.py  2021-05-17 13:00:47.120228000 +0200
@@ -95,7 +95,10 @@
                else:
                        temp = a
                # Then unescape XML entities and convert to unicode
-               result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+               try:
+                       result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+               except:
+                       result = np.array([html.unescape(x.decode("utf-8")) for x in temp], dtype=object)
        elif np.issubdtype(a.dtype, np.str_) or np.issubdtype(a.dtype, np.unicode_):
                result = np.array(a.astype(str), dtype=object)
        else:

I'm not sure how x.decode("utf-8") affects performance, so the fallback branch is only executed when the original conversion raises an exception, that is, for attribute arrays containing entries like the few lines above that would otherwise trigger the UnicodeDecodeError.
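
The fallback idea can be sketched in plain Python, without NumPy. This is not loompy's code: `materialize_strings` is a hypothetical name, and the explicit ASCII decode stands in for `temp.astype(str)`, which fails with an ASCII codec error in the traceback above. Unlike the patch, the sketch catches only `UnicodeDecodeError` rather than using a bare `except`, so unrelated errors are not silently swallowed.

```python
import html


def materialize_strings(raw_values):
    """Decode byte strings and unescape XML/HTML entities.

    Tries the cheap ASCII path first. If any value is not ASCII,
    the whole batch is decoded again as UTF-8.
    """
    try:
        return [html.unescape(x.decode("ascii")) for x in raw_values]
    except UnicodeDecodeError:
        # Fallback path: only reached when at least one value is non-ASCII.
        return [html.unescape(x.decode("utf-8")) for x in raw_values]


# Sample values (assumed for illustration): an escaped entity,
# raw UTF-8 bytes, and an escaped ampersand.
values = [b"PKC&alpha;", "PKCβ".encode("utf-8"), b"A &amp; B"]
decoded = materialize_strings(values)
print(decoded)
```

Because the `try` wraps the whole list, a single non-ASCII value sends every value in that batch through the UTF-8 path, while batches that are pure ASCII never pay for the fallback.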

stela2502 commented 2 years ago

Sad that this fix has not reached the official package. I had to install from github to get this fix :-(

pip install git+https://github.com/linnarsson-lab/loompy.git

slinnarsson commented 2 years ago

If you make a pull request, I'm happy to accept it.

stela2502 commented 2 years ago

The fix is already in your git code, but it is too new to be installed with "pip install loompy". So I likely just need to wait for the next release. Until then, an install from git is sufficient to fix the problem. No pull request is necessary any more. But thank you!
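
To see which loompy build an environment actually has, a small check along these lines may help. It uses only the standard library (`importlib.metadata`, available from Python 3.8); `installed_version` is a hypothetical helper name. The thread does not say which release includes the fix, so the sketch only reports the version and does not compare it against anything.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed loompy version, or None if loompy is not installed.
print(installed_version("loompy"))
```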