Open csgroen opened 3 years ago
I have the same issue. Here is the list of packages that are currently installed.
Package Version
-------------- -------
click 8.0.0
h5py 3.2.1
llvmlite 0.36.0
loompy 3.0.6
numba 0.53.1
numpy 1.20.3
numpy-groupies 0.9.13
pip 21.1.1
scipy 1.6.3
setuptools 56.2.0
wheel 0.36.2
Thanks for any ideas on how to get it to run.
Ok, I think I got it. It should also take care of #141.
After some hours of debugging I realized that the file gencode.v31.metadata.tab, which I downloaded from https://storage.googleapis.com/linnarsson-lab-www-blobs/human_GRCh38_gencode.v31.tar.gz, contains non-ASCII symbols:
[nowoshil@vieccews0302 human_GRCh38_gencode.v31.600]$ grep --color='auto' -P -n "[^\x00-\x7F]" gencode.v31.metadata.tab
33589:ENSG00000175634 ENSG00000175634.15 RPS6KB2 ribosomal protein S6 kinase B2 protein_coding HGNC:10437 chr11 67428460 67435401 protein-coding gene gene with protein product 11q13.2 11q13.2 "p70S6Kb|P70-BETA|STK14B|KLS|S6KB|S6Kbeta|S6Kβ" OTTHUMG00000167673uc001old.4 NM_003952 CCDS41677 Q9UBS0 "9878560|9804755" MGI:1927343 RGD:1305144 RPS6KB2 608939 False
33759:ENSG00000110203 ENSG00000110203.9 FOLR3 folate receptor gamma protein_coding HGNC:3795 chr11 72114869 72139892 protein-coding gene gene with protein product 11q13.4 11q13.4 "FR-G|FRγ" OTTHUMG00000167870 uc031xur.2 NM_000804 CCDS73344 P41439 8110752 FOLR3 602469 False
33764:ENSG00000110195 ENSG00000110195.13 FOLR1 folate receptor alpha protein_coding HGNC:3791 chr11 72189558 72196323 protein-coding gene gene with protein product 11q13.4 11q13.4 FRα OTTHUMG00000167876 uc001osa.3 NM_016725 CCDS8211 P15328 1717147 MGI:95568 RGD:71032 FOLR1 136430 False
33765:ENSG00000165457 ENSG00000165457.14 FOLR2 folate receptor beta protein_coding HGNC:3793 chr11 72216601 72221950 protein-coding gene gene with protein product 11q13.4 11q13.4 FRβ OTTHUMG00000150394 uc001ose.5 NM_000803 CCDS8212 P14207 "7698003|8110752" MGI:95569 RGD:1308515 FOLR2 136425 False
44873:ENSG00000166501 ENSG00000166501.14 PRKCB protein kinase C beta protein_coding HGNC:9395 chr16 23835983 24220611 protein-coding gene gene with protein product 16p12.2-p12.1 16p12.2-p12.1 PKCβ OTTHUMG00000131615 uc002dmd.4 NM_212535 "CCDS10618|CCDS10619" P05771 3658678 MGI:97596 RGD:3396 PRKCB 176970 False
49067:ENSG00000154229 ENSG00000154229.12 PRKCA protein kinase C alpha protein_coding HGNC:9393 chr17 66302613 66810743 protein-coding gene gene with protein product 17q24.2 17q24.2 PKCα OTTHUMG00000179533 uc002jfp.2 NM_002737 CCDS11664 P17252 MGI:97595 RGD:3395 PRKCA 176960 False
52643:ENSG00000105221 ENSG00000105221.17 AKT2 AKT serine/threonine kinase 2 protein_coding HGNC:392 chr19 40230317 40285536 protein-coding gene gene with protein product 19q13.2 19q13.2 PKBβ OTTHUMG00000137375 uc002onf.3 NM_001626 "CCDS12552|CCDS82350" P31751 1409633 MGI:104874 RGD:2082 AKT2 164731 False
53513:ENSG00000126583 ENSG00000126583.11 PRKCG protein kinase C gamma protein_coding HGNC:9402 chr19 53879190 53907652 protein-coding gene gene with protein product 19q13.42 19q13.42 "PKCC|MGC57564|PKCγ" OTTHUMG00000064846 uc002qcq.2NM_002739 CCDS12867 P05129 "8432525|3755548" MGI:97597 RGD:3397 PRKCG 176980 False
58592:ENSG00000089289 ENSG00000089289.16 IGBP1 immunoglobulin binding protein 1 protein_coding HGNC:5461 chrX 70133447 70166324 protein-coding gene gene with protein product Xq13.1 Xq13.1 α4 OTTHUMG00000021767 uc004dxv.4 NM_001370192 CCDS14396 P78318 9441740 MGI:1346500 RGD:62011 IGBP1 300139 False
59609:ENSG00000129675 ENSG00000129675.16 ARHGEF6 Rac/Cdc42 guanine nucleotide exchange factor 6 protein_coding HGNC:685 chrX 136665547 136780932 protein-coding gene gene with protein product Xq26.3 Xq26.3 "alphaPIX|Cool-2|KIAA0006|alpha-PIX|Cool2|αPix" OTTHUMG00000022518 uc004fab.5 NM_004840 "CCDS14660|CCDS78509" Q15052 "7584048|9659915" MGI:1920591 RGD:1359674 ARHGEF6 300267 False
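For anyone without GNU grep's -P option, the same check can be sketched in Python. This is only an illustration: find_non_ascii_lines is a made-up helper name, and it assumes the metadata file is UTF-8 encoded.

```python
import re
from typing import Iterator, Tuple

# Matches any character outside the 7-bit ASCII range, like the grep pattern above.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii_lines(path: str, encoding: str = "utf-8") -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line containing a non-ASCII character.

    Hypothetical helper, not part of loompy. Assumes the file is UTF-8 encoded.
    """
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            if NON_ASCII.search(line):
                yield number, line.rstrip("\n")


# Example usage (the path is whatever file you want to check):
# for number, line in find_non_ascii_lines("gencode.v31.metadata.tab"):
#     print(f"{number}:{line}")
```

Line numbers start at 1, so they should line up with what grep -n reports.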
I played around with the locale settings of my Docker container, but it didn't help much. I ended up patching the file normalize.py as follows:
--- /usr/local/lib/python3.9/site-packages/loompy/normalize.py 2021-05-17 13:00:47.120228000 +0200
+++ /usr/local/lib/python3.9/site-packages/loompy/normalize.py 2021-05-17 13:00:47.120228000 +0200
@@ -95,7 +95,10 @@
         else:
             temp = a
         # Then unescape XML entities and convert to unicode
-        result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+        try:
+            result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+        except:
+            result = np.array([html.unescape(x.decode("utf-8")) for x in temp], dtype=object)
     elif np.issubdtype(a.dtype, np.str_) or np.issubdtype(a.dtype, np.unicode_):
         result = np.array(a.astype(str), dtype=object)
     else:
I'm not sure how x.decode("utf-8") impacts performance, so the modified branch is only executed for the few entries, like those in the lines above, that would otherwise make UnicodeDecoder fail.
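To show what the fallback changes, here is a small standalone sketch without NumPy. It is not loompy's code: decode_annotations is a hypothetical name, and it assumes the byte strings are UTF-8 encoded, as in the metadata file above.

```python
import html
from typing import Iterable, List, Union


def decode_annotations(values: Iterable[Union[bytes, str]]) -> List[str]:
    """Decode byte strings as UTF-8 and unescape XML/HTML entities.

    Hypothetical helper illustrating the idea of the patch, not loompy's API.
    """
    result = []
    for value in values:
        if isinstance(value, bytes):
            # UTF-8 keeps characters such as the Greek beta; decoding as
            # ASCII with errors="ignore" would silently drop them.
            value = value.decode("utf-8")
        result.append(html.unescape(value))
    return result


# The gene alias "PKCβ" stored as UTF-8 bytes, plus an XML-escaped string.
print(decode_annotations([b"PKC\xce\xb2", "A&amp;B"]))
```

Decoding the same bytes with x.decode("ascii", "ignore") would turn "PKCβ" into "PKC", so the two approaches can give different strings for these few entries.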
Sad that this fix has not reached the official package. I had to install from GitHub to get it :-(
pip install git+https://github.com/linnarsson-lab/loompy.git
If you make a pull request, I'm happy to accept it.
The fix is already in your git code, but it is too new to be installed with "pip install loompy", so I likely just need to wait. Until then, an install from git is sufficient to fix the problem. No pull request is necessary any more. But thank you!
Hello,
I've just made an updated conda environment for Python 3.8 and I can no longer read loom files using anndata.read_loom(). It gives me this error (see full traceback below):
Of note: I can read the same file in my Python 3.7 environment, but it prints a message:
Variable names are not unique. To make them unique, call '.var_names_make_unique'.
It's always been like this. After running .var_names_make_unique, it all works out perfectly.
Any idea why UnicodeDecoder is failing? Is there anything I can do?