stela2502 closed 2 years ago
ARGHH - Would likely help to use the right ToZarr function !!
writer = scarf.LoomToZarr
But now I have hit another bug, one that was only recently fixed in loompy, too:
UnicodeDecodeError Traceback (most recent call last)
Hm, right off the bat it seems there are several possible sources for this. I only looked quickly at the issue you linked to, but you mentioned there that the correct code exists in the repository as opposed to the PyPI version of loompy. Do you also know where in the code that would be? It might help narrow down the possibilities. Also, just to make sure, have you double-checked that your loom file was generated correctly? Are there no actual non-ASCII characters in there?
Have you checked the GitHub link?
It is something about loom being able to save UTF-8 strings, and Python freaks out if it tries to store an 'alpha' sign as ASCII. The fix they made in loompy is around line 98 of normalize.py:
result: np.ndarray = None  # This second clause takes care of attributes stored as variable-length ascii, which can be generated by loomR or Seurat
if np.issubdtype(a.dtype, np.string_) or np.issubdtype(a.dtype, np.object_):
    # First ensure that what we load is valid ascii (i.e. ignore anything outside 7-bit range)
    if hasattr(a, "decode"):  # This takes care of Loom files that store strings as UTF8, which comes in as str and doesn't have a decode method
        temp = np.array([x.decode('ascii', 'ignore') for x in a])
    else:
        temp = a
    # Then unescape XML entities and convert to unicode
    try:
        result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
    except:  # Dirty hack to handle UTF-8 non-break-space in scalar strings. TODO: Rewrite this whole method completely!
        if type(a[0]) == np.bytes_:
            result = [a[0].replace(b'\xc2\xa0', b'')]
Hope this helps.
The non-ASCII characters are in the gene annotation table. I would also be absolutely OK if it silently replaced the offending characters with some other crap sign, as Excel used to do back in the day. But failing is annoying.
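To make the difference between failing and silently substituting concrete, here is a minimal sketch of how Python's built-in decoding error handlers treat a UTF-8 encoded alpha. The gene alias is made up for illustration and is not taken from the actual file; the three helper functions are likewise hypothetical names.

```python
# Hypothetical gene alias containing a Greek alpha, stored as UTF-8 bytes.
raw = "TNF-α".encode("utf-8")


def decode_strict(b: bytes) -> str:
    """Strict ASCII decoding: raises UnicodeDecodeError on any byte > 127."""
    return b.decode("ascii")


def decode_ignore(b: bytes) -> str:
    """Same handler as the loompy snippet above: drop non-ASCII bytes."""
    return b.decode("ascii", "ignore")


def decode_replace(b: bytes) -> str:
    """Alternative: substitute U+FFFD for every undecodable byte."""
    return b.decode("ascii", "replace")


try:
    decode_strict(raw)
except UnicodeDecodeError as err:
    print("strict decoding failed:", err.reason)

print(repr(decode_ignore(raw)))
print(repr(decode_replace(raw)))
```

With `'ignore'` the alpha simply disappears, while `'replace'` leaves a visible placeholder for each undecodable byte, which is closer to the "replace it with some other sign" behaviour asked for here.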
Hi, I have now worked around the problem myself. It is a bloody hack, as I just dropped data until the error went away:
h5 = h5py.File( ifile , mode="r+")
del h5['row_attrs']['Aliases']
h5.close()
'Aliases' was the last attribute I deleted before the file worked.
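Instead of deleting attributes by trial and error, one could first scan for the attributes that actually contain non-ASCII data. The sketch below is only an illustration: `is_ascii` and `find_non_ascii_attrs` are hypothetical helpers, not part of scarf, loompy or h5py, and they work on a plain dict so they can be tried without a loom file.

```python
from typing import Dict, Iterable, List, Union

Value = Union[bytes, str]


def is_ascii(value: Value) -> bool:
    """Return True if a bytes or str value contains only 7-bit ASCII."""
    if isinstance(value, bytes):
        return all(byte < 128 for byte in value)
    return all(ord(ch) < 128 for ch in value)


def find_non_ascii_attrs(attrs: Dict[str, Iterable[Value]]) -> List[str]:
    """Hypothetical helper: names of attributes holding any non-ASCII value."""
    return [
        name
        for name, values in attrs.items()
        if not all(is_ascii(v) for v in values)
    ]


# Made-up stand-in for h5['row_attrs']; the values are not from the real file.
row_attrs = {
    "Gene": [b"TNF", b"IL6"],
    "Aliases": [b"TNF-alpha", "TNF-α".encode("utf-8")],
}
print(find_non_ascii_attrs(row_attrs))
```

With h5py, the same check could presumably be run over `{name: ds[:] for name, ds in h5['row_attrs'].items()}`, assuming every attribute is a one-dimensional string dataset; numeric attributes would have to be skipped first.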
reader = scarf.readers.LoomReader(ifile, cell_names_key='CellID', feature_names_key='Gene')
writer = scarf.LoomToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000),
    assay_name='spliced',
)
writer.dump(batch_size=1000)
Hi @stela2502,
Nice that you were able to find a solution for your data.
Somehow this issue did not crop up in our tests, most likely because of the new naming convention that loompy is using. I will follow this up by trying to convert loom files generated by recent versions of kallisto/loompy to Zarr.
Let's keep this issue open for now until this bug is fixed.
Hi Parasha,
I am not 100% sure that this file is not totally crappy. I might have a lot of duplicates in there. Let's see where I end up here. By the way, this is not really a new version of kallisto/bustools: I used files I created about a year ago.
Closing this for now.
Hi - I am using Scarf version '0.18.2' - the one installed in LS2 in the SingSingCell/1.1 singularity image.
I did this:
And I got this error:
AttributeError Traceback (most recent call last)