parashardhapola / scarf

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
http://scarf.readthedocs.io
BSD 3-Clause "New" or "Revised" License

loom file from kallisto/bustools import error: #61

Closed · stela2502 closed this issue 2 years ago

stela2502 commented 2 years ago

Hi - I am using Scarf version '0.18.2' - the one installed in LS2 in the SingSingCell/1.1 singularity image.

I did this:

reader = scarf.readers.LoomReader(ifile)
writer = scarf.CrToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000)
)
writer.dump(batch_size=1000)

And I got this error:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 writer = scarf.CrToZarr(
      2     reader,
      3     zarr_fn='scarf_datasets/HongzheAndPavan/data.zarr',
      4     chunk_size=(2000, 1000)
      5 )

/usr/local/lib/python3.8/dist-packages/scarf/writers.py in __init__(self, cr, zarr_fn, chunk_size, dtype)
    184         self.z = zarr.open(self.fn, mode="w")
    185         self._ini_cell_data()
--> 186         for assay_name in self.cr.assayFeats.columns:
    187             create_zarr_count_assay(
    188                 self.z,

AttributeError: 'LoomReader' object has no attribute 'assayFeats'

After some research I assume that the CrToZarr class does not handle scarf.readers.LoomReader correctly - is that correct, or is my loom file the problem? I'll look into it some more.
stela2502 commented 2 years ago

ARGHH - it would likely help to use the right ToZarr function!!

writer = scarf.LoomToZarr
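
Spelled out, a hedged sketch of the intended pairing (argument names follow what is used elsewhere in this thread; the exact signatures are an assumption, not confirmed against the scarf docs):

import scarf

# A loom file goes through LoomReader + LoomToZarr; CrToZarr expects a
# Cell Ranger-style reader, which is why it looked for `assayFeats` above.
reader = scarf.readers.LoomReader(ifile)        # `ifile` is the loom file path
writer = scarf.LoomToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000)
)
writer.dump(batch_size=1000)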

But now I have found another bug, one that was only recently fixed in loompy as well:


UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input> in <module>
----> 1 writer = scarf.LoomToZarr(
      2     reader,
      3     zarr_fn='scarf_datasets/HongzheAndPavan/data.zarr',
      4     chunk_size=(2000, 1000)
      5 )

/usr/local/lib/python3.8/dist-packages/scarf/writers.py in __init__(self, loom, zarr_fn, assay_name, chunk_size)
    501         )
    502         for i, j in self.loom.get_feature_attrs():
--> 503             create_zarr_obj_array(self.z[self.assayName]["featureData"], i, j, j.dtype)
    504
    505     def _ini_cell_data(self):

/usr/local/lib/python3.8/dist-packages/scarf/writers.py in create_zarr_obj_array(g, name, data, dtype, overwrite)
    111         data = data.astype("U")
    112         dtype = data.dtype
--> 113     return g.create_dataset(
    114         name,
    115         data=data,

/usr/local/lib/python3.8/dist-packages/zarr/hierarchy.py in create_dataset(self, name, **kwargs)
    804         """
    805
--> 806         return self._write_op(self._create_dataset_nosync, name, **kwargs)
    807
    808     def _create_dataset_nosync(self, name, data=None, **kwargs):

/usr/local/lib/python3.8/dist-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
    659
    660         with lock:
--> 661             return f(*args, **kwargs)
    662
    663     def create_group(self, name, overwrite=False):

/usr/local/lib/python3.8/dist-packages/zarr/hierarchy.py in _create_dataset_nosync(self, name, data, **kwargs)
    820
    821         else:
--> 822             a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
    823                       **kwargs)
    824

/usr/local/lib/python3.8/dist-packages/zarr/creation.py in array(data, **kwargs)
    355
    356     # fill with data
--> 357     z[...] = data
    358
    359     # set read_only property afterwards

/usr/local/lib/python3.8/dist-packages/zarr/core.py in __setitem__(self, selection, value)
   1119
   1120         fields, selection = pop_fields(selection)
-> 1121         self.set_basic_selection(selection, value, fields=fields)
   1122
   1123     def set_basic_selection(self, selection, value, fields=None):

/usr/local/lib/python3.8/dist-packages/zarr/core.py in set_basic_selection(self, selection, value, fields)
   1214             return self._set_basic_selection_zd(selection, value, fields=fields)
   1215         else:
-> 1216             return self._set_basic_selection_nd(selection, value, fields=fields)
   1217
   1218     def set_orthogonal_selection(self, selection, value, fields=None):

/usr/local/lib/python3.8/dist-packages/zarr/core.py in _set_basic_selection_nd(self, selection, value, fields)
   1505         indexer = BasicIndexer(selection, self)
   1506
-> 1507         self._set_selection(indexer, value, fields=fields)
   1508
   1509     def _set_selection(self, indexer, value, fields=None):

/usr/local/lib/python3.8/dist-packages/zarr/core.py in _set_selection(self, indexer, value, fields)
   1554
   1555             # put data
-> 1556             self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
   1557
   1558     def _process_chunk(self, out, cdata, chunk_selection, drop_axes,

/usr/local/lib/python3.8/dist-packages/zarr/core.py in _chunk_setitem(self, chunk_coords, chunk_selection, value, fields)
   1701
   1702         with lock:
-> 1703             self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
   1704                                        fields=fields)
   1705

/usr/local/lib/python3.8/dist-packages/zarr/core.py in _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields)
   1760             chunk[fields][chunk_selection] = value
   1761         else:
-> 1762             chunk[chunk_selection] = value
   1763
   1764         # encode chunk

UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 45: ordinal not in range(128)
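
For context: byte 0xce is the first byte of a UTF-8 encoded Greek letter such as α, which commonly appears in gene annotation strings. A minimal sketch of the decode failure, independent of scarf and loompy (the example value "TNF-α" is purely hypothetical):

# Minimal reproduction of the decode failure, independent of scarf/loompy.
# 0xce 0xb1 is the UTF-8 encoding of the Greek letter alpha.
raw = "TNF-α".encode("utf-8")          # b'TNF-\xce\xb1'
print(raw.decode("utf-8"))             # fine: 'TNF-α'
print(raw.decode("ascii", "ignore"))   # fine, silently drops the non-ASCII bytes: 'TNF-'
raw.decode("ascii")                    # raises UnicodeDecodeError: can't decode byte 0xce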
stela2502 commented 2 years ago

https://github.com/linnarsson-lab/loompy/issues/149

razofz commented 2 years ago

Hm, right off the bat it seems there is a range of sources this could arise from. I only looked quickly at the issue you linked, but you mentioned there that the fix already exists in the loompy repository, as opposed to the PyPI release; do you also know where in the code that would be? It might help narrow down the possibilities. Also, just to make sure: have you double-checked that your loom file was generated correctly, i.e. that there are no actual non-ASCII characters in there?
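
One quick way to check for that, as a minimal sketch: assuming the usual loom HDF5 layout with string datasets under row_attrs (the group name and the h5py approach are assumptions based on this thread, not something scarf or loompy prescribe):

import h5py

# Scan a loom file's row attributes for values containing non-ASCII bytes.
with h5py.File(ifile, mode="r") as h5:            # `ifile` is the loom file path
    for name, dset in h5["row_attrs"].items():
        values = dset[:]
        if values.dtype.kind not in ("S", "O", "U"):
            continue                               # skip numeric attributes
        for i, v in enumerate(values):
            raw = v if isinstance(v, bytes) else str(v).encode("utf-8")
            try:
                raw.decode("ascii")
            except UnicodeDecodeError:
                print(f"non-ASCII value in row_attrs/{name}, row {i}: {raw!r}")
                break                              # report the first hit per attribute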

stela2502 commented 2 years ago

Have you checked the GitHub link?

It is something about loom files being able to store UTF-8 strings, while Python freaks out when it tries to handle an 'alpha' sign as ASCII. The fix they did in loompy is around line 98 of normalize.py:

result: np.ndarray = None  # This second clause takes care of attributes stored as variable-length ascii, which can be generated by loomR or Seurat
if np.issubdtype(a.dtype, np.string_) or np.issubdtype(a.dtype, np.object_):
    # First ensure that what we load is valid ascii (i.e. ignore anything outside 7-bit range)
    if hasattr(a, "decode"):  # This takes care of Loom files that store strings as UTF8, which comes in as str and doesn't have a decode method
        temp = np.array([x.decode('ascii', 'ignore') for x in a])
    else:
        temp = a
    # Then unescape XML entities and convert to unicode
    try:
        result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
    except:  # Dirty hack to handle UTF-8 non-break-space in scalar strings. TODO: Rewrite this whole method completely!
        if type(a[0]) == np.bytes_:
            result = a[0].replace(b'\xc2\xa0', b'')

Hope this helps.

stela2502 commented 2 years ago

The non-ASCII characters are in the gene annotation table. I would also be perfectly fine if it silently replaced the offending string with some other junk character, the way Excel used to back in the day. But failing outright is annoying.
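
For what it's worth, a minimal sketch of that kind of lossy workaround: sanitizing a string row attribute in place instead of deleting it. The attribute name 'Aliases' and the row_attrs layout are assumptions based on this thread; since this rewrites data, run it on a copy of the loom file:

import h5py
import numpy as np

# Lossily sanitize one string row attribute by dropping non-ASCII bytes,
# instead of deleting the whole attribute. 'Aliases' is an assumed name.
with h5py.File(ifile, mode="r+") as h5:
    path = "row_attrs/Aliases"
    values = h5[path][:]
    cleaned = np.array(
        [
            (v if isinstance(v, bytes) else str(v).encode("utf-8"))
            .decode("ascii", "ignore")   # drop anything outside the 7-bit range
            for v in values
        ],
        dtype=object,
    )
    del h5[path]
    h5.create_dataset(path, data=cleaned.astype("S"))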

stela2502 commented 2 years ago

Hi, I have now fixed the problem myself - a bloody hack, as I just dropped data until the issue went away:

import h5py

# Drop the offending row attribute from the loom file, in place.
h5 = h5py.File(ifile, mode="r+")
del h5['row_attrs']['Aliases']
h5.close()

'Aliases' was the last attribute I deleted before the file worked.

reader = scarf.readers.LoomReader(ifile, cell_names_key='CellID', feature_names_key="Gene")

writer = scarf.LoomToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000),
    assay_name="spliced"
)
writer.dump(batch_size=1000)

parashardhapola commented 2 years ago

Hi @stela2502,

Nice that you were able to find a solution for your data.

Somehow this issue did not crop up in our tests, most likely because of the new naming convention that loompy uses. I will follow up by trying the loom-to-Zarr conversion with files generated by recent versions of kallisto/loompy.

Let's keep this issue open for now until this bug is fixed.

stela2502 commented 2 years ago

Hi Parashar,

I am not 100% sure that this file is not totally crappy; I might have a lot of duplicates in there. Let's see where I end up. By the way, these are not really from a new version of kallisto/bustools; I used files I created about a year ago.

parashardhapola commented 2 years ago

Closing this for now.