scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Consider changing detections of strings from `is_string_dtype` #115

Open ivirshup opened 5 years ago

ivirshup commented 5 years ago

@flying-sheep

Just found this pandas issue: https://github.com/pandas-dev/pandas/issues/21653. Looking at the code for `is_string_dtype`, I think it probably shouldn't be relied on: it returns `True` for anything whose `dtype.kind` is `O`, i.e. anything pandas considers an object. For example:

>>> import pandas as pd
>>> from pandas.api.types import is_string_dtype
>>> is_string_dtype(pd.Series([(0,), (1,)]))
True

It's probably not reasonable to assume any column that has objects should be converted to categorical.

Side note: I'm unaware of the rationale for converting all string columns to categorical. Could someone point me towards the reasoning behind this?

flying-sheep commented 5 years ago

I think strings are at least the primary reason for object dtypes to appear. Sadly there’s no variable width unicode string dtype that would make things clearer. But you’re of course right that this detection does not match our assumptions.

The rationale is simply that categoricals take a bit less memory and operations on them are faster when there's a limited number of distinct categories. E.g. an index/id column with a separate value in each row won't profit from it, but a cluster_label column with, say, 5 distinct values does.
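A quick way to see the effect (an illustration with made-up data; the exact numbers will vary):

```python
import numpy as np
import pandas as pd

# A label column with few distinct values shrinks a lot as a categorical,
# while a unique-per-row id column gains nothing, since every category
# still has to be stored once.
labels = pd.Series(np.random.choice(["B cell", "T cell", "NK", "Monocyte", "DC"], 100_000))
ids = pd.Series([f"cell_{i}" for i in range(100_000)])

print(labels.memory_usage(deep=True), labels.astype("category").memory_usage(deep=True))
print(ids.memory_usage(deep=True), ids.astype("category").memory_usage(deep=True))
```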

ivirshup commented 5 years ago

Strings are definitely the primary reason, but the current implementation will have unintended consequences: objects which are "equal" but are not the same object will get modified here.
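To illustrate with a hypothetical `Payload` class (not anything from anndata): after conversion to categorical, equal values collapse onto a single shared category object, so per-instance identity is lost.

```python
import pandas as pd

class Payload:
    """Hypothetical object whose instances compare equal but are distinct."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, Payload) and self.value == other.value
    def __hash__(self):
        return hash(self.value)

s = pd.Series([Payload(1), Payload(1), Payload(2)])
cat = s.astype("category")

print(s.values[0] is s.values[1])      # False: two distinct instances
print(cat.values[0] is cat.values[1])  # True: both map to one shared category object
```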

Just read a couple blog posts about performance benchmarks on this, and it seems pretty convincing. R has just made me very cautious about converting strings to categoricals by default.

LuckyMD commented 5 years ago

When Chuck Norris uses R, stringsAsFactors defaults to False ;)

flying-sheep commented 5 years ago

Yeah it’s true, R factors are fiddly. I’ve oscillated a few times between “strings everywhere” and “factors wherever useful”. Currently I’m in the latter mode, since facet_wrap respects factor level order.

grst commented 4 years ago

I now have a use-case where this becomes an issue: I would like to add a column to obs that contains an object holding T-cell receptor chains for each cell. Unfortunately it gets converted to its string representation when plotting something with scanpy.

ivirshup commented 4 years ago

Would you mind giving an example? Also, do you think this object should be saved to disk?

I hope we can switch over to the dedicated string dtype in dataframes soon, which could make this easier.
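For reference, the dedicated (nullable) string dtype that landed in pandas 1.0 makes real string columns distinguishable from arbitrary object columns:

```python
import pandas as pd

s = pd.Series(["a", "b", None], dtype="string")
print(s.dtype)                         # string
print(pd.Series([(0,), (1,)]).dtype)   # object
```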

grst commented 4 years ago

I am experimenting with storing T cell receptor data (e.g. from 10x VDJ sequencing) in the AnnData object. Ultimately, I want to build a scanpy extension for single-cell TCR data.

The idea is to build something like

import sctcrpy as st
sample1 = st.read_10x("../tests/data/10x/all_contig_annotations.json")


Where TcrCell is a class roughly like this:

class TcrCell:
    """Holds all T-cell receptor chains observed for a single cell."""

    def __init__(self, cell_id: str):
        self._cell_id = cell_id
        self.chains = list()

    def __repr__(self):
        return "TcrCell {} with {} chains".format(self._cell_id, len(self.chains))

    @property
    def cell_id(self):
        return self._cell_id

    def add_chain(self, chain: "TcrChain"):  # TcrChain is defined elsewhere
        self.chains.append(chain)

Upon `_sanitize` the object is converted to its string representation, which kind of takes the fun out of it ;)

Yes, this should totally be saved to h5ad, which already works, btw.

Maybe obs is also just not the right place to store this; if you have any other suggestions, I'm happy to discuss them.

Cheers, Gregor

ivirshup commented 4 years ago

That sounds great!

I'd like to support this kind of thing, but I haven't had a use case yet. I'm surprised to hear that this round-trips to and from disk for you. Could you provide a little example of that? For me, I get strings out:

Example

```python
import anndata
from anndata.tests.helpers import gen_adata


class TcrCell:
    def __init__(self, cell_id: str):
        self._cell_id = cell_id
        self.chains = list()

    def __repr__(self):
        return "TcrCell {} with {} chains".format(self._cell_id, len(self.chains))

    @property
    def cell_id(self):
        return self._cell_id

    def add_chain(self, chain: "TcrChain"):
        self.chains.append(chain)


a = gen_adata((10, 10))
a.obs["tcr"] = [TcrCell(oid) for oid in a.obs_names]
a.write_h5ad("tmp.h5ad")
b = anndata.read_h5ad("./tmp.h5ad")

print(type(a.obs["tcr"].values[0]))  # TcrCell
print(type(b.obs["tcr"].values[0]))  # str
```

My thought was there would need to be an encoded intermediate representation, which we could add a hook for decoding.

We might be able to switch string detection over to `pd.api.types.infer_dtype` now, or to `is_string_array` once pandas 1.0 is out (https://github.com/pandas-dev/pandas/issues/8640#issuecomment-537871897).
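For what it's worth, `infer_dtype` inspects the values rather than just `dtype.kind`, so it wouldn't be fooled by the tuple example above:

```python
import pandas as pd
from pandas.api.types import infer_dtype

print(infer_dtype(pd.Series(["a", "b"])))    # 'string'
print(infer_dtype(pd.Series([(0,), (1,)])))  # 'mixed', not 'string'
```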

LuckyMD commented 4 years ago

This conversation sounds relevant to @davidsebfischer who has been working with TCR data in scanpy.

grst commented 4 years ago

@ivirshup, it indeed does not work... I mistyped a variable name when testing :/

@davidsebfischer, I would be curious to learn about your project and potentially avoid duplicated efforts ;)

ivirshup commented 4 years ago

@grst, have you considered defining a pandas extension type for this?

grst commented 4 years ago

Initially, I had hoped to stick with standard AnnData, such that h5ad files stay compatible with systems on which the TCR library is not installed.

But it might be the way to go! `ExtensionDtype` could be a good option.

Would that already solve the auto-conversion issue? Also, I couldn't find anything on how these types would be stored in a file...

ivirshup commented 4 years ago

> Initially, I had hoped to stick with standard AnnData, such that h5ad files stay compatible with systems on which the TCR library is not installed.

I think this would be possible. It'll take a little work on our end, but it's something I wanted to try.

> Would that already solve the auto-conversion issue?

It could, but I think that would depend on how much information is needed to construct the instances. What I'm imagining is this series being stored on disk as an array of variable-length byte strings. We'd be able to tell what they're meant to be read in as by marking it in the dataset's attributes (like we do with sparse matrices and categoricals). At read time, we would look at the attributes and see if a method to read it has been registered, which your package would do at import time. If we can't find one, we'd just read the array in as an array of byte strings.
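A rough sketch of the write side, just to make the idea concrete (the attribute name, the JSON encoding, and `write_object_series` itself are placeholders, not anything anndata currently does):

```python
import json
import h5py

def write_object_series(group: h5py.Group, name: str, values, encoding_type: str):
    """Store an object column as variable-length strings and tag the dataset
    so a registered reader could reconstruct the objects later."""
    dset = group.create_dataset(
        name,
        data=[json.dumps(v) for v in values],  # placeholder encoding; real objects need their own encoder
        dtype=h5py.string_dtype(encoding="utf-8"),
    )
    dset.attrs["encoding-type"] = encoding_type  # e.g. "tcr-cell"; a reader would look up a decoder by this
    return dset
```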

It could solve the auto conversion issue if we defaulted to a read method like:

import h5py
import pandas as pd

def read_series(dset: h5py.Dataset) -> pd.Series:
    # use the dtype recorded in the dataset's attributes if pandas recognises it,
    # otherwise fall back to the raw values
    try:
        return pd.Series(dset[...], name=dset.name).astype(dset.attrs["dtype"])
    except (KeyError, TypeError):
        return pd.Series(dset[...], name=dset.name)

This relies on the pandas dtype registry for the conversion.


Of course, this is all a fair amount of work, and may not be worth it. I see how this could be useful for something like genomic ranges, but I'm not that familiar with what can be done in VDJ analysis. What sort of computations would this allow?

davidsebfischer commented 4 years ago

Hi all, there is a bunch of extra preprocessing methods associated with TCR / BCR data. I solved this via a custom class that sits on top of anndata in some code that I'll make public soon. If you want to include TCR input data in anndata, I would recommend writing a new reading function (`.from_10x_vdj` or so) that collects everything from the cellranger output and yields an initial anndata object that people can then manipulate. The output of this could just be additional columns in `.obs`, so you can stick to the data structure that is already there, in my opinion. There also shouldn't be issues with strings in `.obs`.
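To make that concrete, a minimal sketch of such a reader might look roughly like this (column names follow the cellranger contig annotation CSV; `from_10x_vdj` here is an illustration, not a worked-out implementation):

```python
import pandas as pd
from anndata import AnnData

def from_10x_vdj(contig_csv: str) -> AnnData:
    """Collapse cellranger contig annotations to one row per cell barcode,
    keeping chain-level fields as plain .obs columns."""
    contigs = pd.read_csv(contig_csv)
    # keep productive, high-confidence chains only
    contigs = contigs[contigs["high_confidence"] & (contigs["productive"].astype(str) == "True")]
    # number the chains within each cell so they can become separate columns
    contigs["chain_rank"] = contigs.groupby("barcode").cumcount() + 1
    wide = contigs.pivot(index="barcode", columns="chain_rank",
                         values=["chain", "v_gene", "j_gene", "cdr3"])
    wide.columns = [f"{field}_{rank}" for field, rank in wide.columns]
    return AnnData(obs=wide)
```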

grst commented 4 years ago

Hi @davidsebfischer,

thanks for joining in!

> Hi all, there is a bunch of extra preprocessing methods associated with TCR / BCR data.

Would you mind providing some more details about these preprocessing steps? Is it something you implemented, or are you referring to publicly available methods?

> If you want to include TCR input data in anndata, I would recommend writing a new reading function (`.from_10x_vdj` or so) that collects everything from the cellranger output and yields an initial anndata object that people can then manipulate. The output of this could just be additional columns in `.obs`, so you can stick to the data structure that is already there, in my opinion.

Storing all information in columns is something we also considered, and it is probably the way to go for a first version. It also plays nicely with the scanpy plotting functions. However, some cells appear to have more TCR chains than expected, and I was wondering about a data structure flexible enough to accommodate an arbitrary number of chains per cell.

Cheers, Gregor

cc @szabogtamas