Open ivirshup opened 5 years ago
I think strings are at least the primary reason for object dtypes to appear. Sadly there’s no variable width unicode string dtype that would make things clearer. But you’re of course right that this detection does not match our assumptions.
The rationale is simply that it takes a bit less memory and operations are faster if there’s a limited number of distinct categories. E.g. an index/id column with a separate value in each row won’t profit from it, but a cluster_label column with, say, 5 distinct values does.
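The memory argument can be sketched with pandas (a hypothetical illustration, not code from this thread):

```python
import pandas as pd

# Hypothetical illustration: a column with few distinct values shrinks a
# lot when converted to categorical, while a unique-per-row id column
# sees little or no benefit.
n = 10_000
cluster_label = pd.Series(["B cell", "T cell", "NK", "Mono", "DC"] * (n // 5))
cell_id = pd.Series([f"cell_{i}" for i in range(n)])

label_obj = cluster_label.memory_usage(deep=True)
label_cat = cluster_label.astype("category").memory_usage(deep=True)
id_obj = cell_id.memory_usage(deep=True)
id_cat = cell_id.astype("category").memory_usage(deep=True)

print(label_cat < label_obj)  # big saving with only 5 categories
print(id_cat, id_obj)         # roughly comparable for all-unique values
```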
Definitely, strings are the primary reason, but the current implementation will have unintended consequences: objects which are "equal", but are not the same object, will get modified here.
Just read a couple blog posts about performance benchmarks on this, and it seems pretty convincing. R
has just made me very cautious about converting strings to categoricals by default.
When Chuck Norris uses R, stringsAsFactors defaults to FALSE ;)
Yeah it’s true, R factors are fiddly. I’ve oscillated a few times between “strings everywhere” and “factors wherever useful”. Currently I’m in the latter mode, since facet_wrap
respects factor level order.
I now have a use-case where this becomes an issue:
I would like to add a column to obs
that contains an object holding T-cell receptor chains for each cell. Unfortunately it gets converted to its string representation when plotting something with scanpy.
Would you mind giving an example? Also, do you think this object should be saved to disk?
I hope we can switch over to the dedicated string dtype in dataframes soon, which could make this easier.
I am experimenting with storing T cell receptor data from, e.g. 10x VDJ sequencing in the AnnData object. Ultimately, I want to build a scanpy extension for single cell TCR data.
The idea is to build something like

```python
import sctcrpy as st

sample1 = st.read_10x("../tests/data/10x/all_contig_annotations.json")
```
Where TcrCell
is a class roughly like this:
```python
class TcrCell:
    def __init__(self, cell_id: str):
        self._cell_id = cell_id
        self.chains = list()

    def __repr__(self):
        return "TcrCell {} with {} chains".format(self._cell_id, len(self.chains))

    @property
    def cell_id(self):
        return self._cell_id

    def add_chain(self, chain: TcrChain):
        self.chains.append(chain)
```
Upon _sanitize
the object is converted to its string representation, which kind of takes the fun out of it ;)
Yes, this should totally be saved to h5ad
, which already works, btw.
Maybe obs is also just not the right place to store this; if you have any other suggestions I'm happy to discuss them.
Cheers, Gregor
That sounds great!
I'd like to support this kind of thing, but I haven't had a use case yet. I'm surprised to hear that round-tripping this to and from disk works for you. Could you provide a little example of this? For me, I get strings out:
My thought was there would need to be an encoded intermediate representation, which we could add a hook for decoding.
We might be able to switch string detection over to pd.api.types.infer_dtype
now, or is_string_array
once pandas 1.0 is out (https://github.com/pandas-dev/pandas/issues/8640#issuecomment-537871897).
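For illustration, a quick sketch of why infer_dtype would be an improvement (my example, not from the thread):

```python
import pandas as pd

# infer_dtype looks at the actual values, so it can tell genuine string
# columns apart from columns of arbitrary Python objects, unlike a
# plain dtype.kind == "O" check.
strings = pd.Series(["a", "b", "c"])
objects = pd.Series([{"x": 1}, [2, 3], object()])

print(pd.api.types.infer_dtype(strings))  # "string"
print(pd.api.types.infer_dtype(objects))  # "mixed"
```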
This conversation sounds relevant to @davidsebfischer who has been working with TCR data in scanpy.
@ivirshup, it indeed does not work... I mistyped a variable name when testing :/ @davidsebfischer, I would be curious to learn about your project and potentially avoid duplicated efforts ;)
@grst, have you considered defining a pandas extension type for this?
Initially, I had hoped to stick with standard AnnData, s.t. h5ad files stay compatible with systems on which the tcr-library is not installed.
But it might be the way to go! ExtensionDtype could be a good option.
Would that already solve the auto-conversion issue? And I couldn't find anything on how these types would be stored in a file....
Initially, I had hoped to stick with standard AnnData, s.t. h5ad files stay compatible with systems on which the tcr-library is not installed.
I think this would be possible. It'll take a little work on our end, but it's something I wanted to try.
Would that already solve the auto-conversion issue?
It could, but I think that would depend on how much information is needed to construct the instances. What I'm imagining is this series being stored on disk as an array of variable length byte strings. We'd be able to tell what they're meant to be read in as by marking it in the series' datasets' attributes (like we do with sparse matrices and categoricals). At read time, we would look at the attributes and see if a method to read it has been registered – which your package would do at import time. If we can't find one, we'd just read in the array as an array of byte-strings.
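A minimal sketch of that registration scheme (the names register_reader, read_element, and the "encoding-type" attribute are my own assumptions here, not AnnData's actual API):

```python
from typing import Callable, Dict

import numpy as np

# Hypothetical registry: packages register decoders for encoding types
# at import time; unknown types fall back to the raw byte strings.
_READERS: Dict[str, Callable] = {}

def register_reader(encoding_type: str):
    """Register a decoder for elements marked with this encoding type."""
    def decorator(func: Callable) -> Callable:
        _READERS[encoding_type] = func
        return func
    return decorator

def read_element(raw: np.ndarray, attrs: dict):
    """Decode raw data via a registered reader; fall back to the raw array."""
    reader = _READERS.get(attrs.get("encoding-type"))
    if reader is None:
        return raw  # no reader registered: keep the byte strings as-is
    return reader(raw)

# A downstream package (e.g. a TCR library) would register its decoder
# at import time:
@register_reader("tcr-cell")
def _read_tcr(raw: np.ndarray) -> list:
    return [b.decode() for b in raw]
```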
It could solve the auto-conversion issue if we defaulted to a read method like:

```python
import h5py
import pandas as pd

def read_series(dset: h5py.Dataset) -> pd.Series:
    try:
        # the "dtype" attribute names a (possibly extension) dtype to cast to
        return pd.Series(dset[...], name=dset.name).astype(dset.attrs["dtype"])
    except (KeyError, TypeError):
        # attribute missing, or the dtype isn't registered with pandas
        return pd.Series(dset[...], name=dset.name)
```
Which relies on pandas registry for conversion.
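To illustrate that registry lookup (my example): astype() resolves dtype names through pandas' registry, which is how a third-party dtype becomes reachable by name once its package is imported.

```python
import pandas as pd

# "category" is resolved through pandas' dtype registry by name; an
# ExtensionDtype registered at import time works the same way.
s = pd.Series(["a", "b", "a"])
converted = s.astype("category")
print(converted.dtype)  # category
```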
Of course, this is all a fair amount of work, and may not be worth it. I see how this could be useful for something like genomic ranges, but I'm not that familiar with what can be done in VDJ analysis. What sort of computations would this allow?
Hi all, there is a bunch of extra preprocessing methods associated with TCR/BCR data. I solved this via a custom class that sits on top of AnnData in some code that I'll make public soon. If you want to include TCR input data into AnnData, I would recommend writing a new reading function (.from_10x_vdj or so) that collects everything from the cellranger output and yields an initial AnnData object that people can then manipulate. The output of this could just be additional columns in .obs, so you can stick to the data structure that is already there, in my opinion. There also shouldn't be issues with strings in .obs.
Hi @davidsebfischer,
thanks for joining in!
Hi all, there is a bunch of extra preprocessing methods associated with TCR/BCR data.
Would you mind providing some more details about these preprocessing steps? Is it something you implemented or are you referring to some public methods?
If you want to include TCR input data into anndata, I would recommend writing a new reading function (.from_10x_vdj or so) that collects everything from the cellranger output and yields an initial anndata object that people can then manipulate. The output of this could just be additional columns in .obs so you can stick to the data structure that is there already, in my opinion.
Storing all information in columns is something we also considered and probably the way to go for a first version. It also nicely plays along with scanpy plotting functions. However, some cells appear to have more TCR chains than expected and I was wondering about a data structure that is flexible enough to accommodate an arbitrary number of chains per cell.
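As a hypothetical sketch of that column layout (the names chain_1_locus etc. are made up for illustration), flattening a fixed maximum number of chains per cell into .obs-style columns looks like this, and it also shows where it breaks down: chains beyond the fixed maximum are silently dropped.

```python
import pandas as pd

# Hypothetical "columns in .obs" layout: one set of columns per chain
# slot, up to a fixed maximum number of chains per cell.
cells = {
    "cell1": [("TRA", "CAVR"), ("TRB", "CASS")],
    "cell2": [("TRA", "CAAS"), ("TRA", "CALS"), ("TRB", "CSAR")],
}

max_chains = 2  # fixed maximum: extra chains are silently dropped
records = {}
for cell_id, chains in cells.items():
    row = {}
    for i, (locus, cdr3) in enumerate(chains[:max_chains], start=1):
        row[f"chain_{i}_locus"] = locus
        row[f"chain_{i}_cdr3"] = cdr3
    records[cell_id] = row

obs = pd.DataFrame.from_dict(records, orient="index")
print(obs.columns.tolist())
```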
Cheers, Gregor
cc @szabogtamas
@flying-sheep
Just found this issue for pandas https://github.com/pandas-dev/pandas/issues/21653, and looked at the code for is_string_dtype, and think it probably shouldn't be relied on. Anything whose dtype.kind is O (anything pandas considers an object) returns True. For example:
It's probably not reasonable to assume any column that has objects should be converted to categorical.
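To illustrate the failure mode (a hypothetical demo mirroring the criticised dtype-only check, not pandas' actual code):

```python
import pandas as pd

def naive_is_string(series: pd.Series) -> bool:
    # Mirrors the dtype-only check criticised above, for illustration;
    # this is not pandas' implementation.
    return series.dtype.kind == "O"

# A column of lists: object dtype, but contains no strings at all.
genes = pd.Series([["TRBV1", "TRBJ2"], ["TRAV3"]])
print(naive_is_string(genes))  # True: wrongly "detected" as a string column
```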
Side note: I'm unaware of the rationale for converting all string columns to categorical. Could someone point me towards the reasoning behind this?