Open flying-sheep opened 4 years ago
numeric
Could be better, but I think it gives the important information (what element of the AnnData, and what key) at the end:
'index'
I think this one should work fine now. _index won't work, but I think we give a good error.
containing slashes (unsure if this is still the case)
To my surprise, this one works with dataframes, but not with values in mappings.
import numpy as np
import anndata as ad
from anndata.tests.helpers import gen_adata
a = gen_adata((10, 10))
a.obs["name/with/slashes"] = 1
a.write("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
print(b.obs.columns)
# Index(['obs_cat', 'cat_ordered', 'int64', 'float64', 'uint8',
# 'name/with/slashes'],
a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
a.write_h5ad("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
non-unicode:
I don't get an error for that one.
Update on name/with/slashes in dataframes:
It works, but I'm not sure it's doing the right thing. Zarr does something similar. Truncated output of h5ls -r tmp.h5ad:
/obs Group
/obs/__categories Group
/obs/__categories/cat_ordered Dataset {10}
/obs/__categories/obs_cat Dataset {7}
/obs/_index Dataset {10}
/obs/cat_ordered Dataset {10}
/obs/float64 Dataset {10}
/obs/int64 Dataset {10}
/obs/name Group
/obs/name/with Group
/obs/name/with/slashes Dataset {10}
/obs/obs_cat Dataset {10}
/obs/uint8 Dataset {10}
Should we normalize or error for these? I'm thinking normalize, since I think slashes will be fairly common in column/group names. E.g. "CD4+/CD8-".
It would be good to use some external standard for normalizing these names. I'm not sure where to find this kind of thing though.
[ ] Option A mangle/encode the names (e.g. replacing slashes with the unicode division slash /→∕), which modifies the names, so things aren’t as the user specified, and document this mangling in the format documentation.
I think mangling is inherently problematic, as it’ll result in subtle bugs arising from just a few columns not matching the names you put in. In this case it’s less bad because we wouldn’t mangle data (like gene names) but manually-addressed metadata, and we can unmangle again, so the bugs would be in other people’s code reading AnnData.
[ ] Option B store the names somewhere else (which makes the file less explorable since you now need code instead of a generic CLI tool to descend into the HDF5 file)
[ ] Option C leave things as is, fix all code that iterates over HDF5-stored dataframe columns, and document this so people wanting to read the thing don’t make a mistake
If it works, why does it matter if it’s internally represented as a tree of groups? The only problem is code that loops over dataframe columns, which would have to be changed to match and be a source of bugs for people wanting to read AnnData who don’t know this.
Option A mangle/encode the names (e.g. replacing slashes with the unicode division slash /→∕), which modifies the names, so things aren’t as the user specified, and document this mangling in the format documentation.
I was thinking we'd only mangle to and from disk, and it should be done in a completely reversible way. Basically, I was hoping we could just escape characters that filesystems treat as special.
I'm not sure if this can actually be done with hdf5. The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google) says group names must be ascii and can't contain a "." or "/". However, group names can contain a "." and be unicode (mentioned in other parts), so maybe there's a way for them to include "/".
Option C ... fix all code that iterates over HDF5-stored dataframe columns
What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.
I was thinking we'd only mangle to and from disk, and it should be done in a completely reversible way.
You mean exactly the way I described above with the division slash? We can of course also escape the division slash by doubling it in the off chance someone uses one as column name, then stuff is reversible.
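A minimal sketch of what "reversible" could look like, assuming a percent-style scheme with a dedicated escape character. The names `ESC`, `escape_key` and `unescape_key` are hypothetical and not part of anndata or of any on-disk format. One caveat about the doubling idea: if `/` becomes one division slash and a literal division slash becomes two, then `/∕` and `∕/` both encode to three division slashes, so that particular variant would not round-trip. Giving the escape character its own code avoids the collision.

```python
ESC = "%"  # hypothetical escape character, chosen only for illustration


def escape_key(name: str) -> str:
    """Encode a key so the result contains no '/'."""
    # Escape the escape character first so decoding stays unambiguous.
    return name.replace(ESC, ESC + "25").replace("/", ESC + "2F")


def unescape_key(key: str) -> str:
    """Invert escape_key."""
    # Every ESC in an encoded key starts either an ESC+"2F" or an ESC+"25" token.
    return key.replace(ESC + "2F", "/").replace(ESC + "25", ESC)


def doubling_escape(name: str) -> str:
    """The doubling variant: '/' -> one division slash, literal division slash -> two."""
    return name.replace("\u2215", "\u2215\u2215").replace("/", "\u2215")


for original in ["CD4+/CD8-", "name/with/slashes", "100%", "%2F", "%/"]:
    encoded = escape_key(original)
    assert "/" not in encoded
    assert unescape_key(encoded) == original

# Two different names collapse to the same encoded key under doubling.
assert doubling_escape("/\u2215") == doubling_escape("\u2215/")
```

Whether HDF5 or zarr tooling would display such keys in a readable way is a separate question that this sketch does not address.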
The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google)
fucking amazing, their online PDF viewer doesn’t even support Ctrl+F. Do they actively want to make developers’ lives harder?
I was hoping we could just escape characters that filesystems treat as special
HAhahahaHA “just” 🤪 Oh god, if we had to support filesystems we could either invest days into researching all the legacy crap that went into the 3 OS’s main file systems during decades of computing history or restrict ourselves to [A-Za-z_-] (or so! no idea if dashes and underscores are allowed everywhere! I mean, colons aren’t, so everything is possible 😵)
What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.
It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?
You mean exactly the way I described above with the division slash?
Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should. I was thinking more of escaping like \/.
Do they actively want to make developers’ lives harder?
Well, they have pivoted to emphasize consulting...
HAhahahaHA “just”
Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that. There is a normalize_key argument, but it might just enforce lowercasing.
It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?
Since the order of the columns is important for dataframes, we save the names in the correct order in the attributes, then do {k: h5group[k] for k in colnames}. Here's the logic: https://github.com/theislab/anndata/blob/4440b90ff3dff213b4c512478e21426cf210368d/anndata/_io/h5ad.py#L460-L469
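To illustrate why reading by saved column names works while iterating over the group would not, here is a toy model in which plain dicts stand in for HDF5 groups. This is not h5py and not the anndata reader; `store` and `lookup` are hypothetical helpers that only mimic path-style keys.

```python
def store(group: dict, path: str, value) -> None:
    """Store value under path, creating a nested group at each '/'."""
    *parents, leaf = path.split("/")
    for part in parents:
        group = group.setdefault(part, {})
    group[leaf] = value


def lookup(group: dict, path: str):
    """Resolve a '/'-separated path by descending through nested groups."""
    node = group
    for part in path.split("/"):
        node = node[part]
    return node


obs = {}
# The column order is kept separately, like the attribute described above.
colnames = ["int64", "name/with/slashes"]
for name in colnames:
    store(obs, name, [1] * 10)

# Reading by the saved names works, because the lookup follows the path.
columns = {k: lookup(obs, k) for k in colnames}
assert list(columns) == colnames

# Iterating over the group only sees top-level members, and "name" is a group.
top_level = list(obs)
assert top_level == ["int64", "name"]
assert isinstance(obs["name"], dict)
```

A reader that loops over the members of the group instead of using the saved names would see `name` as a column and would need to check whether each member is a group.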
I was thinking more of escaping like \/.
I’m pretty sure this won’t work. Escaping is something that depends on the interpreter. If HDF5 group names work like file system inode names in this regard, it won’t interpret escape characters; it’ll simply do *group_names, value_name = value_path.split('/'), making it impossible to put a literal slash anywhere in a group name.
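Under that assumption (a plain split on every slash, which is a guess about HDF5's behaviour and not something confirmed here), a backslash escape would not survive. `split_path` is a hypothetical helper that only restates the expression above:

```python
def split_path(value_path: str):
    """Split on every '/', with no notion of escape characters (assumed behaviour)."""
    *group_names, value_name = value_path.split("/")
    return group_names, value_name


# The key is meant to be the single name r"gene+\/gene-" inside "obs".
group_names, value_name = split_path("obs/gene+\\/gene-")

# The backslash stays behind as an ordinary character of the parent group name.
assert group_names == ["obs", "gene+\\"]
assert value_name == "gene-"
```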
Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should.
As I said, if we go the mangling route, we should document whatever we do and de-mangle group names when reading. But yeah, it’s probably a good idea to not use lookalike characters. We should use an escape character that looks outlandishly unicody and is very unlikely to be used in an old anndata object so we don’t accidentally “unescape” something that was never escaped to begin with. Word uses some like that for their totally-not-LaTeX UnicodeMath (PDF)
Well, they have pivoted to emphasize consulting...
Well, did I ever hit the nail on the head.
Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that.
Ah fuck, right, Zarr does that. Well, if we want to use it more prominently we have to stop using group names to store column names, plain and simple. The pain of figuring out and escaping all known file systems’ disallowed characters is one thing, but then there’s also device files (which on windows don’t have absolute paths and are named e.g. com1), different file systems are case sensitive while others aren’t, starting a column name with / might introduce a security risk, and I bet there’s more hidden nastiness. Just no.
There’s a too generic/unhelpful error when writing DataFrame column names that are
'index'
non-unicode: issue_52_demo.tar.gz
52 has more details