linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License
139 stars 37 forks source link

Improved attribute management #20

Closed slinnarsson closed 6 years ago

slinnarsson commented 6 years ago

Ping @gioelelm @pl-ki @kasiletti @JobLeonard

I have been working on a more pythonic and expressive API for attributes in loompy. Currently we access attributes as follows:

ds.col_attrs[“CellID”]   # get a column attribute
ds.row_attrs[“Gene”]   # Get a row attribute

ds.set_attr(“Name”, values, axis=1)   # Set a column attribute
ds.set_attr(“Name”, values, axis=0)   # Set a row attribute

ds.delete_attr(“Name”, axis=1)  # Delete a column attribute
ds.delete_attr(“Name”, axis=0)  # Delete a row attribute

We also have convenience access to attributes directly on the LoomConnection object:

ds.Gene  # Happens to be a row attribute

But this can collide with object attributes, and row/col attributes can collide with each other.

I want instead to do this::

ds.c.Gene  # Get a column attribute
ds.c[“Gene”]  # Get a column attribute

ds.c.Gene = values  # Set a column attribute
ds.c[“Gene”] = values  # Set a column attribute

del ds.c.Gene  # Delete a column attribute
del ds.c[“Gene”]  # Delete a column attribute

And same for rows via ds.r

I have written the necessary manager for this. It also lazy-loads the values, so nothing is loaded until needed, and it keeps track of the corresponding data in HDF5. It supports slicing, so you can slice all the attributes at once (I want to use this for an improved scan() method).

Now, I want to add this to loompy. I can keep col_attr,row_attr, set_attr and delete_attr for backwards compatibility. BUT, I can’t keep the direct access attributes (i.e. ds.Gene, ds.Clusters). This will break some code.

Is it ok? If you agree, I will make the change and let you know when you will need to change from ds.Gene to ds.r.Gene etc. You also can fix your code right now by changing ds.Gene to ds.row_attrs[“Gene”] if you want.

gioelelm commented 6 years ago

The possibility to do slicing is what makes it interesting for me. Honestly I am not enthusiastic with ds.r.Gene ds.c.CellID since they contain single-digit attribute names. Maybe the more verbose ds.col.CellID/ds.row.Gene or ds.ca.CellID/ds.ra.Gene are more readable and pythonic.

But feel free to make the change as you prefer. Maybe mark the change in API with a major version bump (loompy 2.0?)

slinnarsson commented 6 years ago

Ok, good point. I will go with ca and ra then. I'll try to find time to do the same for graphs, so we could do

ds.edges.KNN = sparse.coo_matrix(...)
g = ds.edges.KNN
del ds.edges.KNN

And I guess it would make sense to have something for layers too:

e = ds.layer.enrichment
ds.layer.unspliced = ...
del ds.layer.unspliced

(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)

pl-ki commented 6 years ago

Fine. To me it would feel logical to have direct access to global attributes like this:

ds.title ds.description

Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden


Från: Sten Linnarsson [notifications@github.com] Skickat: den 16 november 2017 06:25 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)

Ok, good point. I will go with ca and ra then. I'll try to find time to do the same for graphs, so we could do

ds.edges.KNN = sparse.coo_matrix(...) g = ds.edges.KNN del ds.edges.KNN

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344819587, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0weHVghRFTVCwqoqCByebDKCkgiks5s28c-gaJpZM4Qfn2E.

kasiletti commented 6 years ago

That sounds good to me.

On Nov 15, 2017, at 10:49 PM, Sten Linnarsson notifications@github.com wrote:

Ping @gioelelm https://github.com/gioelelm @pl-ki https://github.com/pl-ki @kasiletti https://github.com/kasiletti @JobLeonard https://github.com/jobleonard I have been working on a more pythonic and expressive API for attributes in loompy. Currently we access attributes as follows:

ds.col_attrs[“CellID”] # get a column attribute ds.row_attrs[“Gene”] # Get a row attribute

ds.set_attr(“Name”, values, axis=1) # Set a column attribute ds.set_attr(“Name”, values, axis=0) # Set a row attribute

ds.delete_attr(“Name”, axis=1) # Delete a column attribute ds.delete_attr(“Name”, axis=0) # Delete a row attribute We also have convenience access to attributes directly on the LoomConnection object:

ds.Gene # Happens to be a row attribute But this can collide with object attributes, and row/col attributes can collide with each other.

I want instead to do this::

ds.c.Gene # Get a column attribute ds.c[“Gene”] # Get a column attribute

ds.c.Gene = values # Set a column attribute ds.c[“Gene”] = values # Set a column attribute

del ds.c.Gene # Delete a column attribute del ds.c[“Gene”] # Delete a column attribute And same for rows via ds.r

I have written the necessary manager for this. It also lazy-loads the values, so nothing is loaded until needed, and it keeps track of the corresponding data in HDF5. It supports slicing, so you can slice all the attributes at once (I want to use this for an improved scan() method).

Now, I want to add this to loompy. I can keep col_attr,row_attr, set_attr and delete_attr for backwards compatibility. BUT, I can’t keep the direct access attributes (i.e. ds.Gene, ds.Clusters). This will break some code.

Is it ok? If you agree, I will make the change and let you know when you will need to change from ds.Gene to ds.r.Gene etc. You also can fix your code right now by changing ds.Gene to ds.row_attrs[“Gene”] if you want.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/linnarsson-lab/loompy/issues/20, or mute the thread https://github.com/notifications/unsubscribe-auth/ADI3QBBurGIzd0MnjHuislNY-QR06JvEks5s21xPgaJpZM4Qfn2E.

JobLeonard commented 6 years ago

This looks great, and I agree with all suggestions so far.

To repeat what Gioele hinted at: if we drop ds.col_attrs["CellID"] in favour of ds.ca["CellID"] (and so on), this will require a major version bump. That's what semver is for: signalling breaks in backwards compatibility.

We can also just introduce it and mark the old ways as deprecated, meaning they will be removed when we do have enough changes for a major version bump.

We also need to make sure all documentation is updated, to avoid misinforming people with outdated information. Here's a short list of what I can think of:

Is that everything? Any other places that would need updating?

JobLeonard commented 6 years ago

(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)

We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repack to do a kind of "copying garbage collection" on the loom file.

It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"

pl-ki commented 6 years ago

Not sure that the doc mentions a possibility to delete metadata or layers?

Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden


Från: Job van der Zwan [notifications@github.com] Skickat: den 16 november 2017 10:19 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)

(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)

We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repackhttps://support.hdfgroup.org/HDF5/Tutor/cmdtooledit.html to do a kind of "copying garbage collection" on the loom file.

It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344862636, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0szIp8UVqyqQ0pEvf45xddYmHoCks5s2_4EgaJpZM4Qfn2E.

pl-ki commented 6 years ago

Ok found delete_attr(), but no doc about deleting a layer.

Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden


Från: Job van der Zwan [notifications@github.com] Skickat: den 16 november 2017 10:19 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)

(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)

We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repackhttps://support.hdfgroup.org/HDF5/Tutor/cmdtooledit.html to do a kind of "copying garbage collection" on the loom file.

It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344862636, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0szIp8UVqyqQ0pEvf45xddYmHoCks5s2_4EgaJpZM4Qfn2E.

JobLeonard commented 6 years ago

(sorry, this turned into a bit of a tangent as I wrote this)

Given the potentially vast size of layers, making it easy to add and delete them and accidentally cause the loom file to grow humongous without the user noticing is probably something to avoid.

The data in all other attributes are tiny by comparison, so not likely to become an issue.

With that in mind, the only way to compact a loom file in general is creating a new file and copy all information over from the old one. h5repack is more or less just that, but as a fast tool written in C. Except for performance there is nothing stopping you from doing the same in Python. It would require some thinking about how to copy the whole data matrix efficiently though (probably in chunks of 64 rows, since that is also the chunk size we use in loom at the moment).

slinnarsson commented 6 years ago

Resolved by loompy v2.0