Closed slinnarsson closed 6 years ago
The possibility to do slicing is what makes it interesting for me. Honestly I am not enthusiastic with ds.r.Gene ds.c.CellID since they contain single-digit attribute names. Maybe the more verbose ds.col.CellID/ds.row.Gene or ds.ca.CellID/ds.ra.Gene are more readable and pythonic.
But feel free to make the change as you prefer. Maybe mark the change in API with a major version bump (loompy 2.0?)
Ok, good point. I will go with ca
and ra
then. I'll try to find time to do the same for graphs, so we could do
ds.edges.KNN = sparse.coo_matrix(...)
g = ds.edges.KNN
del ds.edges.KNN
And I guess it would make sense to have something for layers too:
e = ds.layer.enrichment
ds.layer.unspliced = ...
del ds.layer.unspliced
(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)
Fine. To me it would feel logical to have direct access to global attributes like this:
ds.title ds.description
Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden
Från: Sten Linnarsson [notifications@github.com] Skickat: den 16 november 2017 06:25 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)
Ok, good point. I will go with ca and ra then. I'll try to find time to do the same for graphs, so we could do
ds.edges.KNN = sparse.coo_matrix(...) g = ds.edges.KNN del ds.edges.KNN
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344819587, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0weHVghRFTVCwqoqCByebDKCkgiks5s28c-gaJpZM4Qfn2E.
That sounds good to me.
On Nov 15, 2017, at 10:49 PM, Sten Linnarsson notifications@github.com wrote:
Ping @gioelelm https://github.com/gioelelm @pl-ki https://github.com/pl-ki @kasiletti https://github.com/kasiletti @JobLeonard https://github.com/jobleonard I have been working on a more pythonic and expressive API for attributes in loompy. Currently we access attributes as follows:
ds.col_attrs[“CellID”] # get a column attribute ds.row_attrs[“Gene”] # Get a row attribute
ds.set_attr(“Name”, values, axis=1) # Set a column attribute ds.set_attr(“Name”, values, axis=0) # Set a row attribute
ds.delete_attr(“Name”, axis=1) # Delete a column attribute ds.delete_attr(“Name”, axis=0) # Delete a row attribute We also have convenience access to attributes directly on the LoomConnection object:
ds.Gene # Happens to be a row attribute But this can collide with object attributes, and row/col attributes can collide with each other.
I want instead to do this::
ds.c.Gene # Get a column attribute ds.c[“Gene”] # Get a column attribute
ds.c.Gene = values # Set a column attribute ds.c[“Gene”] = values # Set a column attribute
del ds.c.Gene # Delete a column attribute del ds.c[“Gene”] # Delete a column attribute And same for rows via ds.r
I have written the necessary manager for this. It also lazy-loads the values, so nothing is loaded until needed, and it keeps track of the corresponding data in HDF5. It supports slicing, so you can slice all the attributes at once (I want to use this for an improved scan() method).
Now, I want to add this to loompy. I can keep col_attr,row_attr, set_attr and delete_attr for backwards compatibility. BUT, I can’t keep the direct access attributes (i.e. ds.Gene, ds.Clusters). This will break some code.
Is it ok? If you agree, I will make the change and let you know when you will need to change from ds.Gene to ds.r.Gene etc. You also can fix your code right now by changing ds.Gene to ds.row_attrs[“Gene”] if you want.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/linnarsson-lab/loompy/issues/20, or mute the thread https://github.com/notifications/unsubscribe-auth/ADI3QBBurGIzd0MnjHuislNY-QR06JvEks5s21xPgaJpZM4Qfn2E.
This looks great, and I agree with all suggestions so far.
To repeat what Gioele hinted at: if we drop ds.col_attrs["CellID"]
in favour of ds.ca["CellID"]
(and so on), this will require a major version bump. That's what semver is for: signalling breaks in backwards compatibility.
We can also just introduce it and mark the old ways as deprecated, meaning they will be removed when we do have enough changes for a major version bump.
We also need to make sure all documentation is updated, to avoid misinforming people with outdated information. Here's a short list of what I can think of:
col_attrs
row_attrs
set_attr
delte_attr
Is that everything? Any other places that would need updating?
(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)
We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repack
to do a kind of "copying garbage collection" on the loom file.
It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"
Not sure that the doc mentions a possibility to delete metadata or layers?
Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden
Från: Job van der Zwan [notifications@github.com] Skickat: den 16 november 2017 10:19 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)
(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)
We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repackhttps://support.hdfgroup.org/HDF5/Tutor/cmdtooledit.html to do a kind of "copying garbage collection" on the loom file.
It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344862636, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0szIp8UVqyqQ0pEvf45xddYmHoCks5s2_4EgaJpZM4Qfn2E.
Ok found delete_attr(), but no doc about deleting a layer.
Peter Lönnerberg Linnarsson Lab Laboratory of Molecular Neurobiology Dept. of Medical Biochemistry and Biophysics Karolinska Institutet SE-171 77 Stockholm Sweden
Från: Job van der Zwan [notifications@github.com] Skickat: den 16 november 2017 10:19 Till: linnarsson-lab/loompy Kopia: Peter Lönnerberg; Mention Ämne: Re: [linnarsson-lab/loompy] Improved attribute management (#20)
(although maybe deleting layers is not a good idea; it will probably just make a big hole in the file)
We went through a similar thing with removing the tile pyramid. The only way to fix these holes is to use h5repackhttps://support.hdfgroup.org/HDF5/Tutor/cmdtooledit.html to do a kind of "copying garbage collection" on the loom file.
It's not something we can do from within python, but we might want to mention it in the loompy documentation? Like an FAQ section with "Why isn't my loom file smaller after deleting old, unused metadata/layers/etc?"
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/loompy/issues/20#issuecomment-344862636, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AInXb0szIp8UVqyqQ0pEvf45xddYmHoCks5s2_4EgaJpZM4Qfn2E.
(sorry, this turned into a bit of a tangent as I wrote this)
Given the potentially vast size of layers, making it easy to add and delete them and accidentally cause the loom file to grow humongous without the user noticing is probably something to avoid.
The data in all other attributes are tiny by comparison, so not likely to become an issue.
With that in mind, the only way to compact a loom file in general is creating a new file and copy all information over from the old one. h5repack
is more or less just that, but as a fast tool written in C. Except for performance there is nothing stopping you from doing the same in Python. It would require some thinking about how to copy the whole data matrix efficiently though (probably in chunks of 64 rows, since that is also the chunk size we use in loom at the moment).
Resolved by loompy v2.0
Ping @gioelelm @pl-ki @kasiletti @JobLeonard
I have been working on a more pythonic and expressive API for attributes in loompy. Currently we access attributes as follows:
We also have convenience access to attributes directly on the LoomConnection object:
But this can collide with object attributes, and row/col attributes can collide with each other.
I want instead to do this::
And same for rows via
ds.r
I have written the necessary manager for this. It also lazy-loads the values, so nothing is loaded until needed, and it keeps track of the corresponding data in HDF5. It supports slicing, so you can slice all the attributes at once (I want to use this for an improved scan() method).
Now, I want to add this to loompy. I can keep
col_attr
,row_attr
,set_attr
anddelete_attr
for backwards compatibility. BUT, I can’t keep the direct access attributes (i.e. ds.Gene, ds.Clusters). This will break some code.Is it ok? If you agree, I will make the change and let you know when you will need to change from
ds.Gene
tods.r.Gene
etc. You also can fix your code right now by changingds.Gene
tods.row_attrs[“Gene”]
if you want.