linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Clean up global namespace #89

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Kimberly had trouble getting loom to run within Jupyter and Spyder - see the attached error logs. She's on a Mac and running Python 3. @gioelelm looked into it, since he has the most experience with using loom in that context, and "fixed it" by commenting out everything after the first line of __init__.py. If that still works, I'm kind of wondering what that file is for in the first place...

I suspect the cause of the bug might be the recent switch to gevent in loom_server.py. It monkey-patches the whole Python standard library, and Jupyter itself runs kernels in a server environment, so I can see the two interfering with each other.
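To make the suspicion concrete: a typical gevent server does something like the following at import time (this is an assumption about what loom_server.py contains, not a verified excerpt):

# Typical gevent setup (assumed, not copied from loom_server.py):
from gevent import monkey
monkey.patch_all()  # swaps out socket, ssl, threading, time, etc. process-wide

Once that runs, every module in the process - including an embedding Jupyter kernel - gets the patched standard library, whether it expects it or not.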

The thing is, I'm not sure why we're importing everything into one big blob to begin with. Aside from __init__.py and _version.py, we have these files in the python folder:

• loom (the command line tool)
• loompy.py (the library)
• loom_cache
• loom_pipeline
• loom_server

The loom CLI uses everything, but does loompy.py need to be aware of the cache, pipeline and server code to function? If not, we can probably avoid a lot of bugs (and future "hey, loom doesn't work on my machine" complaints) by removing them from the global namespace.
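For illustration, a trimmed __init__.py could expose only the core library and leave the CLI to import the cache/pipeline/server modules itself. The exact names re-exported from loompy.py are an assumption here:

# __init__.py (sketch): expose only the core data-manipulation API.
# loom_cache, loom_pipeline and loom_server would be imported by the
# loom CLI directly, so `import loom` no longer pulls in gevent et al.
from .loompy import LoomConnection, connect, create  # assumed loompy.py names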

Jupyter Error.txt Spyder Error.txt

slinnarsson commented 7 years ago

It would probably make sense to break loompy.py out into a standalone library, and then make the rest just import it. At some point I would also like to clean out the stuff from loompy.py that doesn't belong there, such as the clustering and feature selection code, and leave just the data manipulation. I don't have time to fix it now, but feel free to go ahead.

/Sten



JobLeonard commented 7 years ago

I'm already looking into it, but aside from the general difficulty of untangling code, a problem for me is that I don't fully know what does and does not belong in loom and loompy.py. I'll give it a try though.

JobLeonard commented 7 years ago

So, to continue with our plans from yesterday: we're going to split out the rows and columns into compressed binary files.

It occurred to me yesterday that there is one snag with that plan: a purely static server conflicts with our goal of minimising GET requests. Fetching 300 rows, for example, would mean one request per row. So a server that merges those rows before sending them back is still desirable.

One option is to stay with Flask. It's what we already have, and since the bottleneck is HDF5, splitting rows/columns into separate files would still eliminate that issue.
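To make the Flask option concrete, here is a minimal sketch of a merging endpoint, assuming one gzipped pickle file per row (the URL, query parameter and file layout are all illustrative, not existing code):

import gzip
import json
import os
import pickle

from flask import Flask, Response, request

app = Flask(__name__)
ROW_DIR = 'rows'  # assumption: one file per row, e.g. rows/000042.pklz

@app.route('/rows')
def merged_rows():
    # e.g. GET /rows?rows=1,2,3 -> one gzipped JSON blob instead of three requests
    indices = [int(i) for i in request.args.get('rows', '').split(',') if i]
    merged = {}
    for i in indices:
        with gzip.open(os.path.join(ROW_DIR, '%06d.pklz' % i), 'rb') as f:
            merged[i] = pickle.load(f)
    payload = gzip.compress(json.dumps(merged).encode('utf-8'))
    return Response(payload, mimetype='application/json',
                    headers={'Content-Encoding': 'gzip'})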

Another option would be to use Go. I'm familiar with the language (more so than with Python, actually), it is a lot more efficient, it was pretty much designed for writing servers, and it has good concurrency primitives out of the box. So reading in a few hundred files, merging them, and returning gzipped JSON should be much faster.

I'm personally leaning towards opening a GitHub repository for the second option.

Of course, we could also say "screw it, we'll send 300 GET requests", but that would severely impact performance.

JobLeonard commented 7 years ago

Actually, no, let's stick with Python for now: it will also make generating the row/column files easier, and I don't expect performance to be that big of an issue once they're split out.

We can just use pickle. It doesn't compress on its own, but we can wrap it in gzip, and in Python 3 the fast C implementation is used by default, for much better speed. Pickle can serialize numpy arrays, but we don't actually want to store those: our current code converts numpy arrays to plain lists before turning them into JSON anyway!

As for how to divide the code for this: currently I have the tiling generation set up as a LoomConnection method, which can be called from the loom CLI with the appropriate command. I'd like to repeat this approach for generating rows/columns.
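As a sketch of what that could look like (the method name, per-row indexing and file layout are assumptions, not existing loompy API):

import gzip
import os
import pickle
import pickletools

def export_rows(self, out_dir):
    # Hypothetical LoomConnection method: dump each row as its own
    # gzipped, optimized pickle, mirroring the tiling approach.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(self.shape[0]):
        row = self[i, :].tolist()  # plain list, matching what the JSON code sends
        data = pickletools.optimize(pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL))
        with gzip.open(os.path.join(out_dir, '%06d.pklz' % i), 'wb') as f:
            f.write(data)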

Using Pickle

Example source code for writing and then reading back a compressed pickle file:

import gzip
import pickle

# store the object
my_object = {'a': 'blah', 'b': list(range(10))}
with gzip.open('testPickleFile.pklz', 'wb') as f:
    pickle.dump(my_object, f)

# restore the object
with gzip.open('testPickleFile.pklz', 'rb') as f:
    my_new_object = pickle.load(f)

print(my_object)
print(my_new_object)

We can further optimise for size/performance by adding an intermediate optimize step:

pickletools.optimize(picklestring): Returns a new equivalent pickle string after eliminating unused PUT opcodes. The optimized pickle is shorter, takes less transmission time, requires less storage space, and unpickles more efficiently.

https://docs.python.org/3/library/pickletools.html#pickletools.optimize

So if I read that correctly, pickle.dump(my_object, f) would be turned into f.write(pickletools.optimize(pickle.dumps(my_object))) and Everything Should Just Work (but we'll see about that).
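For reference, the write side with the optimize step folded in, as a runnable sketch:

import gzip
import pickle
import pickletools

my_object = {'a': 'blah', 'b': list(range(10))}
data = pickletools.optimize(pickle.dumps(my_object, protocol=pickle.HIGHEST_PROTOCOL))
with gzip.open('testPickleFile.pklz', 'wb') as f:
    f.write(data)  # f.write, not f.compress: gzip file objects have no compress method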