scverse / anndata-tutorials

Use cases for anndata.
BSD 3-Clause "New" or "Revised" License
13 stars · 12 forks

(feat): `{read,write}_dispatched` notebook #17

Closed ilan-gold closed 1 year ago

review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



ilan-gold commented 1 year ago

Note to self: update the notebook to read from a store instead of a string so that we can use the consolidated metadata properly (I think)
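Something like this is what I have in mind (rough sketch; the store name and the pass-through callback are just placeholders):

import zarr
from anndata.experimental import read_dispatched

# open the zarr group up front (assumes the store already has consolidated metadata, i.e. a .zmetadata key)
group = zarr.open_consolidated("pbmc3k.zarr", mode="r")  # placeholder store name

def callback(read_func, elem_name, elem, iospec):
    # no-op callback: read every element with the default reader
    return read_func(elem)

adata = read_dispatched(group, callback)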

ivirshup commented 1 year ago

python3 -m http.server 8080

Do you think running this via subprocess would be fine? I figure it would keep this more self-contained + runnable on binder.

update notebook to read from store instead of string so that we can use the consolidated metadata properly

zarr.open_consolidated instead of zarr.open?
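Something like this, maybe (sketch; the URL and store name are placeholders for wherever http.server is serving from). open_consolidated should only need to fetch the single .zmetadata key up front instead of probing every group:

import zarr

store = zarr.storage.FSStore("http://localhost:8080/pbmc3k.zarr")  # placeholder URL
g = zarr.open_consolidated(store, mode="r")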

ilan-gold commented 1 year ago

I think the thing about subprocess is that I don't think you'll be able to monitor what's going in and out. We could route the output to another file, but I'll have to fiddle around with it.

ivirshup commented 1 year ago

I think the thing about subprocess is that I don't think you'll be able to monitor what's going in and out.

IDK that it's critical to monitor this.

But this should be possible with something like:

from subprocess import Popen, PIPE

# Popen returns immediately, so the server keeps running in the background
proc = Popen(["python3", "-m", "http.server", "8080"], stdout=PIPE, stderr=PIPE)

ivirshup commented 1 year ago

@ilan-gold, also any file that's new enough for read_dispatched to work on wouldn't need backwards compat.

flying-sheep commented 1 year ago

subprocess.run is the preferred interface
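e.g. (sketch):

import subprocess, sys

subprocess.run([sys.executable, "-m", "http.server", "8080"])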

ivirshup commented 1 year ago

@flying-sheep I believe run will wait until the process closes:

[screenshot of the subprocess.run documentation, noting that it waits for the command to complete]

but we want something that continues to run on another thread.

Looking into it a bit more, I think this would work (but not on windows):

import os
from subprocess import Popen, PIPE

# start the server in the background, capturing its request log (written to stderr)
proc = Popen(["python3", "-m", "http.server", "8080"], stderr=PIPE)
# make the pipe non-blocking so reading doesn't hang when there's no new output
os.set_blocking(proc.stderr.fileno(), False)

# Run this each time you want to see the server output
for line in proc.stderr.readlines():
    print(line.decode("utf-8").strip())

flying-sheep commented 1 year ago

Ah, good point. I usually recommend it because it has a lot of good defaults and shortcuts while still accepting all Popen args, but you’re of course right, it will wait.

I guess in that case I’d only recommend using sys.executable and be done with bikeshedding lol
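i.e. something like (sketch):

import sys
from subprocess import Popen, PIPE

# sys.executable guarantees we use the same interpreter the notebook is running under
proc = Popen([sys.executable, "-m", "http.server", "8080"], stderr=PIPE)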

ilan-gold commented 1 year ago

I guess people don't necessarily have to see what is going on, but I also think it's a nice thing. Otherwise, there isn't much ostensibly making this different from either the dask demo or a simpler read_dispatched example that just downloads the data first. But maybe people are already familiar enough with access patterns that they'll understand what is going on here and why it is significant.

ilan-gold commented 1 year ago

So if we want, we can advise that they do the above on non-Windows platforms and just use a separate shell on Windows.
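i.e. something along these lines in the notebook (sketch; os.set_blocking on a pipe isn't supported on Windows):

import os, sys
from subprocess import Popen, PIPE

if sys.platform != "win32":
    # start the server in the background and make its request log readable without blocking
    proc = Popen([sys.executable, "-m", "http.server", "8080"], stderr=PIPE)
    os.set_blocking(proc.stderr.fileno(), False)
else:
    print("Run `python -m http.server 8080` in a separate terminal before continuing")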

ilan-gold commented 1 year ago

Regarding the Popen + os.set_blocking snippet above:

I am having some trouble getting this to work....

ilan-gold commented 1 year ago

It seems like piping is not working as expected for me.

ivirshup commented 1 year ago

Yeah, I also gave it a shot and was running into dask blocking during read... For some reason it worked if I just used sc.datasets.pbmc3k_processed().raw.to_adata()?

But if this only happens when the server is run from the notebook, I assume it's some concurrency issue, and it might be difficult to figure out whether the problem is jupyter, dask, or os.set_blocking.


While I would really like to be able to run this whole notebook without having to run something from a separate terminal, it's not required.

I would also note that there is just a ton of output from the initial read, even with consolidated metadata (since all the other values need to load). This might be a little hard to parse.

ilan-gold commented 1 year ago

@ivirshup That is true, I also noticed that, and I'm not sure what to do about it. Maybe the volume of output makes this not worth using, but since the whole point of this is "how can we do certain operations well?", showing it may be essential.

ivirshup commented 1 year ago

I'm going to merge since this is one of the last things blocking a release candidate. We can clean it up further separately.