mrocklin / dask-tutorial


Datashader notebook #2

Open mrocklin opened 1 year ago

mrocklin commented 1 year ago

@rrpelgrim and I just went through things. I recommend the following narrative:

  1. Welcome! We're going to make it easy to interactively visualize N million points.
  2. Beloved matplotlib doesn't work in this case (here is an image showing it not working). We'll do this with datashader instead and get beautiful images like this ... (no code here, just images copied in)
  3. Let's load some data and plot it (maybe non-interactive at first)

    This takes a while. If you want to take a look, check out the profile tab in the dashboard (but we won't explain it much here in the interests of time).

  4. This doesn't look good. Mostly that's because some data is very far away. Exercise: use pandas syntax to filter to rows where latitude is between X and Y and longitude is between X and Y (see the code sketch after this list)
  5. Visualize again, how does it look? (Hopefully it looks beautiful)
  6. This takes a long time though, how can we make it faster? Let's try persisting in memory. What happens?
  7. OK our cluster ran out of memory. What are some solutions?
  8. Let's slim down our dataset using one of the following approaches:
    • nicer dtypes (particularly pyarrow strings and categoricals)
    • just remove some data
    Great! Does it fit in memory? How much faster does this make a render?
  9. Let's go interactive with pan/zoom.
  10. ... TODO: matt figures out how to make this even faster
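
A minimal sketch of what steps 4, 6, and 8 might look like in code. The parquet path, the latitude/longitude bounds, and the column names below are placeholders in the spirit of the NYC Taxi data, not code from the notebook:

# assumed imports and placeholder values, for illustration only
import dask.dataframe as dd

ddf = dd.read_parquet("s3://path/to/nyc-taxi.parquet")  # placeholder path

# step 4: plain pandas syntax to drop the far-away points (placeholder bounds)
ddf = ddf[
    (ddf.dropoff_latitude > 40.6) & (ddf.dropoff_latitude < 40.9)
    & (ddf.dropoff_longitude > -74.1) & (ddf.dropoff_longitude < -73.7)
]

# step 8: slim the dataset with nicer dtypes (hypothetical column names)
ddf = ddf.astype({
    "payment_type": "category",
    "store_and_fwd_flag": "string[pyarrow]",
})

# step 6: keep the slimmed data in cluster memory so re-renders are fast
ddf = ddf.persist()
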
avriiil commented 1 year ago

@mrocklin - here's the stripped-down version of the notebook. I'm in meetings for the rest of the day, so I'll let you take it from here. Happy to pick things up again when I wake up. https://github.com/rrpelgrim/dask-tutorial-mrocklin/blob/main/datashader-basic.ipynb

mrocklin commented 1 year ago

Thanks @rrpelgrim, playing with it now. I'm able to get this down to about 2s so far. I don't have the interactive stuff working though; that code seems a bit strange currently. I'll report back in a while.

mrocklin commented 1 year ago

Pushed up to https://github.com/mrocklin/dask-tutorial/blob/main/2-dataframes-at-scale.ipynb

It doesn't flow pedagogically yet, and I had to get rid of interaction (though I hope to get it back after talking to the PyViz folks), but everything computes close to where it should. We can process the large 800M-row dataset in about three seconds. If I can get interaction in there working well, that should leave the students with a positive experience.

Interactivity was weird. It resulted in far more computation, lots of slowness, and yet not actually the ability to zoom around and see things re-render. It was like the worst of both worlds. I'm sure that there is a nicer way to go about this.

If anyone has a chance to try out the notebook please do.

mrocklin commented 1 year ago

OK, the next thing I want I think is interactivity. In the previous notebook I think we copy-pasted the following code:

# generate interactive datashader plot
# (assumed imports: import datashader as ds; import holoviews as hv;
#  import holoviews.operation.datashader as hd; from datashader.colors import Hot)
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']), cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None, yaxis=None, width=900, height=500)

But this seemed broken to me in two ways:

  1. It would do at least two passes over the data when first rendering
  2. Panning and zooming wouldn't trigger additional re-renderings.

@rrpelgrim was this also your experience?

cc'ing @ianthomas23 from the datashader team in case he has suggestions

Ian, we have a notebook that uses datashader to plot the NYC Taxi data (not very original, I know). At first this takes 30s to plot one year's worth of data. In the notebook the students will progress through slimming down memory and persisting data in RAM to get this running in about 1-2s. At this performance I would love to give them an interactive pan/zoom experience. None of us know datashader well enough though to get this running smoothly. I've looked at the docs a bit and I'm afraid that there are a few too many concepts for me to learn this quickly.

Do you have any suggestions on how to take some of our current code, which looks like this:

# assumed imports (not shown in the notebook excerpt):
import datashader
from datashader import transfer_functions as tf
from datashader.colors import Hot

agg = datashader.Canvas().points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count")
)

tf.shade(agg, cmap=Hot, how="eq_hist")

And give it a more interactive experience that recomputes on pan/zoom? Also, if you see other performance opportunities, please speak up.

Cheers, -matt

In the exercise we

avriiil commented 1 year ago

In my experience, this code below:

# generate interactive datashader plot
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']), cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None, yaxis=None, width=900, height=500)

gives me an interactive Bokeh plot which I can pan/zoom. When run on a 100-worker cluster the experience is pretty smooth (a few seconds per render). On a 15-worker cluster the experience is pretty terrible (2-3 minutes per render).

But this seemed broken to me in two ways: It would do at least two passes over the data when first rendering

Yes I saw this too.

Panning and zooming wouldn't trigger additional re-renderings.

I didn't experience this. Panning and zooming does trigger re-renderings for me.

AFAIK, the code below will only ever create static images:

agg = datashader.Canvas().points(
    source=df, 
    x="dropoff_longitude", 
    y="dropoff_latitude", 
    agg=datashader.count("passenger_count")
)

tf.shade(agg, cmap=Hot, how="eq_hist")

@ianthomas23 - if you're by any chance available for a screenshare to sync on this that would be grand, i'm in Portugal and closer to you in terms of timezones :)

avriiil commented 1 year ago

@mrocklin - here's a PR with some minor text edits and comments on the narrative flow: https://github.com/mrocklin/dask-tutorial/pull/4

ianthomas23 commented 1 year ago

Some initial comments:

  1. Yes, there are 2 passes over the data for the first rendering using holoviews or other libraries built on top of it.
  2. Yes, Datashader creates static images only, all UI stuff is in other HoloViz projects that build on top of it.
  3. For interactivity you want to be using https://hvplot.holoviz.org/ rather than holoviews directly. Early on you say it would be great to use .plot() on your DataFrame but you can't; well, you can use .hvplot() on your Dask DataFrame in an almost identical manner. The actual code you want is something like (untested):
    ddf.hvplot.scatter(x="dropoff_longitude", y="dropoff_latitude", aggregator=datashader.count("passenger_count"), datashade=True, cnorm="eq_hist", cmap=Hot)

I can sync on this now @rrpelgrim. You have my work email, or you can use the gmail account associated with my github account.

avriiil commented 1 year ago

Thanks @ianthomas23 - I'm going to try to run your suggested code right now, and then reach out for a live sync if still needed.

mrocklin commented 1 year ago

Thanks both

avriiil commented 1 year ago

@ianthomas23 - your suggested code runs a lot smoother, thanks. still have a few questions about potential performance speedups. have sent an invite for a live sync if that time works for you, otherwise i'm also flexible later today.

avriiil commented 1 year ago

Thanks for the chat @ianthomas23. Summary for visibility @mrocklin: there is room for performance improvements (e.g. spatial indexing) but these are mainly in the codebase and cannot be addressed within the next 2 days. For now this is as good as it's going to get.

ntabris commented 1 year ago

When run on a 100-worker cluster the experience is pretty smooth (a few seconds per render). On a 15-worker cluster the experience is pretty terrible (2-3 minutes per render).

The plan is to give people ~25 workers (4 vCPU each), right?

mrocklin commented 1 year ago

The plan is to give people ~25 workers (4 vCPU each), right?

Not sure yet. I'll want to tune cluster size to make things appropriately painful when things are done poorly, but pleasant when things are done well.

mrocklin commented 1 year ago

@ianthomas23 now that I'm up and active could I also grab a bit of your time? Some questions:

  1. I'm seeing four passes, not two
  2. I'm curious what's going on in between the passes
  3. I'm not getting any update on pan/zoom

Quick demonstration video here: https://www.loom.com/share/d16be675ccfe40fea9fddda5b24cb1ac

ianthomas23 commented 1 year ago

There might be a pass over the data to check the x and y limits, which can be avoided by specifying them via the xlim and ylim kwargs, presumably using the values from the next notebook cell (ref https://hvplot.holoviz.org/user_guide/Customization.html).
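
For concreteness, passing the limits up front might look something like this (untested; the bounds are placeholder NYC values):

import datashader as ds
import hvplot.dask  # registers .hvplot on Dask objects

# xlim/ylim given up front avoid the extra pass that scans the data for its range
ddf.hvplot.scatter(
    x="dropoff_longitude", y="dropoff_latitude",
    aggregator=ds.count("passenger_count"), datashade=True, cnorm="eq_hist",
    xlim=(-74.1, -73.7), ylim=(40.6, 40.9),  # placeholder bounds
)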

I don't know what is going on between the passes; that is happening in hvplot/holoviews, which I am not the resident expert on. We'd need one of my travelling colleagues to explain that.

3 is evidently catastrophic. But it was working for Richard earlier?

mrocklin commented 1 year ago

I don't know. @rrpelgrim ?

@ianthomas23 if you thought you might be able to diagnose 3 live I'd suggest a live call (if it's not too much of an imposition). If that's not the case though then I could pass.

mrocklin commented 1 year ago

OK, I've got interactivity working (I just reinstalled everything).

Three new questions:

  1. To avoid the first two computations, can I specify an initial x/y_range? Also, the data is wonky and I'd like to avoid students having to zoom in the first time.
  2. More importantly, can I force a square aspect ratio? I tried doing this with aspect="square" but I'm not sure that's right, and it's complaining because I've also set height and width.
  3. I'd like to overlay both the pickup and dropoff points, probably with Hot and Cold colormaps. Suggestions on how to make this look good? (I suspect that this is something that you all do often.)
ianthomas23 commented 1 year ago

Answers:

  1. Yes. Although I've been told there is an open issue (https://github.com/holoviz/holoviews/issues/5237) which could mean there is a bug causing the data to be scanned more times than is strictly necessary.
  2. Set either width or height, plus aspect which is your data aspect ratio (xmax-xmin)/(ymax-ymin).
  3. You can do two separate datashades and overlay them, but this isn't good as you'll have to use some transparency for the second one so the first one shows through. The canonical way is to set up your dataframe to have columns x, y, and a categorical column called, say, type (poor name, I know!) which is presumably either "pickup" or "dropoff". This new dataframe has twice the number of rows of the original dataframe (if there is no missing data); a sketch of this reshaping is shown below, after this list. Then you do a categorical datashade in which you specify a color per category; datashader does the appropriate color mixing based on counts in each pixel, then applies alpha based on the distribution of counts across the plot. This sounds more complicated than it needs to be, but the output makes a lot of sense when viewed through a human eyeball. Your hvplot code would be something like (untested):
    color_key = {'pickup': 'red', 'dropoff': 'blue'}
    ddf.hvplot.scatter(x="x", y="y", aggregator=ds.by("type"), datashade=True, cnorm="eq_hist",
                       width=400, aspect=1.23, xlim=(133, 456), ylim=(345, 678), color_key=color_key)

    but use the correct width, aspect, xlim and ylim. And pure red and blue are poor color choices; better choices (from https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=3) are "#e41a1c" and "#377eb8".
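
A rough sketch of the reshaping described in point 3, assuming the pickup/dropoff column names used earlier in the thread (untested):

import dask.dataframe as dd

# stack pickups and dropoffs into one frame sharing x/y plus a "type" column
pickups = ddf[["pickup_longitude", "pickup_latitude"]].rename(
    columns={"pickup_longitude": "x", "pickup_latitude": "y"}
)
pickups["type"] = "pickup"

dropoffs = ddf[["dropoff_longitude", "dropoff_latitude"]].rename(
    columns={"dropoff_longitude": "x", "dropoff_latitude": "y"}
)
dropoffs["type"] = "dropoff"

# twice the rows of the original; ds.by("type") needs a known categorical
points = dd.concat([pickups, dropoffs])
points["type"] = points["type"].astype("category").cat.as_known()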

(Edited to fix typos)

ianthomas23 commented 1 year ago

If you're going to go for the categorical datashade route, you would ideally preprocess your data into the correct format beforehand rather than do the conversion live in the tutorial.

mrocklin commented 1 year ago

Richard, I think that trying this is worth about an hour of your time. I don't think we need to preprocess and store. I think that we can probably do this in a few lines, persist, and give the students a nicer exploratory experience at the end.

Thoughts?


mrocklin commented 1 year ago

Ian, thank you for continuing to engage here. I appreciate it.


ianthomas23 commented 1 year ago

Ian, thank you for continuing to engage here. I appreciate it.

I'm happy to help. Actually I'm being selfish: this is an example that I'd like to use myself, and indeed extend to include use of GPUs for the datashader processing. We should keep a communication channel open to combine our dask and visualisation skills to produce really good examples, perhaps with a longer lead time in future :slightly_smiling_face:

avriiil commented 1 year ago

Thanks for your help here @ianthomas23, I got this to work.

@mrocklin - code is over in https://github.com/rrpelgrim/dask-tutorial-mrocklin/blob/main/2-dataframes-at-scale.ipynb under the "More detail please" header. LMK if this is looking like what you had in mind.

avriiil commented 1 year ago

We should keep a communication channel open to combine our dask and visualisation skills to produce really good examples, perhaps with a longer lead time in future

Absolutely. Is there anything you'd need from us to extend this existing example to use GPUs? An example like that would make for a fun blog post, I think :)

ianthomas23 commented 1 year ago

Absolutely. Is there anything you'd need from us to extend this existing example to use GPUs? An example like that would make for a fun blog post, I think :)

All I need is time, but unfortunately that is not transferable!

I have a couple of comments about the latest code in the final cell:

  1. Change width=600 to frame_width=700 to get rid of the warning. The number is larger to allow space for the axis labels and so on.
  2. The aspect should be the change in longitude over the change in latitude. Given the longitude and latitude bounds you are using this should be 0.4/0.3, hence aspect=1.33. (Both tweaks are shown in the sketch below.)
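
Applied to the hvplot call, those two tweaks might look like this (untested; the xlim/ylim values are placeholders for a 0.4 by 0.3 degree window, and points is the stacked pickup/dropoff frame sketched earlier):

# frame_width sizes the plot frame itself; axes and labels sit outside it
points.hvplot.scatter(
    x="x", y="y", aggregator=ds.by("type"), datashade=True, cnorm="eq_hist",
    frame_width=700, aspect=0.4 / 0.3,  # longitude span / latitude span
    xlim=(-74.05, -73.65), ylim=(40.6, 40.9),  # placeholder bounds
    color_key={"pickup": "#e41a1c", "dropoff": "#377eb8"},
)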

Do you have a screenshot of the initial image produced by the final cell? It is worth a look to see if my somewhat arbitrary choice of colors is good or not.

avriiil commented 1 year ago

thanks @ianthomas23, adjusting now - here's a screenshot:

[Screenshot of the rendered pickup/dropoff plot, 2022-11-09 12:21]
ianthomas23 commented 1 year ago

Perfect, that is both informative and beautiful!

mrocklin commented 1 year ago

That does look beautiful. I'm looking forward to playing with it. I'll probably be dark on this issue until later this afternoon US time. Thank you for your work here.

mrocklin commented 1 year ago

Thanks @rrpelgrim, I've pushed your work up to my branch at the end. I won't turn this into an exercise; it'll just be something fun to play with at the end.

mrocklin commented 1 year ago

Running into this, which is a bit odd: https://github.com/dask/distributed/issues/7289