Open mrocklin opened 1 year ago
@mrocklin - here's the stripped down version of the notebook, i'm in meetings the rest of my day so I'll let you take it from here. Happy to pick things up again when I wake up. https://github.com/rrpelgrim/dask-tutorial-mrocklin/blob/main/datashader-basic.ipynb
Thanks @rrpelgrim, playing with it now. I'm able to get this down to about 2s so far. I don't have the interactive stuff working though; that code seems a bit strange currently. I'll report back in a while.
Pushed up to https://github.com/mrocklin/dask-tutorial/blob/main/2-dataframes-at-scale.ipynb
It doesn't flow pedagogically yet, and I had to get rid of interaction (though I hope to get it back after talking to the PyViz folks), but everything computes close to where it should. We can process the large 800M row dataset in about three seconds. If I can get interaction in there well, then that should leave the students with a positive experience.
Interactivity was weird. It resulted in far more computation, lots of slowness, and yet not actually the ability to zoom around and see things re-render. It was like the worst of both worlds. I'm sure that there is a nicer way to go about this.
If anyone has a chance to try out the notebook please do.
OK, the next thing I want, I think, is interactivity. In the previous notebook I think we copy-pasted the following code:

```python
# generate interactive datashader plot
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']), cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None, yaxis=None, width=900, height=500)
```
But this seemed broken to me in two ways:
@rrpelgrim was this also your experience?
cc'ing @ianthomas23 from the datashader project in case he has suggestions
Ian, we have a notebook that uses datashader to plot the NYC Taxi data (not very original, I know). At first this takes 30s to plot one year's worth of data. In the notebook the students will progress through slimming down memory and persisting data in RAM to get this running in about ~1-2s. At this performance I would love to give them an interactive pan/zoom experience. None of us know datashader well enough though to get this running smoothly. I've looked at the docs a bit and I'm afraid that there are a few too many concepts for me to learn this quickly.
Do you have any suggestions on how to take some of our current code, which looks like this:

```python
agg = datashader.Canvas().points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count"),
)
tf.shade(agg, cmap=Hot, how="eq_hist")
```

And give it a more interactive recompute-on-pan/zoom experience? Also, if there are other performance opportunities that you see, please speak up.
Cheers, -matt
In my experience, the code below:

```python
# generate interactive datashader plot
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']), cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None, yaxis=None, width=900, height=500)
```

gives me an interactive Bokeh plot which I can pan/zoom. When run on a 100-worker cluster the experience is pretty smooth (a few seconds per render). On a 15-worker cluster the experience is pretty terrible (2-3 minutes per render).
> But this seemed broken to me in two ways: It would do at least two passes over the data when first rendering

Yes, I saw this too.

> Panning and zooming wouldn't trigger additional re-renderings.

I didn't experience this. Panning and zooming does trigger re-renderings for me.
AFAIK, the code below will only ever create static images:

```python
agg = datashader.Canvas().points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count"),
)
tf.shade(agg, cmap=Hot, how="eq_hist")
```
@ianthomas23 - if you're by any chance available for a screenshare to sync on this that would be grand. I'm in Portugal and closer to you in terms of timezones :)
@mrocklin - here's a PR with some minor text edits and comments on the narrative flow: https://github.com/mrocklin/dask-tutorial/pull/4
Some initial comments:
You'd like to use `.plot()` on your DataFrame but you can't; however, you can use `.hvplot()` on your Dask DataFrame in an almost identical manner. The actual code you want is something like (untested):

```python
ddf.hvplot.scatter(
    x="dropoff_longitude",
    y="dropoff_latitude",
    aggregator=datashader.count("passenger_count"),
    datashade=True,
    cnorm="eq_hist",
    cmap=Hot,
)
```
I can sync on this now @rrpelgrim. You have my work email, or you can use the gmail account associated with my github account.
Thanks @ianthomas23 - I'm going to try to run your suggested code right now, and then reach out for a live sync if still needed.
Thanks both
@ianthomas23 - your suggested code runs a lot smoother, thanks. Still have a few questions about potential performance speedups. Have sent an invite for a live sync if that time works for you; otherwise I'm also flexible later today.
Thanks for the chat @ianthomas23. Summary for visibility @mrocklin: there is room for performance improvements (e.g. spatial indexing) but these are mainly in the codebase and cannot be addressed within the next 2 days. For now this is as good as it's going to get.
> When run on a 100-worker cluster the experience is pretty smooth (a few seconds per render). On a 15-worker cluster the experience is pretty terrible (2-3 minutes per render).
The plan is to give people ~25 workers (4 vCPU each), right?
> The plan is to give people ~25 workers (4 vCPU each), right?
Not sure yet. I'll want to tune cluster size to make things appropriately painful when things are done poorly, but pleasant when things are done well.
@ianthomas23 now that I'm up and active could I also grab a bit of your time? Some questions:
Quick demonstration video here: https://www.loom.com/share/d16be675ccfe40fea9fddda5b24cb1ac
There might be a pass over the data to check the x and y limits, which can be avoided by specifying them using the kwargs `xlim` and `ylim`, presumably with the values from the next notebook cell (ref https://hvplot.holoviz.org/user_guide/Customization.html).
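Computing the limits once up front, as Ian suggests, could be sketched like this (a minimal sketch with made-up sample values; in the tutorial the limits would come from the real taxi data instead):

```python
import pandas as pd

# Hypothetical sample of the taxi data; in the tutorial this would be the
# (persisted) Dask DataFrame ddf.
df = pd.DataFrame({
    "dropoff_longitude": [-74.0, -73.9, -73.8],
    "dropoff_latitude": [40.6, 40.7, 40.8],
})

# Compute the axis limits once so the plotting machinery does not need an
# extra pass over the data just to discover them.
xlim = (df["dropoff_longitude"].min(), df["dropoff_longitude"].max())
ylim = (df["dropoff_latitude"].min(), df["dropoff_latitude"].max())

# These tuples would then be passed straight to hvplot, e.g.
# df.hvplot.scatter(x="dropoff_longitude", y="dropoff_latitude",
#                   datashade=True, xlim=xlim, ylim=ylim)
```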
I don't know what is going on between the passes, that is occurring in hvplot/holoviews which I am not the resident expert on. We'd need one of my travelling colleagues to explain that.
3 is evidently catastrophic. But it was working for Richard earlier?
I don't know. @rrpelgrim ?
@ianthomas23 if you thought you might be able to diagnose 3 live, I'd suggest a live call (if it's not too much of an imposition). If that's not the case though, then I could pass.
OK, I've got interactivity working (I just reinstalled everything).
Three new questions:
I've set `aspect="square"` but I'm not sure that this is right, and it's complaining because I've also set height and width.

Answers:
Use `width` or `height`, plus `aspect`, which is your data aspect ratio `(xmax-xmin)/(ymax-ymin)`.
You need a dataframe with columns `x`, `y`, and a categorical column called, say, `type` (poor name, I know!), whose value is either `"pickup"` or `"dropoff"`. This new dataframe has twice the number of rows of the original dataframe (if no missing data). Then you do a categorical datashade in which you specify a color per category, and datashader does the appropriate color mixing based on counts in each pixel, then applies alpha based on the distribution of counts across the plot. This sounds more complicated than it is; the output makes a lot of sense when viewed through a human eyeball. Your hvplot code would be something like (untested):

```python
color_key = {'pickup': 'red', 'dropoff': 'blue'}
ddf.hvplot.scatter(x="x", y="y", aggregator=ds.by("type"), datashade=True, cnorm="eq_hist",
                   width=400, aspect=1.23, xlim=(133, 456), ylim=(345, 678), color_key=color_key)
```
but use the correct `width`, `aspect`, `xlim` and `ylim`. And pure red and blue are poor color choices; better choices (from https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=3) are `"#e41a1c"` and `"#377eb8"`.
(Edited to fix typos)
If you're going to go for the categorical datashade route, you would ideally preprocess your data into the correct format beforehand rather than do the conversion live in the tutorial.
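The reshaping Ian describes could be sketched in pandas roughly like this (column names and sample values are made up; on the real data you'd do the same with `dask.dataframe` and persist the result):

```python
import pandas as pd

# Hypothetical wide-format taxi data with separate pickup/dropoff columns.
df = pd.DataFrame({
    "pickup_longitude": [-73.99, -73.97],
    "pickup_latitude": [40.75, 40.76],
    "dropoff_longitude": [-73.95, -73.93],
    "dropoff_latitude": [40.78, 40.80],
})

# Reshape to long format: one (x, y, type) row per pickup and per dropoff,
# doubling the row count.
pickups = df[["pickup_longitude", "pickup_latitude"]].rename(
    columns={"pickup_longitude": "x", "pickup_latitude": "y"})
pickups["type"] = "pickup"

dropoffs = df[["dropoff_longitude", "dropoff_latitude"]].rename(
    columns={"dropoff_longitude": "x", "dropoff_latitude": "y"})
dropoffs["type"] = "dropoff"

points = pd.concat([pickups, dropoffs], ignore_index=True)
# datashader's categorical aggregation (ds.by) expects a categorical dtype.
points["type"] = points["type"].astype("category")
```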
Richard, I think that trying this is worth about an hour of your time. I don't think we need to pre process and store. I think that we can probably do this in a few lines at the end, persist, and give the students a nicer exploratory experience at the end.
Thoughts?
Ian, thank you for continuing to engage here. I appreciate it.
I'm happy to help. Actually I'm being selfish, this is an example that I'd like to use myself and indeed extend to include use of GPUs for the datashader processing. We should keep a communication channel open to combine our dask and visualisation skills to produce really good examples, perhaps with a longer lead time in future :slightly_smiling_face:
Thanks for your help here @ianthomas23, I got this to work.
@mrocklin - code is over in https://github.com/rrpelgrim/dask-tutorial-mrocklin/blob/main/2-dataframes-at-scale.ipynb under the "More detail please" header. LMK if this is looking like what you had in mind.
> We should keep a communication channel open to combine our dask and visualisation skills to produce really good examples, perhaps with a longer lead time in future
Absolutely. Is there anything you'd need from us to extend this existing example to use GPUs? An example like that would make for a fun blog post, I think :)
> Absolutely. Is there anything you'd need from us to extend this existing example to use GPUs? An example like that would make for a fun blog post, I think :)
All I need is time, but unfortunately that is not transferable!
I have a couple of comments about the latest code in the final cell:

1. Change `width=600` to `frame_width=700` to get rid of the warning. The number is larger to allow space for the axis labels and so on.
2. `aspect` should be change in longitude over change in latitude. Given the longitude and latitude bounds you are using this should be `0.4/0.3`, hence `aspect=1.33`.

Do you have a screenshot of the initial image produced by the final cell? It is worth a look to see if my somewhat arbitrary choice of colors is good or not.
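For reference, the aspect computation is just longitude span over latitude span (the bounds below are assumptions, chosen to match the 0.4 by 0.3 degree window mentioned above):

```python
# Assumed NYC bounding box: 0.4 degrees of longitude, 0.3 degrees of latitude.
lon_min, lon_max = -74.2, -73.8
lat_min, lat_max = 40.6, 40.9

# aspect = change in longitude / change in latitude ~= 1.33
aspect = (lon_max - lon_min) / (lat_max - lat_min)
```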
thanks @ianthomas23, adjusting now - here's a screenshot:
Perfect, that is both informative and beautiful!
That does look beautiful. I'm looking forward to playing with it. I'll probably be dark on this issue until later this afternoon US time. Thank you for your work here.
Thanks @rrpelgrim I've pushed your work up to my branch at the end. I won't turn this into an exercise. It'll just be something fun to play with at the end.
Running into this, which is a bit odd: https://github.com/dask/distributed/issues/7289
@rrpelgrim and I just went through things. I recommend the following narrative:
Let's load some data and plot it (maybe non-interactive at first)
This takes a while. If you want to take a look check out the profile tab in the dashboard (but we won't explain it much here in the interests of time).