added python code to analyze cache_size.cc output

negara commented 3 years ago

Works both on Colab and locally

ssbr commented 3 years ago

Man, GitHub is not making code review possible here. I guess I need to leave all my comments in a top-level reply.

is_colab = True #if the code is run on Colab

Rather than this, I'd do a try-except import. So in cell 2, instead of if is_colab: ..., do:

try:
  from google.colab import files
except ImportError:
  # not colab
  df = pd.read_csv(file_path, header=None)
else:
  # colab
  uploaded = files.upload()
  df = pd.read_csv(io.StringIO(uploaded[file_name].decode('utf-8')), header=None)

  df = pd.read_csv(file_path, header = None)

I believe this should be df = pd.read_csv(os.path.join(file_path, file_name), header = None) (and an import os should be added to the top).
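
Putting the two suggestions together, a minimal sketch might look like the following. The helper name open_results is made up for illustration and is not part of the PR. It assumes file_path is a directory and file_name is the CSV inside it, as in the notebook.

```python
import io
import os


def open_results(file_path, file_name):
    """Return a text handle for the results CSV, on Colab or locally.

    Hypothetical helper for illustration; not part of the PR.
    """
    try:
        from google.colab import files
    except ImportError:
        # Not on Colab: read the file from the local directory.
        return open(os.path.join(file_path, file_name), encoding='utf-8')
    else:
        # On Colab: ask the user to upload the file, then wrap the bytes.
        uploaded = files.upload()
        return io.StringIO(uploaded[file_name].decode('utf-8'))
```

The handle can then be passed straight to pandas, e.g. df = pd.read_csv(open_results(file_path, file_name), header=None), so the read_csv call is written only once.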

#plot of every iteration (x: sizes, y: time)

To my surprise, this looks like by far the most interesting graph. Here's what it looks like when I rerun everything on my machine and use a log-log plot (instead of semilog):

[image: loglog]

I expected to see it more clearly with the heat plot, but I guess scatter plots are better after all (eep) -- the effect I think we were hypothesizing was that we'd always see some slow accesses, due to noise, accidental eviction, etc., but never see fast accesses once we exceed the cache size. And we can very clearly see that effect here after all! Once it reaches (illegible), there are zero fast cache accesses.
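
One way to put a number on that effect is to look only at the fastest access seen at each size: noise can make any access slow, but only a cache hit can make it fast. A rough sketch, assuming the samples are (size, time) pairs. The function names, the threshold, and the data are all invented for illustration.

```python
from collections import defaultdict


def fastest_per_size(samples):
    """Map each buffer size to the smallest access time observed at that size."""
    best = defaultdict(lambda: float('inf'))
    for size, time in samples:
        best[size] = min(best[size], time)
    return dict(best)


def first_size_without_fast_access(samples, fast_threshold):
    """Return the smallest size whose fastest access is above fast_threshold.

    Returns None if every size has at least one fast access.
    """
    best = fastest_per_size(samples)
    for size in sorted(best):
        if best[size] > fast_threshold:
            return size
    return None


# Made-up samples: (size in bytes, access time in arbitrary units).
samples = [(1024, 3), (1024, 40), (2048, 4), (2048, 35), (4096, 30), (4096, 42)]
print(first_size_without_fast_access(samples, fast_threshold=10))
```

With real data the threshold would have to be picked by eye from the scatter plot, so this is only a cross-check on what the plot already shows.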

So this might actually be the most useful plot of the bunch. Can you change it to log-log though, per screenshot? Like this:

ax1.set_xscale('log', base=2)  # spelled basex=2 before matplotlib 3.3
ax1.set_yscale('log')

x = means.keys()
y = means.values()

This doesn't work on my machine, I guess due to the version of matplotlib I have. Can you change it to x = list(means.keys()), and likewise for y and for the medians?
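
For reference, dict.keys() and dict.values() return view objects in Python 3 rather than lists, which is presumably what trips up some matplotlib versions. A small sketch with made-up numbers (the contents of means here are placeholders, not real measurements):

```python
# Placeholder data: size -> mean access time.
means = {1024: 3.5, 2048: 4.0, 4096: 31.0}

# Views are not sequences, so they cannot be indexed.
keys_view = means.keys()
try:
    keys_view[0]
except TypeError:
    print('dict views do not support indexing')

# Materialising the views gives plain lists, in insertion order.
x = list(means.keys())
y = list(means.values())
print(x, y)
```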

btw, I played around with log x and log-log for the final set of graphs and wasn't happy with either. They never seemed to tell the same story as the scatterplot.

negara commented 3 years ago

@ssbr so does it work with list(means.keys())?

ssbr commented 3 years ago

Yes, it works with list(means.keys()).

BTW, looking more at the heatmap code, I definitely don't understand it. :( I expected the heatmaps to look like the scatterplot, just with increasing heat instead of dots, so I should see a similar effect there. But even after trivial manipulations they still look wrong.

negara commented 3 years ago

My understanding is that heatmaps are supposed to show the number of points that fall in each bin, so it should be very different from the scatter plot, which plots individual data points. Hmm?

ssbr commented 3 years ago

My understanding is that heatmaps are supposed to show the number of points that fall in each bin, so it should be very different from the scatter plot, which plots individual data points. Hmm?

Talked about this in person, but to write it down: the issue I'm having is that there are hot spots in the heat plot where it's empty in the scatter plot, which shouldn't be possible. Maybe I'm manipulating the charts wrong, but this is fishy.
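
A way to check this independently of the plotting code is to bin the scatter points by hand and compare the result against the heat plot: a bin can only be hot if at least one point lands in it. A sketch in plain Python, where the function name, bin edges, and points are all invented for illustration:

```python
from bisect import bisect_right
from collections import Counter


def bin_counts(points, x_edges, y_edges):
    """Count points per (x_bin, y_bin).

    Bin k covers the half-open range [edges[k], edges[k + 1]).
    Points outside the edges are dropped.
    """
    counts = Counter()
    for x, y in points:
        i = bisect_right(x_edges, x) - 1
        j = bisect_right(y_edges, y) - 1
        if 0 <= i < len(x_edges) - 1 and 0 <= j < len(y_edges) - 1:
            counts[(i, j)] += 1
    return counts


# Made-up points: (size in bytes, access time in arbitrary units).
points = [(1024, 3), (1024, 40), (4096, 30), (4096, 42)]
x_edges = [1024, 2048, 4096, 8192]
y_edges = [1, 10, 100]

counts = bin_counts(points, x_edges, y_edges)
print(sorted(counts.items()))
```

If the heat plot shows heat in a bin where counts is zero, one thing worth ruling out is that the axes got swapped or flipped somewhere between the binning and the drawing step.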

google / safeside

added python code to analyze cache_size.cc output #146