keroro824 / HashingDeepLearning

Codebase for "SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems"

Code vs Paper #24

Closed antonyscerri closed 4 years ago

antonyscerri commented 4 years ago

Hi

I've been attempting to run the code and compare against the plots in your paper, but things don't appear to match up at first. I've seen the previous issue mentioning the different batch size for the Amazon-670K dataset between the committed code and what was actually run. I'm also curious about the logging steps and the number of batches used to calculate the accuracy, between the plots in Figure 5 and the current code. Running the TF GPU code, the accuracy is reported every 50 batches, and it reaches a top value of around 0.6 compared to the 0.3 in the paper. This is based on the log file, but I want to check whether you used a different sample batch count for calculating the accuracy than the current code does (20 batches every 50 steps in python_examples/example_full_softmax.py).
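For reference, here is a minimal sketch of the evaluation scheme as I understand it, i.e. estimating P@1 on only the first 20 validation batches every 50 steps. The constants and `predict_fn` are my own illustrative assumptions, not the repo's actual code:

```python
EVAL_EVERY = 50      # training steps between evaluations
N_EVAL_BATCHES = 20  # validation batches sampled per evaluation

def periodic_eval(step, eval_batches, predict_fn):
    """Estimate P@1 on a small, fixed sample of validation batches.

    eval_batches: list of (features, label_lists) tuples
    predict_fn:   maps a batch of features to the top-1 class per example
    """
    if step % EVAL_EVERY != 0:
        return None
    correct, total = 0, 0
    for features, label_lists in eval_batches[:N_EVAL_BATCHES]:
        top1 = predict_fn(features)
        # P@1: the top prediction counts as correct if it appears
        # among the example's true labels
        correct += sum(int(p in lbls) for p, lbls in zip(top1, label_lists))
        total += len(top1)
    return correct / total
```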

It would be nice to have a setup that can reproduce your runs, to ensure an adequate baseline. The wall clock times for a run of the TF GPU code came out much slower (>8 hours, reporting ~22s every 50 steps using the latest code from the repository) than what I believe is shown in the figures. You mention a few different run times, but it's not always clear exactly which configuration was used. I've been sticking to the full (not sampled) TF code on GPU so far, and I have yet to compare on the same hardware with TF CPU or SLIDE, as I am seeing such a difference in run time and accuracy using the code as-is.

In the meantime I've been experimenting with tuning the GPU version, as its GPU utilization was low and it spent quite a bit of time waiting on data. Overall I've achieved close to a 3x improvement. You may have intentionally kept the code as similar as possible between SLIDE and TF, but that does not necessarily take advantage of a GPU's raw performance. Could you comment on whether any of this was deliberate, to offer some degree of equivalent comparison in your view? Again, raw run-time performance may not be the ultimate goal, but understanding this and how to reproduce it is nonetheless helpful.

I did attempt to run the TF sampled softmax; however, it produces the error "Input to reshape is a tensor with 85771648 values, but the requested shape requires a multiple of 100". This was with TF 1.8 (for everything else I've been using TF 1.15 and TF 2.1.0).

Thanks

Tony

Tharun24 commented 4 years ago

Thank you for working on our code. The TF version prints the accuracy for only the first 20 batches once every 50 steps. The overall accuracy is printed only once an entire epoch is done; please check for the 'Overall p_at_1 after' line in the logfile to get the correct P@1. Also, we tried keeping things as fair as possible between SLIDE and TF. The only evident compromise with TF-GPU is that the data loader loads batches and feeds them to the GPU, which causes a slowdown. We could have used data streaming like TFRecords, but since SLIDE doesn't have that facility, we omitted it. We used sparse tensor support for the input, which by itself gives TF a 2x speedup (compared to a full matrix multiplication). Nevertheless, we'll try to optimize TF GPU to make better use of GPUs, and we'll fix the errors in the sampled softmax file.
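For illustration, the sparse input path looks roughly like the following (a minimal TF 1.x sketch; the layer dimensions are illustrative, not the exact values from our examples):

```python
import tensorflow as tf  # TF 1.x API, matching the repo's python_examples

INPUT_DIM, HIDDEN_DIM = 135909, 128  # illustrative Amazon-670K sizes

# Sparse input: only the non-zero feature (index, value) pairs are fed in
x_sp = tf.sparse_placeholder(tf.float32, shape=[None, INPUT_DIM])
W1 = tf.get_variable("W1", [INPUT_DIM, HIDDEN_DIM])
b1 = tf.get_variable("b1", [HIDDEN_DIM])

# The sparse-dense matmul only touches the non-zero input entries, which
# is where the ~2x speedup over a dense multiply comes from on this data
hidden = tf.nn.relu(tf.sparse_tensor_dense_matmul(x_sp, W1) + b1)
```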

antonyscerri commented 4 years ago

Hi

Thanks for confirming the code setup is as expected. I understand the logging every 50 steps and at the end of the epoch, and I have plotted both from a complete run of example_full_softmax.py. I have attached two images: the first is based on the per-50-step entries, which looks similar in shape to those in Figure 5, except that the accuracy is higher, which as we thought is due to only a small sample of batches being used. The y axis is the accuracy and the x axis is time, on a log scale as per your figure. There are two runs in each, one with a 128 batch size and the other with a 256 batch size, both using the Amazon-670K dataset. These runs were done with a V100 16GB.

[attached plot: P@1 vs. wall-clock time (log scale), batch sizes 128 and 256]
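For reference, a plot along these lines can be produced with something like the following (a sketch; the `runs` structure is my own illustrative assumption):

```python
import matplotlib.pyplot as plt

def plot_runs(runs):
    """runs: dict like {"batch=128": (times_s, p_at_1), "batch=256": ...}"""
    for label, (times, acc) in runs.items():
        plt.plot(times, acc, label=label)
    plt.xscale("log")                 # log-scale time axis, as in Figure 5
    plt.xlabel("time (s)")
    plt.ylabel("P@1 (20-batch sample)")
    plt.legend()
    plt.show()
```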

I did also plot the per-epoch accuracy, which produced the following plot, where the shape is very different as it's only 10 epochs before it converges (one data point each), hence my confusion about how to reproduce your figure.

[attached plot: per-epoch P@1 vs. wall-clock time]

Thanks

Tony

keroro824 commented 4 years ago

Hi Tony,

Thanks for your interest in reproducing the TF curves.

  1. The comparison between your first and second curves shows that the true accuracy of the model matches what we reported.
  2. If you shuffle the testing data, it will be close to what you see in the first plot. We plot every 50 iterations, not only at several epochs (note the x-axis is on a log scale, and insufficient data points can change the shape of the plot; that is also why your first and second plots look different).

Please let us know!

antonyscerri commented 4 years ago

@keroro824 thanks for following up. The previous response from @Tharun24 matched what I believed the code to be doing. I included two plots: the first based on the accuracy reported every 50 iterations from the sampled 20 batches of the validation set (as your code does), and the second based on the per-epoch accuracy (over the full validation set). The first shows a similar curve to those in Figure 5 in the paper, however my accuracy reaches a much higher value; the second plot shows the per-epoch accuracy, which is a much smoother (and different-looking) curve but does reflect the ~0.3 accuracy. So for the figure in your paper, did you run a full evaluation across the entire validation set every 50 iterations or so? I'm just trying to determine where the difference is between the paper, what the code might be doing, and what I'm seeing when I run it.

Tharun24 commented 4 years ago

In the paper, we evaluate with 20 batches of SHUFFLED testing data after every 50 iterations as the validation. When we finish an epoch, we use the precision over the full dataset. Shuffle the testing data and plot like the first plot above, and you'll get the paper's result. Thanks for raising this; we will add a note to the README reminding people to shuffle the testing data to prevent bias, because the original data is sorted by labels!
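For example, shuffling the test set once before batching can be done with something like this (an illustrative sketch, not code from the repo):

```python
import numpy as np

def shuffle_test_set(features, labels, seed=0):
    """Shuffle features and labels in unison before batching, so a
    small sample of batches is not biased by the label-sorted order
    of the original file."""
    perm = np.random.RandomState(seed).permutation(len(features))
    return [features[i] for i in perm], [labels[i] for i in perm]
```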

antonyscerri commented 4 years ago

Thanks for clarifying the use of shuffled batches in your figures, and sorry for the slow response, I didn't see the notification. This would make sense given our observations; I've made the changes to the code and am doing another run to confirm. The peak accuracy is 0.36 so far, and hopefully it will finish shortly. Can you comment on whether you used shuffling during training, as none of the code appears to do that either? Thanks again

antonyscerri commented 4 years ago

I can confirm that the plot with shuffling is more similar. The overall accuracy being reported still goes a bit higher, so I suspect a different number of batches was used for measurement. The figure in your paper looks like you reduced the granularity of the data points by using a bigger interval than every 50 training steps (which I've simulated below using every 400 steps). This is a plot against time (s) with a log scale, like your Figure 5.

[attached plot: P@1 vs. time (s), log scale, one data point every 400 steps]
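The downsampling itself is trivial; for example (an illustrative sketch):

```python
def downsample(points, log_every=50, plot_every=400):
    """Keep one logged (elapsed_seconds, p_at_1) point per plot_every
    training steps, given one log entry per log_every steps."""
    stride = plot_every // log_every  # every 8th entry here
    return points[::stride]
```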

antonyscerri commented 4 years ago

@Tharun24 could you tell me, for Figure 5 in the paper, whether the TF GPU and TF CPU runs used the example_full_softmax.py or example_sampled_softmax.py code? Also, is there any timeline for getting a fixed version of the sampled code committed (unless it's not really being used, in which case that may be a moot point)?

Also, the run times mentioned don't appear to match up throughout the paper: the abstract mentions 1 and 3.5 hours, and later you mention 2 vs 5.5 hours. Looking at the plots, the times don't appear to correspond either (going by the end point of each series as possible data points). It is also unclear whether the wall-clock time you mention covers just the training time or the end-to-end run time.

Tharun24 commented 4 years ago

@antonyscerri

  1. I've fixed example_sampled_softmax.py in the latest version; please give it a try. We need to preset the maximum number of true labels per data point for sampled softmax. Hence, in config.py, you can set max_label anywhere from 1 to 5 for Amazon-670K. For data points with fewer than max_label true labels, we pad with a dummy label (see the sketch after this list).
  2. As for the question about Figure 5 in the paper: we used full softmax (Figure 7 discusses sampled softmax separately).
  3. As for the question about the times being inconsistent: without HugePages we get a 2.7x speedup (hence 5.5 hrs vs 2 hrs), and with HugePages we get a 3.5x speedup (hence 3.5 hrs vs 1 hr). Sorry if our statements were confusing.
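The padding scheme looks roughly like this (an illustrative TF 1.x sketch; the class and sample counts are assumptions, not the exact values from config.py):

```python
import tensorflow as tf  # TF 1.x API, as in the repo's python_examples

NUM_CLASSES = 670091  # Amazon-670K label count (illustrative)
NUM_SAMPLED = 20      # sampled negatives per batch (assumed value)
MAX_LABEL = 5         # config.py max_label, anywhere from 1 to 5

def pad_labels(label_lists, dummy_id=NUM_CLASSES):
    """Pad (or truncate) each example's label list to exactly MAX_LABEL ids."""
    return [(lbls + [dummy_id] * MAX_LABEL)[:MAX_LABEL] for lbls in label_lists]

def sampled_loss(out_weights, out_biases, hidden, padded_labels):
    # sampled_softmax_loss requires a fixed-shape [batch, num_true]
    # label tensor, which is why variable-length labels must be padded
    return tf.nn.sampled_softmax_loss(
        weights=out_weights,          # [NUM_CLASSES + 1, hidden_dim]
        biases=out_biases,            # [NUM_CLASSES + 1]
        labels=padded_labels,         # int64, shape [batch, MAX_LABEL]
        inputs=hidden,                # [batch, hidden_dim]
        num_sampled=NUM_SAMPLED,
        num_classes=NUM_CLASSES + 1,  # +1 for the dummy padding class
        num_true=MAX_LABEL)
```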

pennywise86hw commented 4 years ago

@Tharun24 Thanks for all the clarifications w.r.t. the TF baseline numbers! I have a separate question on the SLIDE configuration, though. When using the SLIDE code as-is, I am getting just about 0.28 accuracy after 10 epochs (K = 8 and batch size 256, as suggested in the paper), while the TF baseline gives me almost 0.35. Is there any difference in SLIDE hyper-parameters between what you used in the paper and the source code posted on GitHub?

tair1 commented 4 years ago

Hi,

This has been a helpful thread to reproduce the results shared in the paper.

I have one question regarding the TF-GPU runs. Using the same reported settings, shuffled datasets, and the same GPU (V100, 32GB), I am getting numbers comparable to SLIDE, both in performance and accuracy. This happens for both the sampled and the full softmax (although the sampled version is a bit faster).

I am using TF 1.15.2 and CUDA 10.2 with the latest cuDNN 7.6.4.30. Which version of cuDNN did you use for the paper?

Thanks, Miguel