jostmey / NakedTensor

Bare bone examples of machine learning in TensorFlow

question on bigdata.py #14

Open drzhouq opened 7 years ago

drzhouq commented 7 years ago

Thanks for setting this up. I am wondering about "bigdata.py". It appears to me that the code does not use all the data from the "big data" population and only samples 8 points at a time. That is no different from "tensor.py", which just uses the same 8 points over and over. Can you elaborate? Thanks.

jostmey commented 7 years ago

Lines 57 through 68 randomly sample the large dataset using NumPy code. Only 8 datapoints are loaded at each step of gradient descent. However, with each pass over the for-loop, another 8 datapoints are randomly sampled.
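
Roughly, the sampling step works like the sketch below (not the exact code from bigdata.py; the array names xs/ys, the data-generating line, and the use of np.random.randint are all illustrative):

```python
import numpy as np

_BATCH = 8  # 8 datapoints per gradient-descent step, as described above

# Stand-ins for the full "big data" arrays generated earlier in the script:
xs = np.random.uniform(0.0, 10.0, 8000000).astype(np.float32)
ys = (2.0 * xs + 1.0).astype(np.float32)

# Each pass of the training loop draws a fresh random mini-batch:
indices = np.random.randint(0, len(xs), _BATCH)  # 8 random indices into the dataset
x_batch, y_batch = xs[indices], ys[indices]      # only these 8 points are fed to TensorFlow
```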

drzhouq commented 7 years ago

Thanks a lot, Jared. I might not have made myself clear. "bigdata.py" is supposed to demonstrate that we can handle a "large" volume of data with TensorFlow. Because we only sample 8 points at a time, TensorFlow still deals with a very small amount of data, and I fail to see how it scales up to a "large" volume of data in the "bigdata.py" script. Both "bigdata.py" and "tensor.py" run "_EPOCH" times; the only difference is that "tensor.py" uses the same data in each loop, while "bigdata.py" samples different data from a large population.

jostmey commented 7 years ago

Tensor.py shows you how to process samples in parallel. If you have a GPU, you can increase _BATCH to something like 100, 1,000, or even 10,000, and because the samples are packed into tensors the batch will run in parallel. There are two reasons why you might want a bigger _BATCH: (1) you can get away with a larger step size and fewer _EPOCHS, and (2) you can handle data with more "variance", which is to say that if you are classifying between 100 outcomes, you want a batch size on the order of at least 100 (otherwise, convergence with stochastic gradient descent will be slow).

But you're right, you will still need to run through many "_EPOCH" iterations.
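
To make the batch-size point concrete, here is a rough TensorFlow 1.x sketch (not the repository's exact code; _BATCH, xs, ys, the model, and the optimizer settings are all illustrative). The whole batch flows through the graph as a single tensor, so a GPU can process the 1,000 samples in parallel:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, matching the era of this thread

_BATCH = 1000  # bumped up from 8; each step now averages the gradient over 1,000 samples

# Stand-in for the big dataset (the real script generates its own xs/ys):
xs = np.random.uniform(0.0, 10.0, 8000000).astype(np.float32)
ys = (2.0 * xs + 1.0).astype(np.float32)

x = tf.placeholder(tf.float32, shape=[None])  # first dimension left open = batch size
y = tf.placeholder(tf.float32, shape=[None])

slope = tf.Variable(0.0)
offset = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(slope * x + offset - y))  # mean over the whole batch
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for _ in range(1000):  # each pass samples a fresh batch of 1,000 points
        indices = np.random.randint(0, len(xs), _BATCH)
        session.run(train_op, feed_dict={x: xs[indices], y: ys[indices]})
```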

drzhouq commented 7 years ago

Many thanks for your time and patience, Jared. I think I was stuck on the fact that "bigdata.py" generates a big dataset, but the regression only samples 8 points at a time for 10,000 iterations. Therefore, the script uses at most 80,000 data points out of 8 million and leaves 7.92M unused. Perhaps, as you suggested, the script could use a large _BATCH, like 1,000, to simulate getting a data feed from a large dataset.

jostmey commented 7 years ago

It works because the extra 7.92M datapoints are very similar to the first 80,000 datapoints. In circumstances where this is not the case, you might consider running stochastic gradient descent much, much longer to cover all the datapoints (serial) or using a larger batch size (parallel).
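
As a back-of-the-envelope comparison (using the 8M figure from this thread, and ignoring the fact that the sampling is with replacement):

```python
DATASET_SIZE = 8000000  # size of the generated "big data" set discussed above

# Steps of gradient descent needed to touch every datapoint roughly once:
steps_small_batch = DATASET_SIZE // 8     # = 1,000,000 steps -> run SGD much, much longer
steps_large_batch = DATASET_SIZE // 1000  # = 8,000 steps     -> bigger batch, fewer steps
```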

drzhouq commented 7 years ago

That's my point. Except for generating the extra 7.92M random points, the "bigdata.py" script is identical to the "tensor.py" script. Therefore, I am not sure I get the purpose of the "bigdata.py" script. :-)

jostmey commented 7 years ago

The point is to explain how to use placeholders. If you don't use placeholders, the amount of data that can be handled by TensorFlow is limited.
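
For anyone reading along, the contrast looks roughly like this (a TensorFlow 1.x sketch, not the repository's exact code; the dataset and variable names are illustrative):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

xs = np.random.uniform(0.0, 10.0, 8000000).astype(np.float32)  # stand-in "big" dataset

# Without a placeholder, the data is baked into the graph as a constant, so the entire
# dataset has to live inside the graph definition (which is capped at 2 GB):
x_const = tf.constant(xs)  # workable for 8 points, wasteful or impossible for huge datasets

# With a placeholder, the graph only declares the type/shape of the input, and the
# actual numbers are streamed in step by step via feed_dict, 8 (or 1,000) at a time:
x = tf.placeholder(tf.float32, shape=[None])
doubled = 2.0 * x  # any ops built on the placeholder work as usual

with tf.Session() as session:
    batch = xs[np.random.randint(0, len(xs), 8)]       # one random mini-batch
    print(session.run(doubled, feed_dict={x: batch}))  # only the batch enters the session
```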