paivaspol / EECS583

Class Project

Running Real Data on TensorFlow #6

Open chrisjbaik opened 8 years ago

chrisjbaik commented 8 years ago

@paivaspol @zainahamid @arquinn

Current Status

Okay, so there are a few changes we made to the neural network to get it running better on our toy example. At first, we were consistently getting 0.1 accuracy with no change in cross-entropy, which is bad: with 10 result buckets, 0.1 accuracy means the model is performing no better than picking a bucket at random.

  1. Choice of training samples. We were initially generating a new random [batch_size, words_per_function] matrix on every iteration, but regenerating the data each iteration added a lot of noise. Instead, we fixed batch_size at 16 and created a training dataset of 100 functions; every iteration, we randomly select a new batch of 16 from that training set (sketched below). This is the appropriate way to do it - you want to run many iterations over different samples drawn from one training set, whereas what I was mistakenly doing was recreating the training set every time, making it impossible to actually train on it. This SO question helped with understanding what I needed.
  2. Changing the optimizer. Selecting the correct learning rate is a difficult problem for a vanilla gradient descent optimizer. After struggling with it for a while, we gave up and switched to the AdamOptimizer, which is an adaptive learning rate method and basically a black box for optimization. Results improved, and after 5000-6000 iterations we get 93%-100% accuracy on every new sample given (I don't think we need this level of accuracy, and it is probably overfitting at this point).
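
A minimal sketch of both changes on the toy setup, written against a TF 1.x-style API (the throwaway softmax model and all sizes/names here are illustrative, not what's in our repo):

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 16
WORDS_PER_FUNCTION = 32   # illustrative size, not the value in our repo
NUM_BUCKETS = 10

# 1. Build the training set ONCE, then sample a mini-batch from it every
#    iteration (instead of regenerating random data each step).
train_x = np.random.rand(100, WORDS_PER_FUNCTION).astype(np.float32)
train_y = np.random.randint(0, NUM_BUCKETS, size=100)

def next_batch():
    idx = np.random.choice(len(train_x), BATCH_SIZE, replace=False)
    return train_x[idx], train_y[idx]

# A throwaway softmax classifier, just to show where the optimizer swap goes.
x = tf.placeholder(tf.float32, [None, WORDS_PER_FUNCTION])
y = tf.placeholder(tf.int64, [None])
w = tf.Variable(tf.zeros([WORDS_PER_FUNCTION, NUM_BUCKETS]))
b = tf.Variable(tf.zeros([NUM_BUCKETS]))
logits = tf.matmul(x, w) + b
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

# 2. AdamOptimizer adapts its step sizes, so we avoid hand-tuning the
#    learning rate the way vanilla gradient descent requires.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        bx, by = next_batch()
        sess.run(train_op, feed_dict={x: bx, y: by})
```

The key point is that train_x/train_y are built once; only the batch indices change per iteration.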

    Items to consider moving forward

  3. Increased complexity. With the increased complexity of the real training data, I assume we're going to need to tinker with the neural network's parameters. What the correct settings are, I have no idea.
  4. Size of training data. @paivaspol, @arquinn how large do you expect the training data to be? To get a good model, I think we're going to need a ton of good data. And the more overlap in patterns there is, the better the model it will develop (so a more condensed representation is preferable, in my opinion). There's a balance between losing information in a particular representation and having extra information become noise.
  5. Execution on GPU. This might be a requirement, because running on the actual data is super time/processor intensive on a CPU. Write-ups about RNNs generally talk about training times on the order of hours even when running on a GPU, and without a GPU it could be 1000s of hours. Not good. We could purchase an Amazon EC2 spot instance with a GPU - prices range from about $.07 to $2.00 an hour. I don't think we should pay more than $.10 if we do use it.

    Additional Reading on Neural Networks

http://neuralnetworksanddeeplearning.com/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://deeplearning.net/tutorial/lstm.html

arquinn commented 8 years ago

Great! It all sounds good. Right now we're still working on getting all of the source data, and so much of our data lives in bucket 0... but right now I have 2546 different functions for which we have both source and valgrind output. There are roughly twice that many for which we have valgrind output but no source.

For the representation, I tokenized our source and found about 50,000 distinct tokens in the files. I'm happy to play around with further reducing the token names... right now I have a number of C/C++ language identifiers that I count as special tokens, and every other word counts as its own token. There are a bunch of ways we could restrict this down; let's talk more about it tomorrow.
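
Roughly, the tokenizer does something like this (the keyword set and regex below are simplified stand-ins, not the exact ones in my script):

```python
import re
from collections import Counter

# Simplified stand-in for the C/C++ identifiers counted as special tokens;
# the real list is longer.
SPECIAL_TOKENS = {"int", "char", "void", "if", "else", "for", "while",
                  "return", "struct", "class", "template", "namespace"}

def tokenize(source):
    # Whole words/identifiers as tokens, every other non-space character on its own.
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source)

def vocab_stats(sources):
    vocab = Counter()
    for src in sources:
        vocab.update(tokenize(src))
    # Tokens outside SPECIAL_TOKENS are the ones we could collapse or rename
    # if we want to shrink the ~50k vocabulary.
    non_special = sum(1 for t in vocab if t not in SPECIAL_TOKENS)
    return len(vocab), non_special

print(vocab_stats(["int foo(int x) { return x + 1; }"]))
```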

I don't have a great answer to the execution time issue... I think we'll have to take it in stride and see how it goes? 1000s of hours obviously doesn't work, but I have no idea about using EC2. If we can get the job done on there, then I don't see an issue chipping in my fair share of 2.5 cents :P

paivaspol commented 8 years ago

Cool! I'll try to get the rest of the functions in before our meeting tomorrow. I'm also fine with running it on EC2 to get it done :+1:

chrisjbaik commented 8 years ago

Status update: trying to get it running on EC2, but I ran into a roadblock. I was following this script to download it, but NVIDIA cuDNN requires a special developer account; I registered for one, but I need to wait a couple of days to see if they approve it.

chrisjbaik commented 8 years ago

Side note: Why we use cross-entropy:

https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/
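
The gist, with made-up numbers: classification error treats a barely-wrong prediction and a confidently-wrong one the same, while cross-entropy penalizes confident mistakes much more, which gives training a smoother signal to follow.

```python
import numpy as np

def cross_entropy(pred, true_class):
    # Negative log probability assigned to the correct class.
    return -np.log(pred[true_class])

true_class = 0
barely_wrong      = np.array([0.30, 0.40, 0.30])  # picks the wrong class, but only just
confidently_wrong = np.array([0.01, 0.98, 0.01])  # picks the wrong class with near certainty

# Classification error is identical (both predictions are simply "wrong"),
# but cross-entropy separates them clearly:
print(cross_entropy(barely_wrong, true_class))       # ~1.20
print(cross_entropy(confidently_wrong, true_class))  # ~4.61
```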

chrisjbaik commented 8 years ago

Status update: got it running on an EC2 GPU after receiving the developer account. I ran the script, though, and it is no faster on the GPU than on a CPU. It probably has something to do with the implementation - it might need some work to run faster / be parallelized for the GPU. I'm not even sure whether that's possible or what we'd need to do. Documentation online and on Stack Overflow is sparse.

chrisjbaik commented 8 years ago

Also created an AMI on us-east for AWS with our repo and GPU installation set up so we can quickly get it running if needed: ami-a7561fcd

chrisjbaik commented 8 years ago

Tried running a new configuration with our "mod 10" toy dataset, this time with the following parameters:

Results are that I end up around 40% accuracy on the training data after 15 epochs, where each epoch takes around 2000s (~33 min), which means roughly 8 hours for the entire thing on my own machine.

arquinn commented 8 years ago

Hey, sorry - I was going to merge in stuff for you to run tonight, but I got caught up in getting new data on unoptimized code. I'm hoping this will give us a better bucket distribution... but it does make valgrind take substantially longer to run. So I won't have results merged and pushed until tomorrow morning... but they should be pretty accurate at that point.

chrisjbaik commented 8 years ago

Okay. Any updates @arquinn ?

Updates from my end:

chrisjbaik commented 8 years ago

Also, I tried running on an AWS c4.large instance. Comparison for per-epoch execution:

Again, GPU execution doesn't improve significantly over my MacBook Pro, but a faster CPU helps.

arquinn commented 8 years ago

Yep, sorry. The master branch has a reasonable test set. Two other things are going on:

  1. I am re-running valgrind with no optimization on. Hopefully this gives us a better spread of cache performance. But boy oh boy is it slow... valgrind is slogging along with 4 benchmarks left to complete; I might not even finish them all tonight. You can use what is already out there, though - it's just likely not to be as spread out as the results I'm getting from the new runs.
  2. I am re-tokenizing our code... Turns out there were a few backedges that I had missed (whoops). It's a fairly small number, and I'm somewhat confident it will change the dataset only minimally... but there will be small changes.

chrisjbaik commented 8 years ago

Okay, a few questions for y'all:

  1. I'm encountering a few problems on my Mac OS X machine (screenshot: 2015-12-10, 3:37 PM):

    Turns out there are two copies of each of these files with varying case (some characters uppercase, others lowercase). Judging by this link, we need to either delete the duplicate versions or rename them. Are they the same files, or are they completely different? If they're different, can we give them completely distinct names, not just case-distinguished filenames?

  2. The number of tokens per function seems to be around ~20,000, which makes the LSTM 20,000 steps long. This is prohibitively slow, so I'm wondering if there are a few changes we can make. I think the LSTM should be more like 100 steps, which creates a huge problem... I have a few proposals (rough sketch after this list):
    • Truncate the function at 100 tokens. This doesn't seem very smart.
    • Somehow condense the representation. That could mean a lot of things, like ignoring tokens that are irrelevant (e.g., only considering load/store/branch)... I don't know. Ideas? help :(
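
A rough sketch of the two proposals (the 100-step cap and the token names are illustrative):

```python
MAX_STEPS = 100
MEMORY_TOKENS = {"load", "store", "branch"}  # illustrative token names

def truncate(tokens):
    # Proposal 1: keep only the first MAX_STEPS tokens (throws away the tail).
    return tokens[:MAX_STEPS]

def condense(tokens):
    # Proposal 2: drop everything except load/store/branch-style tokens first,
    # then cap whatever is still too long.
    kept = [t for t in tokens if t in MEMORY_TOKENS]
    return kept[:MAX_STEPS]
```
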
chrisjbaik commented 8 years ago

I recommend a token length of 100-200. Anything beyond that gets substantially slower. :confused:

arquinn commented 8 years ago

So I just made a push. Here is where we're at:

There are two different data_modules now. Both data_modules take a constructor parameter, "max_tokens", which specifies the maximum number of tokens a function may have (defaulting to 200); we drop all functions longer than that. The new data_module operates on function tokens that include only load/store and the like - a subset of the original tokens.
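
Paraphrased, the filtering works roughly like this (names and defaults here are from memory - check the actual data_module for the real interface):

```python
class DataModule(object):
    """Paraphrased sketch of the data_module interface, not the real code."""

    def __init__(self, max_tokens=200, memory_tokens_only=False):
        self.max_tokens = max_tokens
        self.memory_tokens_only = memory_tokens_only  # new module: load/store-style tokens only
        self.functions = []

    def add_function(self, tokens, label):
        if self.memory_tokens_only:
            tokens = [t for t in tokens if t in ("load", "store", "branch")]
        # Functions longer than max_tokens are dropped entirely, not truncated.
        if len(tokens) <= self.max_tokens:
            self.functions.append((tokens, label))
```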

Hope this helps! We are up to about a hundred thousand functions at this point...

chrisjbaik commented 8 years ago

Okay, it's up and running right now. I really hope it works. It takes around 3000-5000s per epoch (50-80 min). I really hope there are no errors involved, because that would stink. I'm shooting for 40 epochs, which should take around 53 hours total. So... I REALLY HOPE THERE ARE NO ERRORS.

In any case, the baseline gets around 50% on the first epoch. Do you guys have any analytics on what the data looks like as to why that might be?

chrisjbaik commented 8 years ago

For some reason each epoch slows down. I'm not sure whether it's a resource allocation issue on the Amazon VMs, or whether the algorithm is slowing down / consuming too much memory... Hmm.

arquinn commented 8 years ago

Yep, check the presentation. There are some numbers on what the dataset looks like under the results section somewhere.

arquinn commented 8 years ago

So… I hate to be the bearer of bad news, but our data is currently totally screwed up. The C++ stuff messed up a bunch of our parsing scripts, so I had to roll some things back. I'm down to only 4k functions, but I know that they're actually real results (whereas I'm convinced the data under master is messed up). I don't know how we didn't catch this until now… my fault, I think.

Pushed accurate sources to master. Sorry all! Yikes.

zainahamid commented 8 years ago

I was speaking with Chris a couple of minutes ago, and I guess he's asleep now. I just pulled the data and started the execution on my local machine.

Epoch: 1 Learning rate: 0.001 step: 0, accuracy: 0, distance: 2.4, xent: 46.9381

chrisjbaik commented 8 years ago

okay rerunning as well

chrisjbaik commented 8 years ago

Had a little issue with data loading. Re-running again. I got to the end of 40 epochs, but with the small size of the training data, that's NOT enough for the model to converge. So instead, I'm running 1000 epochs. I'm not sure this will help, but let's see what happens.

arquinn commented 8 years ago

Yeah.. We might just wind up with an 'ideas' paper...

paivaspol commented 8 years ago

I think that's fine. I'll be going through the slides today and finishing up the dataset subsection in the evaluation section.

I can also run the thing locally with a different setting - just give me the setting and I can start running it.

zainahamid commented 8 years ago

Still getting the Value 0.0 error within Epoch 1 after Step 140; can't figure it out. @chrisjbaik any clue?

chrisjbaik commented 8 years ago

I have no problem running the stuff on an EC2 box. Wondering if it's a Mac OS X issue?

I'm currently on Epoch 272, and training accuracy is around 90.5%. It might have been smart to include a validation set so we'd have an unskewed number for the accuracy improvements, but oh well.
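
For the next run, the change is small - roughly this (the 10% split is arbitrary):

```python
import numpy as np

def train_val_split(features, labels, val_fraction=0.1, seed=0):
    """Hold out a slice of the training data so per-epoch accuracy isn't
    measured on the same examples we train on."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(features))
    n_val = int(len(features) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return (features[train_idx], labels[train_idx],
            features[val_idx], labels[val_idx])

# After each epoch: evaluate accuracy on the held-out slice without training on it.
```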

arquinn commented 8 years ago

I get that those numbers are likely skewed, but 90% seems insanely good! Can you run the accuracy on the test set periodically, or does it have to wait until after we've fully trained the network?

chrisjbaik commented 8 years ago

On Epoch 370, it's at 91.2%. This is not run on the test set; I will do that after it's all trained - that's why I mentioned it'd be good to have a validation set to double-check per epoch. Oops.

paivaspol commented 8 years ago

So we will be able to run with the test set after 1000 epochs are completed?

chrisjbaik commented 8 years ago

Um, more like it'll run automatically when the 1000 epochs finish. Perhaps I should have just stopped it at 350. It's actually faster at this point for me to restart and run it to 350 epochs. Maybe I will do that, lol.

chrisjbaik commented 8 years ago

I will launch a second machine instead and do both. :smile:

zainahamid commented 8 years ago

I've been trying to run it on my local machine, but it keeps giving the error on the 140th step - probably some issue running on Mac, because it's the same updated code and I don't see what's going wrong.

paivaspol commented 8 years ago

Chris, if you're doing that, maybe you can add a validation set as well, if that's not too complicated for you to do.

chrisjbaik commented 8 years ago

Didn't get a chance to add the validation set... running again, though! :)

chrisjbaik commented 8 years ago

Sigh. Had a little bug at the end of it. I re-tested with 1 epoch and it should be working now. Also added a validation set. Will re-run one last time (hopefully).

chrisjbaik commented 8 years ago

For some weird reason it died by itself with just a notice that says "Killed" on Epoch 224. Either way, the results don't look too promising: training accuracy is 87.6%, while validation accuracy at the same point is 29%.

I reran it again, but we should start coming up with a contingency plan for how to present and assess the evaluation.

paivaspol commented 8 years ago

Can we run with a low epoch count that we know will finish, like 200? I know the accuracy will suck, but at least we'd get a number out there.

For the presentation, we can frame it as: we tried out an idea, and it didn't work out the way we expected. We can probably try to explain why it didn't work out, too.

arquinn commented 8 years ago

Well, I think we still need to tease out how it failed. Metrics like average distance will help explain our issue, especially since the buckets are relatively close together. Potentially 29% isn't actually THAT bad if you look at how far apart the buckets are... not sure.
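
By "distance" I mean something like the mean number of buckets the prediction is off by, roughly:

```python
import numpy as np

def average_bucket_distance(predicted, actual):
    # Mean absolute difference between predicted and true bucket indices:
    # 0 is perfect, and being one bucket off may still count as "close" for us.
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))

# e.g. average_bucket_distance([2, 5, 7], [3, 5, 4]) -> 1.33
```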

chrisjbaik commented 8 years ago

Okay I'll get back to y'all later tonight with some of the data and results. Sorry about the troubles.

chrisjbaik commented 8 years ago

@paivaspol yeah, I'm rerunning with 200 epochs now. Sorry, cutting it tight... I can't get the data at the moment. It kept dying at around 223 epochs for an unknown reason...

chrisjbaik commented 8 years ago

I guess it was dying because of memory overload, as per this link. Sigh.

chrisjbaik commented 8 years ago

Final results after 200 epoch execution:

Test accuracy: 27.3%
Test distance: 2.646
Test cross-entropy per observation: 3.816 (this actually doesn't mean much)

chrisjbaik commented 8 years ago

Here are the results for the data:

https://dl.dropboxusercontent.com/u/20010067/nn_results.tar.gz