raequan opened this issue 4 years ago
Yes, BERT is about 100x slower than the other models we have used. Training the model to convergence will take several days on a normal laptop.
Fortunately, you are not required to train to convergence. You only have to train long enough to reach the accuracy levels in the assignment, which should take 2-3 hours with a good choice of hyperparameters.
Do you happen to have Zoom office hours today to discuss this?
I'm on office hours right now.
I'm waiting to be let inside
@mikeizbicki Do you still have any time today after office hours? I am very stuck on where to begin implementing some of the code.
@raequan Sorry, I had to leave for another meeting before you showed up.
@raequan @benfig1127 I'll have office hours tomorrow morning at 9am.
@mikeizbicki sounds good, I managed to solve some of the issues but still had a few questions, so I will swing by. Thanks!
@mikeizbicki I have a few questions:
What smoothing value should we use in TensorBoard? With smoothing set to 0.999 I am much farther from the targets than with 0.984. Does it matter which smoothing value we use, or do we need to reach these benchmarks at 0.999? I understand what the smoothing value does (in terms of plotting), but my computer is taking a long time to run, and I expect that if the curve reaches the target at a lower smoothing value, it will also reach it at a higher smoothing value given more samples.
How many runs do we need in our TensorBoard upload? Would it be fine for me to upload only the light blue line on my tensorboard.dev?
My loss increased, peaked once, and then slowly started to decrease. That indicates my loss is working correctly, right?
Lastly, you can see how long my BERT model is taking to train; the blue line in particular took 22 hours to reach 20% accuracy, but it is working. I'm running nothing else. Is this normal?
Here is my tensorboard.dev so you can see what I mean: https://tensorboard.dev/experiment/yTzAPgNjRHKEjS5vybEn9A/
Any smoothing value is fine. 0.99 would be a good choice.
You only need a single run. There is no need for warm starting.
Your loss value is pretty high. It should typically end up much lower than where it started. For this problem, however, I'm not grading your loss value. I'm only grading your accuracy.
With better hyperparameters, you could get it to converge to the required values in just 2-3 hours. But you will not be graded on the runtime.
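On the smoothing question, it may help to know that TensorBoard's smoothing slider just applies an exponentially weighted moving average to the plotted points; it has no effect on training. Here is a rough sketch of what the slider computes (the function name and the example values are made up for illustration):

```python
def smooth(values, weight=0.99):
    """Exponential moving average, roughly what TensorBoard's smoothing slider does.

    `weight` is the slider value: 0 shows the raw curve, and values
    close to 1 average over many more steps, so the smoothed curve
    lags well behind the raw one.
    """
    smoothed = []
    last = values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Example: a noisy accuracy curve smoothed heavily (0.99) vs lightly (0.9)
raw = [0.10, 0.35, 0.20, 0.45, 0.40, 0.55, 0.50, 0.65]
print(smooth(raw, weight=0.99))
print(smooth(raw, weight=0.9))
```

This is why a higher smoothing value makes the curve sit farther below the targets: it is averaging in many more of the early, low-accuracy steps.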
Thank you! This was all good to know. One last question, I believe:
I have implemented the embed function, and my code runs. I think I got it to work, but we haven't looked at projections before. From a very poor warm-start run, I got the following projection on my tensorboard.dev:
Does this look correct? I plan to redo it with my working model once it reaches the benchmarks, but I want to know if this is how it should look. I think it's what we should have, because searching the labels shows points related to news articles, but I just want to be sure. Can you explain the graph?
You'll want to use t-SNE, and you'll need to adjust the hyperparameters until you get some clusters forming.
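In case it's useful, here is a minimal sketch of how embeddings usually get into the TensorBoard projector with PyTorch's SummaryWriter; the tensors, labels, and log directory below are placeholders, not the assignment's actual data:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/projector_demo')  # placeholder log dir

# Placeholders: in practice these would be the vectors produced by your
# embed function and the category label for each embedded article.
embeddings = torch.randn(500, 128)
labels = ['news'] * 250 + ['sports'] * 250

# add_embedding writes the vectors and their metadata so the Projector
# tab can run PCA or t-SNE on them interactively.
writer.add_embedding(embeddings, metadata=labels, tag='article_embeddings')
writer.close()
```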
When I reach the benchmarks, presumably there will be clusters, correct?
I think you could probably already get clusters if you use the t-SNE algorithm with the right parameters.
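If the in-browser projector is slow to experiment with, one way to check whether clusters form is to run t-SNE offline with scikit-learn; perplexity is the main knob to adjust. This is just a sketch with random placeholder data standing in for the real embeddings and labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: swap in your model's embedding matrix and the label for each row
embeddings = np.random.randn(500, 128)
labels = np.repeat(np.arange(5), 100)

# Perplexity (roughly the expected neighborhood size) has the biggest effect
# on whether clusters appear; values between 5 and 50 are worth trying.
proj = TSNE(n_components=2, perplexity=30, init='pca',
            random_state=0).fit_transform(embeddings)

plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=5)
plt.title('t-SNE projection of article embeddings')
plt.savefig('tsne.png')
```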
Running BERT on my computer is taking extremely long; I have reached only 2.4K steps after 12 hours. Are there any ways to speed this up? This is the only application I have running, and my computer says I have no application space when I run more than two scripts at a time.
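For anyone in the same situation, two common ways to cut the per-step cost when BERT is CPU-bound are to freeze the encoder (so only the layers you add on top are trained, at some cost in final accuracy) and to truncate inputs to a short maximum length, since attention cost grows quadratically with sequence length. A sketch under the assumption that the model is a HuggingFace bert-base; the assignment's actual code and requirements may differ:

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumes a HuggingFace bert-base model; adapt the names to your own code.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

# 1. Freeze the BERT encoder so backprop only touches whatever head you add
#    on top; each step becomes much cheaper, but accuracy may suffer.
for param in bert.parameters():
    param.requires_grad = False

# 2. Truncate inputs; 64-128 tokens is far faster than the 512-token maximum.
batch = tokenizer(["example news headline"], padding=True, truncation=True,
                  max_length=64, return_tensors='pt')

with torch.no_grad():
    out = bert(**batch)
print(out.last_hidden_state.shape)
```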