mdmustafizurrahman closed this issue 4 years ago
@muelletm I was trying to run the pre-training data generation code. It ran for 4 days and consumed almost 164 GB of RAM before it was killed by the OS. It looks like the code is putting everything in memory? Am I correct? Is there a way around this?
Mhm, can you share the command you ran?
I am assuming you are using the local beam runner. I would be surprised if it put everything in memory.
Basically, running on a single machine will take a very long time. You should either use Google Cloud as described (or another Apache Beam back-end?) or manually split the data into smaller sets that can be processed by multiple machines in parallel; see the sketch below.
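For reference (this is generic Apache Beam usage, not a TAPAS-specific command), here is a minimal sketch of pointing a Beam pipeline at the Dataflow runner instead of the local DirectRunner. The project, region, and bucket values are placeholders:

```python
# Illustrative sketch: switching an Apache Beam pipeline from the local
# DirectRunner to the Dataflow runner. Flag names are standard Beam/Dataflow
# options; project and bucket values below are placeholders, not real ones.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",            # DirectRunner would run on one local machine
    "--project=my-gcp-project",           # placeholder GCP project id
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--staging_location=gs://my-bucket/staging",
])

with beam.Pipeline(options=options) as p:
    # The repo's pre-training data generation transforms would be attached
    # to `p` here (see the pipeline construction code in the repository).
    pass
```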
Yes, I was running it on a local machine at first; then I set up Google Dataflow, but it has been running there for 2 days.
On Dataflow you should be able to see how many machines are being used and what the progress is so far. Could you share that?
To be honest, I only tested the pipeline with the small sample. Processing that didn't take more than 10 minutes or so, and most of that time is overhead (scheduling, etc.). So I would expect this to be much faster than 2 days when using an appropriate number of machines.
So far it is still running on Dataflow with 3 machines. Here is a screenshot:
@thomasmueller-google It would be really great if you could share the pre-training data in TFRecord format. I really want to develop a pre-trained checkpoint using 4- and 6-layer BERT.
@thomasmueller-google My pre-training data generation completed on Google Dataflow. It took 4 days and 9 hours.
Great to hear! This was using 3 machines, right? It should scale nicely with the number of machines used, so with 12 machines it would only take about 1 day, and so on.
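If it helps, the worker count on Dataflow is controlled through the standard Dataflow pipeline options. A rough sketch (values are placeholders and not something I have tested with this pipeline):

```python
# Sketch only: standard Dataflow worker options that control how many
# machines the job can scale out to. All values here are illustrative.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",              # placeholder
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",    # placeholder bucket
    "--num_workers=12",                      # start the job with 12 machines
    "--max_num_workers=12",                  # cap autoscaling at 12 machines
    "--autoscaling_algorithm=THROUGHPUT_BASED",
])
```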
@muelletm @ebursztein Can you release the TFRecord data for pre-training? I want to develop a 6-layer pre-trained checkpoint for TAPAS.
I am currently running the pre-training data generation on my single-CPU machine and it has been running for 3 days. Does it take this long on CPU? Should it create any temporary files? I cannot see the train.tfrecords and test.tfrecords files yet.
Can we run the pre-training data generation on a GPU? I tried to allocate a GPU, but the existing code does not seem to use it.