google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

TF records for pretraining data #41

Closed: mdmustafizurrahman closed this issue 4 years ago

mdmustafizurrahman commented 4 years ago

@muelletm @ebursztein Can you release the TF records for the pre-training data? I want to develop a 6-layer pretrained checkpoint for TAPAS.

I am currently running the pre-training data generation on a single-CPU machine and it has been running for 3 days. Is it expected to take this long on CPU? Should it create any temporary files? I cannot see the train.tfrecords and test.tfrecords files even though it has been running for 3 days.
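
As a side note, one way to check whether a TFRecord output file has started filling up is to count the serialized examples in it. A minimal sketch, assuming TensorFlow 2.x with eager execution and the output filename mentioned above:

```python
import tensorflow as tf

def count_tfrecords(path):
    """Count the serialized examples in a TFRecord file."""
    return sum(1 for _ in tf.data.TFRecordDataset(path))

# "train.tfrecords" is the expected output file mentioned above.
print(count_tfrecords("train.tfrecords"))
```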

Can the pre-training data generation run on a GPU? I tried to allocate a GPU, but the existing code does not seem to use it.

mdmustafizurrahman commented 4 years ago

@muelletm I was trying to run the pre-training data generation code; it ran for 4 days and consumed almost 164 GB of RAM before it was killed by the OS. It looks like the code is keeping everything in memory. Am I correct? Is there a way around this?

ghost commented 4 years ago

Mhm, can you share the command you ran?

I am assuming you are using the local Beam runner. I would be surprised if it put everything in memory.

Basically, running on a single machine will take a very long time. You should either use Google Cloud as described (or another Apache Beam back-end?) or manually split the data into smaller sets that can be processed by multiple machines in parallel.
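
For illustration, a minimal sketch of how an Apache Beam runner is typically selected via pipeline options. This is generic Beam code, not the TAPAS pipeline itself, and the project, region, and bucket values are hypothetical placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local, single-machine execution (slow for large inputs):
local_options = PipelineOptions(["--runner=DirectRunner"])

# Distributed execution on Google Cloud Dataflow
# (project, region, and bucket are placeholders):
dataflow_options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
])

# A trivial pipeline to show where the options plug in:
with beam.Pipeline(options=local_options) as p:
    (p
     | beam.Create(["row 1", "row 2", "row 3"])
     | beam.Map(str.upper)
     | beam.Map(print))
```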

mdmustafizurrahman commented 4 years ago

Yes, I was running it on my local machine. I then set it up on Google Dataflow, but it has still been running for 2 days there.

ghost commented 4 years ago

On Dataflow you should be able to see how many machines are being used and what the progress is so far. Could you share that?

To be honest, I only tested the pipeline with the small sample. Processing that didn't take more than 10 minutes or so, and most of that time was overhead (scheduling, etc.). So I would expect this to be much faster than 2 days when using an appropriate number of machines.

mdmustafizurrahman commented 4 years ago

So far it is still running in Dataflow with 3 machines. Here is a screenshot: [Dataflow job progress screenshot]

mdmustafizurrahman commented 4 years ago

@thomasmueller-google It would be really great if you could share the pre-training data in TFRecord format. I really want to develop pretrained checkpoints using 4-layer and 6-layer BERT.

mdmustafizurrahman commented 4 years ago

@thomasmueller-google My pre-training data generation has completed on Google Dataflow. It took 4 days and 9 hours. [Dataflow job completion screenshot]

ghost commented 4 years ago

Great to hear! This was using 3 machines, right? It should scale roughly linearly with the number of machines, so with 12 machines it would only take about 1 day, and so on.
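
For reference, the worker count on Dataflow can be controlled through standard Beam pipeline options. A minimal sketch, again with hypothetical project, region, and bucket values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",            # hypothetical project id
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",  # hypothetical bucket
    "--num_workers=12",                    # start with 12 workers
    "--max_num_workers=12",                # cap autoscaling at 12
])
```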