tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Running C4 dataset pipeline on Cloud Dataflow - running time and resources #1931

Open shepsels opened 4 years ago

shepsels commented 4 years ago

What I need help with / What I was wondering

I'm running the C4 Dataflow pipeline as described in this guide: https://www.tensorflow.org/datasets/beam_datasets. At first, I ran it without any restrictions, and it kept scaling up until it had used all of the free IP addresses across our entire Google Cloud account. On the second run, we set max_workers to 20. It has been running for quite some time (~72 h), and we have no way to estimate how much longer it will take or whether something has gone wrong (no errors appear in the logs).

We'd be happy to know whether that is a reasonable running time, and to learn some ways to inspect this pipeline and gauge our progress.

Thank you.

Environment information (if applicable)

Conchylicultor commented 4 years ago

@adarob FYI

adarob commented 4 years ago

I recently added this information to the T5 README (https://github.com/google-research/text-to-text-transfer-transformer#c4):

C4

The C4 dataset we created for unsupervised pre-training is available in TensorFlow Datasets, but it requires a significant amount of bandwidth for downloading the raw Common Crawl scrapes (~7 TB) and compute for its preparation (~341 CPU-days). We suggest you take advantage of the Apache Beam support in TFDS, which enables distributed preprocessing of the dataset and can be run on Google Cloud Dataflow. With 450 workers, the job should complete in ~18 hours.

After defining MY_PROJECT and MY_BUCKET appropriately, you can build the dataset on Cloud Dataflow from GCP using the following commands:

pip install tfds-nightly[c4,gcp]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"
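
Once the job is launched, its progress can also be inspected from the command line (a minimal sketch, assuming the gcloud CLI is installed and the job runs in us-central1, the default Dataflow region):

gcloud dataflow jobs list --project=$MY_PROJECT --region=us-central1 --status=active
gcloud dataflow jobs describe $JOB_ID --project=$MY_PROJECT --region=us-central1

The Dataflow page in the Cloud Console shows the job (named c4 above) with per-step element counts and the worker autoscaling chart, which is usually the easiest way to gauge how far along the pipeline is.
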
Conchylicultor commented 4 years ago

@adarob Maybe we should also add the documentation to the C4 description, or at least a link to the instructions: https://www.tensorflow.org/datasets/catalog/c4

shepsels commented 4 years ago

Thanks @adarob @Conchylicultor. We are using Beam and the download_and_prepare script. Good to know that nothing is stuck and that we just need to wait some more. And yes, I think it would be very useful to add that info to the main C4 info page; if I had known this beforehand, I would have found a way to run it over the weekend on more CPUs.

Thank you again, Paz.

shepsels commented 4 years ago

Hi again, I have a follow-up question. We started running the Dataflow pipeline ~25 hours ago, following your recommendations, with 7*64=448 CPU cores. The pipeline has already used 10,767 vCPU-hours, which is ~450 CPU-days. Is there a way for us to figure out how much longer it will run, or to tell whether something has gone wrong?

Thanks again, Paz.

adarob commented 4 years ago

Did you enable the shuffle service?

shepsels commented 4 years ago

@adarob Not explicitly, no.

shepsels commented 4 years ago

@adarob Is this crucial (should I start again?), or will it just take some more time?

adarob commented 4 years ago

What exact command did you use to launch? I haven't tried it without the shuffle service. It will certainly take longer and use more memory, which could cause you to OOM. It's worth a try, though!
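
For reference, the shuffle service is enabled by the experiments=shuffle_mode=service entry shown in the commands earlier in this thread; a minimal fragment of the options string with that setting (other options elided, values as in the command above) would look like:

  --beam_pipeline_options="project=$MY_PROJECT,runner=DataflowRunner,experiments=shuffle_mode=service"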

shepsels commented 4 years ago

I'd like to run it again with the correct parameters. But after it ran for so long before, I want to be 100% sure this configuration is complete and optimal. Can you please take a look at this and let me know if something is missing, or not optimal?

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7,project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"

I added machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7 because we have a quota on the number of machines in our billing account, and the default machine type is n1-standard-1 (which would require ~450 machines), so we would like to use 7 64-core machines instead. I also set the disk size because I understood that the default is 250 GB per worker and I have fewer workers. Is anything wrong with that? Will it do the job? Thanks.

adarob commented 4 years ago

I'm not positive it will use multiple cores on each machine by default, due to the Python global interpreter lock. As long as you had experiments=shuffle_mode=service with 450 workers before, you should have been using the same setup as I did, which completed in less than a day.

roynirmal commented 4 years ago

@shepsels how long did it ultimately take to finish? I also cannot set 450 workers due to the quota limit, so I am planning to use the same beam_pipeline_options as yours.

shepsels commented 4 years ago

@roynirmal After some struggling I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

roynirmal commented 4 years ago

@shepsels That's good to know! Can you tell me how you were successful in raising the quota? I raised the CPU quota for us-central1 to 450; however, I am maxed out at 32 CPUs since apparently that's the global quota. I asked to raise it but was rejected. Any help is appreciated since I am up against a deadline.

adarob commented 4 years ago

If anyone wanted to host the dataset in a public bucket for others to use, it would be essentially free for the host with Requester Pays (https://cloud.google.com/storage/docs/requester-pays).
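
A minimal sketch of that setup with gsutil (the bucket name is a placeholder; the -u flag is what bills the requester's own project for the data access charges):

gsutil requesterpays set on gs://some-public-c4-bucket
gsutil -u $REQUESTER_PROJECT -m cp -r gs://some-public-c4-bucket/tensorflow_datasets/c4 ./c4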

roynirmal commented 4 years ago

@adarob good to know, I can host it once I have finished the download. With my current quota looks like it will take around 10 days :\

theTB commented 4 years ago

@roynirmal @shepsels were you able to host the data by any chance? Would be really helpful, thanks!

wnagele commented 4 years ago

I would also really appreciate it if I could use this dataset already processed. The cost of processing all of it is too high for my testing.

roynirmal commented 4 years ago

Hey @theTB @wnagele, I did not ultimately download it since the cost was too much! But I am open to the idea of sharing the cost to download the data.

theTB commented 4 years ago

Did you figure out the resources and cost required for processing? I tried using the free credits, but I keep hitting a quota limit on the number of in-use IP addresses, which doesn't allow me to scale to more workers (even though I have many more vCPUs available in my quota). Is there a workaround to avoid using so many IP addresses?

roynirmal commented 4 years ago

Processing the whole dataset will definitely eat up the entire free credits. I think Cloud Dataflow charges exorbitantly; I am not sure how much it would cost us even if we can indeed download the data in less than 24 hours. I was also allowed only 1 IP address, so the workaround is to use a single machine with multiple cores.
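
A sketch of what that single-machine variant might look like, reusing the options already shown in this thread (whether one n1-standard-64 worker and 5000 GB of disk are actually enough for c4/en is an assumption, not something verified here):

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="machine_type=n1-standard-64,disk_size_gb=5000,num_workers=1,max_num_workers=1,project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"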

prashant-kikani commented 4 years ago

Has anyone uploaded the cleaned C4 data (~750 GB)?

I want to download only the clean C4 data, not the entire ~7 TB of Common Crawl data.

Thanks.

adarob commented 4 years ago

@craffel

feather820 commented 3 years ago

@roynirmal After some struggling I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

How much did it cost you to download the C4 dataset? I want to download it, but I'm afraid it will cost a lot.

amchauhan commented 3 years ago

Hi @adarob @shepsels, any estimates for processing mC4 (the multilingual one, which is ≈26 TB) on GCP, both in terms of time and cost?

spate141 commented 1 year ago

@adarob @shepsels @feather820 @amchauhan Came here after stumbling upon many threads. If anyone has successfully processed CC data, can we please have some numbers on the resources used and the time it took to process the data?

craffel commented 1 year ago

FYI, you can download a prepared and preprocessed C4 directly now: https://huggingface.co/datasets/allenai/c4
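
A minimal sketch of pulling just the English split from that mirror with git and git-lfs (the en/ path is an assumption about the repository layout; adjust the include pattern for other variants):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"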

spate141 commented 1 year ago

Thanks @craffel, but I have a use case that requires processing the latest CC data dumps. The C4 from AllenAI seems to be based on the April 2019 version.

kimsan0622 commented 1 year ago

Hello @spate141, I strongly recommend that you use AI2's preprocessed C4 release, as @craffel mentioned (I have used it as well). But if you want to process a new Common Crawl dump, it will cost you about 3,500 bucks for one WET file (that is what I spent to process one WET file with 1,024 cores and the Dataflow API on the 'C4 en' split).

spate141 commented 1 year ago

Hi @kimsan0622, it looks like using 75 workers on the CC-MAIN-2013-20 dump (105 TB, almost 2 billion files) costs about €950. Did you mean 3,500 USD?

kimsan0622 commented 1 year ago

@spate141 There was a mistake in the cost calculation. It cost 3500 USD to process 2 WET files with Dataflow.

versae commented 1 year ago

FWIW, these days it might be useful to try the olm-datasets approach, i.e., a single massive instance doing it all.

spate141 commented 1 year ago

Thanks @versae! olm-datasets seems pretty straightforward for processing CC data, that's exactly what I was looking for!