google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0
6.14k stars · 756 forks

Low TPU usage (under 7%) with default fine-tuning parameters, small model #15

Open anatoly-khomenko opened 4 years ago

anatoly-khomenko commented 4 years ago

The problem

TPU utilization seems ineffective: the CPU load shown in the Google Cloud console stays under 7% while fine-tuning:

[screenshot: Cloud console CPU utilization graph]

Meanwhile, fine-tuning throughput is quite low (`global_step/sec: 0.402755`, `examples/sec: 824.843`):

```
WARNING:tensorflow:TPUPollingThread found TPU b't5-ex2' in state READY, and health HEALTHY.
W1118 22:32:02.386583 140230713861888 preempted_hook.py:91] TPUPollingThread found TPU b't5-ex2' in state READY, and health HEALTHY.
INFO:tensorflow:loss = 0.00076675415, step = 1001000 (248.289 sec)
I1118 22:32:31.888029 140234883987200 basic_session_run_hooks.py:260] loss = 0.00076675415, step = 1001000 (248.289 sec)
INFO:tensorflow:global_step/sec: 0.402755
I1118 22:32:31.890788 140234883987200 tpu_estimator.py:2307] global_step/sec: 0.402755
INFO:tensorflow:examples/sec: 824.843
I1118 22:32:31.891605 140234883987200 tpu_estimator.py:2308] examples/sec: 824.843
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I1118 22:32:31.893662 140234883987200 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I1118 22:32:31.894027 140234883987200 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
I1118 22:32:32.458967 140230713861888 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b't5-ex2' in state READY, and health HEALTHY.
```

How to reproduce

I use the following configuration, provided as a fine-tuning example:

```sh
export PROJECT=projectname
export ZONE=us-central1-b
export BUCKET=gs://uniquebucketname
export TPU_NAME=t5-ex2
export DATA_DIR="${BUCKET}/t5-boolq-data-dir"
export MODEL_DIR="${BUCKET}/t5_boolq-small-model_dir"

ctpu up --name=$TPU_NAME --project=$PROJECT --zone=$ZONE --tpu-size=v3-8 \
  --tpu-only --tf-version=1.15.dev20190821 --noconf

t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="dataset.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="MIXTURE_NAME = 'super_glue_boolq_v102'" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
```

adarob commented 4 years ago

This means the model must be I/O bound, due in part to its small size. We do tokenization and packing on the fly by default. I have a TODO to add support for caching smaller datasets in memory post-tokenization. Let me see if I can get to it today and you can try it out.
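The in-memory caching being described can be sketched with plain `tf.data` (a minimal illustration of the idea, not the actual T5 code path; the toy map stands in for tokenization/packing):

```python
import tensorflow as tf

# Build a tiny pipeline with an "expensive" map step, then cache its output
# so the map runs only on the first pass over the data.
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
ds = ds.map(lambda x: x * 10)  # stands in for tokenization/packing
ds = ds.cache()                # keep post-map elements in memory
ds = ds.repeat(2)              # the second epoch reads from the cache

values = [int(v) for v in ds.as_numpy_iterator()]
```

On the second pass, elements come from the in-memory cache instead of re-running the map, which is what removes the per-step preprocessing cost.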

anatoly-khomenko commented 4 years ago

Thank you @adarob, let me know if I can help you implement something. I'm using T5 for one of my current work tasks and I'm eager to make it train faster.

adarob commented 4 years ago

Can you try using the latest commit to see if this improves? It will cache the dataset on the first pass, so it will be much faster after that.

anatoly-khomenko commented 4 years ago

@adarob Hi Adam, thank you for the update. I'm running it now (I had to update Python to 3.6 to be able to run the latest code, which took a good part of the day). From what I see, it did not improve: the timing is pretty much the same (`global_step/sec: 0.401189`, `examples/sec: 821.635`), but strangely the TPU load decreased to 1%.

[screenshot: Cloud console TPU utilization graph]

t5-copybara commented 4 years ago

Can you maybe hardcode it to use `ds = ds.cache()` just to make sure it is getting enabled?


anatoly-khomenko commented 4 years ago

Hi @t5-copybara, As far as I understand it is here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L654

I will comment out the condition and keep `ds = ds.cache()`.

Please, let me know if this is the correct approach.

Thank you!

anatoly-khomenko commented 4 years ago

@t5-copybara ,

I have found that I could specify use_cached in command line parameters like this:

`--gin_param="mesh_train_dataset_fn.use_cached = True"`

This is the complete line:

```sh
t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="dataset.gin" \
  --gin_param="mesh_train_dataset_fn.use_cached = True" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="MIXTURE_NAME = 'super_glue_boolq_v102'" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
```

When I run it though, I get the exception here:

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L602

Do you know where I can specify cache directories?

anatoly-khomenko commented 4 years ago

OK, I have found that I can specify cache directories using `--additional_task_cache_dirs="${CACHE_DIR}"`, but in this case the cache does not get created either.

This is the message I get:

```
22:18:44.970715 140004609750016 utils.py:584] 'super_glue_boolq_v102' does not exist in any task cache directories (searched ['gs://uniquebucketname/t5-boolq-data-dir-cache/super_glue_boolq_v102']).
```

Giving up for now.

adarob commented 4 years ago

The offline `use_cached` stuff is only supported on our internal infrastructure for the time being. What I added for you does the caching on the fly. You should be able to explicitly enable it as you mentioned above ("I will comment out the condition and keep `ds = ds.cache()`"). Are you sure you're actually using this new code when you run, and not what's in the pip package?

anatoly-khomenko commented 4 years ago

@adarob Hi Adam, I'll give it a try on another run. But I see several places with `ds = ds.cache()`; do I have to make the change in all of them?

I'm using the latest master, installed from source with:

```sh
pip install --upgrade -e ./text-to-text-transfer-transformer
```

And I see all the recent fixes there, so I'm pretty sure I'm using the most recent version.

Will let you know as soon as I run it again.

f-lng commented 4 years ago

I am currently fine-tuning the large model on a 6 GB TSV file and get a TPU usage of < 1%. Anything new here?

adarob commented 4 years ago

I'm not surprised that this would be I/O bound since your TSV is so large and would never be cached.

One thing to check is that your data and TPU are in the same region.

You can also shard your TSV into multiple files and pass in a path that can be globbed to find them all.
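As an illustration of the sharding suggestion, a small stand-alone helper (hypothetical, not part of T5) that splits one large TSV into shard files whose names match a glob like `train-*.tsv`:

```python
import os

def shard_tsv(src_path, out_dir, num_shards=8):
    """Round-robin the lines of src_path into num_shards shard files."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    outs = [
        open(os.path.join(
            out_dir, f"{base}-{i:05d}-of-{num_shards:05d}.tsv"), "w")
        for i in range(num_shards)
    ]
    try:
        with open(src_path) as src:
            for i, line in enumerate(src):
                outs[i % num_shards].write(line)  # distribute lines evenly
    finally:
        for f in outs:
            f.close()
    return [f.name for f in outs]
```

A glob such as `train-*-of-00008.tsv` can then be passed wherever a single TSV path was used before.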

The ideal solution would be to pre-tokenize the TSV into sharded TFRecord files. It wouldn't be too hard to write an Apache Beam script to do this, but it's not something I'll have time to do for a few weeks.
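A minimal sketch of the pre-tokenization idea, assuming a `tokenize()` callable and an inputs/targets row layout (both are placeholders; the Beam parallelism adarob mentions is omitted, this just shows the TFRecord serialization):

```python
import tensorflow as tf

def write_tfrecord(rows, path, tokenize):
    """Serialize (inputs, targets) text pairs as token-ID Examples."""
    with tf.io.TFRecordWriter(path) as writer:
        for inputs, targets in rows:
            feature = {
                "inputs": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=tokenize(inputs))),
                "targets": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=tokenize(targets))),
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```

Training-time input then becomes a cheap `tf.data.TFRecordDataset` read plus a fixed-schema parse, instead of tokenizing text on the fly.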


anatoly-khomenko commented 4 years ago

@f-lng , @adarob , I have ended up using the notebook provided here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb

It seems to use the TPU more effectively. At least, training that took more than 12 hours with the script from this issue completes in under 4 hours with the notebook.

The size of the dataset is under 500 MB, though.

@adarob , thank you for providing the notebook!

f-lng commented 4 years ago

@adarob I was not aware that the size of the TSV file could be an issue; I assumed that the code would just read chunks of it. Thank you for clarifying, I will try to pre-shard it.

@anatoly-khomenko Thanks for letting me know, I will have a look at the notebook as well.

adarob commented 4 years ago

It does read chunks, but if it's sharded it can read multiple chunks in parallel.
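The parallel reads described here correspond to `tf.data`'s `interleave`; a sketch (the helper name and glob pattern are illustrative, not T5's pipeline code):

```python
import tensorflow as tf

def sharded_lines(pattern, cycle_length=8):
    """Read lines from all shards matching `pattern`, several at a time."""
    files = tf.data.Dataset.list_files(pattern, shuffle=False)
    ds = files.interleave(
        tf.data.TextLineDataset,
        cycle_length=cycle_length,  # how many shards are open concurrently
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Overlap file I/O with accelerator steps.
    return ds.prefetch(tf.data.experimental.AUTOTUNE)
```

With a single unsharded file, `cycle_length` effectively degenerates to 1, which is why pre-sharding helps.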


f-lng commented 4 years ago

@adarob I have now taken 7,000 examples from my dataset and precomputed 7 TFRecord files from them.

To create the TFRecords I directly used the T5 'Task' ( https://pastebin.com/36pG5ne4 )

I then adjusted your _get_cached_dataset function (and made sure it's called) to load them ( https://pastebin.com/cvTzxEXh ). The debug print shows up, so the function is called, and the in-memory caching is also working.

I am using code adapted from your notebook to train the model ( https://pastebin.com/8am5S5H2 ).

However, I am still getting a speed of ~50 examples/second and a TPU CPU usage of <= 0.15% (sic) during training most of the time, with some spikes (1-2%).

[screenshot: TPU CPU usage graph]

(The naive approach of putting my huge TSV into the command-line util gave me ~90 examples/sec.)

I do not have a lot of experience with the TF ecosystem apart from some hacking around in Tensor2Tensor, and none with TPUs, so perhaps I am missing something important?

By the way, I just checked: the buckets, the TPU, and the VM are all in us-central1(-a), and it is a TPU v3-8.

f-lng commented 4 years ago

@adarob I did another experiment: I set `tokens_per_batch` to 1024^2 and trained on ~250k data points. The examples/second stayed at ~50. (Also note that I got OOM errors with such high batch sizes when training via the CLI, but did not get one this time.)

caffeinetoomuch commented 4 years ago

@adarob @anatoly-khomenko I am having a similar issue while trying to fine-tune on a GPU. When training, GPU usage is less than 7% and there is huge CPU usage (it has to be I/O bound). I also ended up using the following notebook, even with the example data provided: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb Moreover, I even tried the parameter `mesh_train_dataset_fn.use_cached = True`. Any suggestion or correction on what I might be doing wrong?

adarob commented 4 years ago

`use_cached=True` won't work unless you have run the `cache_tasks_main` preprocessing (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/cache_tasks_main.py). You should also check that your data is stored in the same region as your TPU/GPU. I'm not sure what else could be causing this issue.
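A hedged sketch of running that preprocessing as a Beam pipeline (flag names should be verified against `cache_tasks_main.py`; the task name and cache path here are placeholders):

```shell
# Pre-tokenize and cache a task offline, so --gin_param="...use_cached = True"
# can find it. Values below are placeholders; check the script's flags first.
python -m t5.data.cache_tasks_main \
  --tasks="super_glue_boolq_v102" \
  --output_cache_dir="${BUCKET}/cache" \
  --pipeline_options="--runner=DirectRunner"
```

The resulting cache directory is what the `additional_task_cache_dirs` lookup above was searching for.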

NaxAlpha commented 4 years ago

Just to add here (it is a hypothesis, but): a TPU has both a normal (ARM- or x86-based) CPU and the accelerator (matrix multiplication units), and the cloud console only shows how much the CPU is being used, not the accelerator.

In the case of T5, the tf.data graph is uploaded to the TPU; it uses the TPU's CPU to execute non-deep-learning ops, like loading from GCS and preprocessing/tokenizing, and then uses the accelerator for the deep-learning training itself.

If you want to see how much the TPU's accelerator is being used, you can use TPU profiling. In my case, the CPU was being used at <0.1%, but the accelerator was at ~45%.
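A hedged sketch of capturing such a profile with the Cloud TPU profiler (package and flag names as of the TF 1.15 era; verify against current docs, and reuse the `TPU_NAME`/`MODEL_DIR` variables from the reproduction steps above):

```shell
# Install the profiler and capture ~10 s of TPU activity into the model dir.
pip install --upgrade cloud-tpu-profiler
capture_tpu_profile --tpu="${TPU_NAME}" --logdir="${MODEL_DIR}" --duration_ms=10000

# Inspect MXU (accelerator) utilization in TensorBoard's Profile tab.
tensorboard --logdir="${MODEL_DIR}"
```

The profile's MXU utilization figure is the number that actually reflects accelerator load, as opposed to the console's CPU graph.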