Issue opened by anatoly-khomenko
This means the model must be I/O bound, due in part to its small size. We do tokenization and packing on the fly by default. I have a TODO to add support for caching smaller datasets in memory post-tokenization. Let me see if I can get to it today so you can try it out.
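(For context: the in-memory caching being discussed boils down to placing a tf.data cache() call after tokenization. Below is a minimal sketch of the idea, not the actual T5 pipeline; the file path and tokenize_fn are hypothetical stand-ins.)

```python
import tensorflow as tf

# Hypothetical stand-in for SentencePiece tokenization.
def tokenize_fn(line):
    return tf.strings.split([line]).values

ds = tf.data.TextLineDataset("train.tsv")  # assumed input file
ds = ds.map(tokenize_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
# With no filename argument, cache() holds the elements in RAM after the
# first full pass, so later epochs skip reading and tokenizing entirely.
ds = ds.cache()
ds = ds.repeat()
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```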
Thank you @adarob , let me know if I can help you to implement something. I'm launching the T5 for one of my current working tasks and I'm eager to make it train faster.
Can you try using the latest commit to see if this improves? It will cache the dataset on the first pass, so it will be much faster after that.
@adarob Hi Adam, thank you for the update. I'm running it now (I had to update Python to 3.6 to be able to run the latest version, which took a good part of the day). From what I see it did not improve; the timing parameters are pretty much the same (global_step/sec: 0.401189, examples/sec: 821.635), but strangely the TPU load decreased to 1%.
Can you maybe hardcode it to use ds = ds.cache(), just to make sure it is getting enabled?
Hi @t5-copybara, as far as I understand it is here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L654
I will comment out the condition and keep ds = ds.cache().
Please let me know if this is the correct approach.
Thank you!
@t5-copybara,
I have found that I can specify use_cached via the command-line parameters like this:
--gin_param="mesh_train_dataset_fn.use_cached = True"
This is the complete command:
t5_mesh_transformer --tpu="${TPU_NAME}" --gcp_project="${PROJECT}" --tpu_zone="${ZONE}" --model_dir="${MODEL_DIR}" --t5_tfds_data_dir="${DATA_DIR}" --gin_file="dataset.gin" --gin_param="mesh_train_dataset_fn.use_cached = True" --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" --gin_param="MIXTURE_NAME = 'super_glue_boolq_v102'" --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
When I run it, though, I get the exception here:
Do you know where I can specify cache directories?
OK, I have found out that I can specify cache directories using:
--additional_task_cache_dirs="${CACHE_DIR}"
but in this case the cache does not get created either.
This is the message I get:
22:18:44.970715 140004609750016 utils.py:584] 'super_glue_boolq_v102' does not exist in any task cache directories (searched ['gs://uniquebucketname/t5-boolq-data-dir-cache/super_glue_boolq_v102']).
Giving up for now.
The offline use_cached stuff is only supported on our internal infrastructure for the time being. What I added for you is something that will do the caching on the fly. You should be able to explicitly enable it as you mentioned above ("I will comment out the condition and keep ds = ds.cache()"). Are you sure you're actually using this new code when you run, and not what's in the pip package?
@adarob Hi Adam, I'll give it a try on another run. But I see several places with ds = ds.cache(); do I have to make the change in all of them?
I'm using the latest master, installed from source with the command:
pip install --upgrade -e ./text-to-text-transfer-transformer
and I see all the recent fixes there, so I'm pretty sure I'm using the most recent version.
Will let you know as soon as I run it again.
I am currently fine-tuning the large model on a 6 GB TSV file and get a TPU usage of < 1%. Anything new here?
I'm not surprised that this would be I/O bound since your TSV is so large and would never be cached.
One thing to check is that your data and TPU are in the same region.
You can also shard your TSV into multiple files and pass in a path that can be globbed to find them all.
The ideal solution would be to pre-tokenize the TSV into sharded TFRecord files. It wouldn't be too hard to write a Beam script to do this, but it's not something I'll have time to do for a few weeks.
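(For illustration, a rough sketch of what such a Beam script could look like. The bucket paths are placeholders, and for simplicity this version writes raw text features rather than pre-tokenized ids; tokenization could be added inside to_example.)

```python
import apache_beam as beam
import tensorflow as tf

def to_example(line):
    # Assumes two tab-separated fields per line: input text and target text.
    inputs, targets = line.split("\t")
    features = {
        "inputs": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[inputs.encode("utf-8")])),
        "targets": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[targets.encode("utf-8")])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=features)).SerializeToString()

with beam.Pipeline() as pipeline:
    _ = (pipeline
         | beam.io.ReadFromText("gs://my-bucket/train.tsv")  # placeholder path
         | beam.Map(to_example)
         | beam.io.WriteToTFRecord("gs://my-bucket/cache/train",
                                   file_name_suffix=".tfrecord",
                                   num_shards=64))
```

Running it with the Dataflow runner parallelizes the conversion; the default DirectRunner is fine for smaller files.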
@f-lng, @adarob, I have ended up using the notebook provided here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
It seems to use the TPU more effectively: training that took more than 12 hours using the script from this issue completes in less than 4 hours with the notebook.
The size of the dataset is under 500 MB, though.
@adarob, thank you for providing the notebook!
@adarob I was not aware that the size of the TSV file could be an issue; I assumed the code would just read it in chunks. Thank you for clarifying, I will try to pre-shard it.
@anatoly-khomenko Thanks for letting me know, I will have a look at the notebook as well.
It does read chunks, but if it's sharded it can read multiple chunks in parallel.
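(To make that concrete, here is a minimal tf.data sketch of reading sharded files in parallel. The glob pattern is a placeholder; on older TF versions, tf.data.experimental.parallel_interleave plays the role of the num_parallel_calls argument.)

```python
import tensorflow as tf

# Each shard becomes its own TextLineDataset; interleave pulls lines
# from several shards concurrently instead of one big file serially.
files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tsv")  # placeholder glob
ds = files.interleave(
    tf.data.TextLineDataset,
    cycle_length=16,  # number of shards read at the same time
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```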
@adarob I have now taken 7,000 examples from my dataset and precomputed 7 TFRecord files from them.
To create the TFRecords I directly used the T5 'Task' (https://pastebin.com/36pG5ne4).
I then adjusted your _get_cached_dataset function (and made sure it's called) to load them (https://pastebin.com/cvTzxEXh). The debug print is showing, so the function is called, and the in-memory caching is also working.
I am using code adapted from your notebook to train the model (https://pastebin.com/8am5S5H2).
However, I am still getting a speed of ~50 examples/second and a TPU CPU usage of <= 0.15% (sic) during training most of the time, with some spikes (1-2%).
(The naive approach of feeding my huge TSV to the command-line util gave me ~90 examples/sec.)
I do not have a lot of experience with the TF ecosystem, apart from some hacking around in Tensor2Tensor, and none with TPUs, so perhaps I am missing something important?!
By the way, I just checked: the buckets, the TPU, and the VM are all in us-central1(-a), and it is a TPU v3-8.
@adarob I did another experiment: I set tokens_per_batch to 1024^2 and trained on ~250k datapoints. The examples/second stayed at ~50. (Also note that I got OOM errors with such high batch sizes when training using the CLI, but did not get one this time.)
@adarob @anatoly-khomenko I am having a similar issue while trying to fine-tune on a GPU. When training, GPU usage is less than 7% and there is huge CPU usage (it has to be I/O bound). I also ended up using the following notebook, even with the example data provided:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
Moreover, I even tried the parameter mesh_train_dataset_fn.use_cached = True.
Any suggestion or correction on what I might be doing wrong?
use_cached=True won't work unless you have run the cache_tasks_main preprocessing (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/cache_tasks_main.py). You should also check that your data is being stored in the same region as your TPU/GPU. I'm not sure what else could be causing this issue.
Just to add here (it is a hypothesis, but): a TPU has a normal (ARM- or x86-based) CPU alongside the accelerator (matrix multiplication units), and the Cloud Console does not show how much the accelerator is being used, only how much the CPU is being used.
In the case of T5, the tf.data graph is uploaded to the TPU; it uses the TPU's CPU to execute non-deep-learning ops like loading from GCS and preprocessing/tokenizing, and then uses the accelerator to do the deep learning training itself.
If you want to see how much the TPU's accelerator is being used, you can use TPU profiling. In my case the CPU was being used at <0.1%, but the accelerator was being used at ~45%.
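(For reference, a capture with the Cloud TPU profiler looks roughly like this; the capture_tpu_profile tool comes with the cloud-tpu-profiler pip package, and the exact flags may vary by version, so check --help:)
capture_tpu_profile --tpu=${TPU_NAME} --logdir=${MODEL_DIR}
The resulting trace can then be opened in TensorBoard's profile tab, which breaks down time spent on the host versus the accelerator.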
The problem
It seems that TPU utilization is not effective. The CPU load in the Google Cloud console is under 7% when fine-tuning:
Meanwhile, the performance of the fine-tuning seems to be pretty low (global_step/sec: 0.402755, examples/sec: 824.843):
How to reproduce
I use the following configuration, provided as an example of fine-tuning: