kimiyoung / transformer-xl


Issues with wt103_large_tpu.sh #57

Open bmccann opened 5 years ago

bmccann commented 5 years ago

It seems VLIDA_BSZ should be replaced with VALID_BSZ in a couple of places. For example:

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L42
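
Until that's patched, a quick workaround (assuming VALID_BSZ is the intended spelling everywhere) is to rename the misspelled variable in place:

# run from the repo root; uses GNU sed's in-place flag
sed -i 's/VLIDA_BSZ/VALID_BSZ/g' tf/scripts/wt103_large_tpu.sh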

bmccann commented 5 years ago

BSZ also appears to be undefined:

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L93

Perhaps something like this is intended:

BSZ=$(($TRAIN_BSZ * $NUM_HOST))
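
Just to spell out the convention I have in mind (the values below are placeholders, not the script's actual defaults):

# placeholder values; in the script these come from the variables at the top
NUM_HOST=4          # number of TPU hosts
TRAIN_BSZ=16        # per-host training batch size
BSZ=$((TRAIN_BSZ * NUM_HOST))         # aggregate batch size across all hosts
echo "aggregate batch size: ${BSZ}"   # 64 with these placeholder values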

bmccann commented 5 years ago

I'm using a TPU v3, but I left NUM_CORES=16 as set here:

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L10

This results in the following error:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...0): num_replicas should be (8), got (64). For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

Setting NUM_CORES=8 appears to fix this one.
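
In case it helps someone else, I double-checked what slice I actually have with something like the following (TPU name and zone are placeholders); the acceleratorType field should read v3-8 for a single-host donut:

gcloud compute tpus describe my-tpu --zone=us-central1-a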

kimiyoung commented 5 years ago

I'm not super familiar with Cloud TPUs. TPU v3 in general has two configurations: small slices (like the one you are using) have 8 cores per host, and large pods have 16 cores per host. Sorry for the confusion.

bmccann commented 5 years ago

How do you configure multiple hosts? It seems I also need to set NUM_HOSTS=1, or I get similar errors about num_shards.

kimiyoung commented 5 years ago

And both NUM_CORE=8 and NUM_CORE=16 result in the same error when you use more than one host?

bmccann commented 5 years ago

They are very similar. With 16 cores and 4 hosts:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): **num_replicas should be (8), got (64)**. For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

with 8 cores and 4 hosts:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): **num_replicas should be (8), got (32)**. For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

Setting NUM_HOST to 1 and NUM_CORE to 8 gives 8 * 1 = 8 num_replicas, and it works out. I'm not sure whether I need to do something before running to start up multiple hosts, or whether the TPU configuration code should be handling that (I'm fairly new to TPUs, as you can probably tell).
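
In other words, the script seems to hand TPUConfig a num_shards of NUM_HOST * NUM_CORE, and that product has to match the total core count of the slice. The working configuration checks out:

# 1 host * 8 cores = 8, matching the "num_replicas should be (8)" in the error
NUM_HOST=1
NUM_CORE=8
echo "num_shards = $((NUM_HOST * NUM_CORE))"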

kimiyoung commented 5 years ago

It looks like you have started only one host instead of four. You would probably need to refer to the Cloud TPU docs for how to start multiple hosts. I have not used Cloud TPUs before.

ijkilchenko commented 5 years ago

Both v2 and v3 have 1 host if they are donuts (the 2x2 configuration). In that case, both v2 and v3 have 4 chips and 8 cores (2 cores per chip). Yes, you can get bigger configurations/slices (such as 4x4, 8x8, or a full pod), and those will have more than 1 host.
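
If you want multiple hosts, as far as I know you just request a larger slice when creating the TPU; roughly something like this (name, zone, accelerator type, and TF version are placeholders to adjust):

gcloud compute tpus create my-tpu \
  --zone=us-central1-a \
  --accelerator-type=v3-32 \
  --version=1.13

A 4x4 slice like v3-32 comes with more than one host, whereas the 2x2 donut is always a single host.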

bmccann commented 5 years ago

Got it. Thanks to both of you!

bmccann commented 5 years ago

TRAIN_BSZ now seems to be used in an inconsistent way.

In https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L41 it is the per_host_train_batch_size

and in https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L93

it is used as the aggregate batch size, which is then divided by the number of hosts at

https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/train.py#L344
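
One way to make the two usages consistent (just a sketch of what I think is intended, using a hypothetical GLOBAL_TRAIN_BSZ name) would be to keep TRAIN_BSZ as the per-host value and derive the aggregate from it:

# sketch only: keep TRAIN_BSZ per-host (as line 41 of the script assumes) and
# derive the aggregate that train.py then divides back down by the number of hosts
GLOBAL_TRAIN_BSZ=$((TRAIN_BSZ * NUM_HOST))
# ...then pass ${GLOBAL_TRAIN_BSZ} where the script currently passes ${TRAIN_BSZ} at line 93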

ijkilchenko commented 5 years ago

I think the way I've fixed it is by setting NUM_HOST=1 (https://github.com/kimiyoung/transformer-xl/blob/master/tf/scripts/wt103_large_tpu.sh#L9), but I agree that there are inconsistencies that make it hard to reproduce the work. The intention seems to switch between a per-host batch size and a global batch size depending on which file you use.

bmccann commented 5 years ago

Right, this is only a problem now that I'm experimenting with NUM_HOST > 1.