[General Questions] Rough estimates for training time for pre-training CoAtNet?

neel04 commented 2 years ago

Hi, 👋 Thanks for such an amazing library and taking out the time to implement so many parts of the CoatNet paper!

In your CoAtNet README, you mentioned you use TPU accelerators. Could you provide a ballpark for the amount of time it took for you to train the biggest models and the corresponding accelerators? I have a task for which I wish to use scaled-up models, but I'd have to pre-train on Imagenet first because of low data amount (<5-10M) and squeeze out maximum accuracy from fine-tuning.

I assume there might've been a few bottlenecks also, perhaps data? 🤔 If you could describe your setup, it would be very helpful to my experiments!

Sorry for bothering you with minor questions, and again thank you for all your work!

leondgarse commented 2 years ago

Those TPU descriptions in readme are from the article, so I don't know those time details either. It's CoAtNet0 only I ever tried training, using a single GPU. For this time usage, CoAtNet0 on imagenet with one GPU, it's about 1 hour per epoch. For TPU, have no experience.
Ya, other bottleneck or strategy may largely depends on dataset amount and quality. You may start with CoAtNet0 or other larger model with pretrained weights. If CoAtNet0 yield a sound result, may consider training a larger one on imagenet then. Still haven't tried it myself.

neel04 commented 2 years ago

Thanks for quick response! 🤗 What was the GPU you were using?

For TPU, have no experience.

Cool, I'll be training the model in a few weeks' time. If I manage to get it working on TPUs, or have any hiccups I'll surely share it here on a separate thread.

Have a great day!

leondgarse commented 2 years ago

For input_shape 160, it's by 1 RTX 2080 Ti, for input_shape 224, it's RTX 8000. Looking forward for your results then. :)

neel04 commented 2 years ago

I just did a pretty simple trial run on Colab w/ TPUs to see if its working:-

!pip install keras_cv_attention_models
!git clone https://github.com/leondgarse/keras_cv_attention_models.git
!cd keras_cv_attention_models; TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 ./train_script.py -m coatnet.CoAtNet0 --seed 0 --batch_size 128 -s CoAtNet0_160 --TPU -d 'cifar10' --disable_float16

for reproduction, and error:-

>>>> ALl args: Namespace(TPU=True, additional_model_kwargs={}, basic_save_name='CoAtNet0_160', batch_size=128, bce_threshold=0.2, cutmix_alpha=1.0, data_name='cifar10', disable_antialias=False, disable_float16=False, disable_positional_related_ops=False, distill_loss_weight=1, distill_temperature=10, enable_float16=True, epochs=-1, eval_central_crop=0.95, freeze_backbone=False, freeze_norm_layers=False, initial_epoch=0, input_shape=160, label_smoothing=0, lr_base_512=0.008, lr_cooldown_steps=5, lr_decay_on_batch=False, lr_decay_steps=100, lr_m_mul=0.5, lr_min=1e-06, lr_t_mul=2, lr_warmup=0.0001, lr_warmup_steps=5, magnitude=6, mixup_alpha=0.1, model='coatnet.CoAtNet0', num_layers=2, optimizer='LAMB', pretrained=None, random_crop_min=0.08, random_erasing_prob=0, rescale_mode='torch', resize_method='bicubic', restore_path=None, seed=0, summary=False, teacher_model=None, teacher_model_input_shape=-1, teacher_model_pretrained='imagenet', tensorboard_logs='auto', token_label_file=None, token_label_loss_weight=0.5, weight_decay=0.02)
2022-05-21 23:25:23.558107: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[TPU] All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]
>>>> Set random seed: 0
2022-05-21 23:25:40.850248: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "NOT_FOUND: Error executing an HTTP request: HTTP response code 404".
>>>> init_model kwargs: {'input_shape': (160, 160, 3)}
>>>> Built model name: coatnet0
>>>> RandAugment: magnitude = 6, translate_const = 0.450000, cutout_const = 28.800000
>>>> Both mixup_alpha and cutmix_alpha provided: mixup_alpha = 0.1, cutmix_alpha = 1.0
>>>> Loss: BinaryCrossEntropyTimm, Optimizer: LAMB
>>>> basic_save_name = CoAtNet0_160
>>>> TensorBoard log path: logs/CoAtNet0_160_20220521-232552
Traceback (most recent call last):
  File "./train_script.py", line 223, in <module>
    run_training_by_args(args)
  File "./train_script.py", line 214, in run_training_by_args
    model, epochs, train_dataset, test_dataset, args.initial_epoch, lr_scheduler, args.basic_save_name, logs=args.tensorboard_logs
  File "/content/keras_cv_attention_models/keras_cv_attention_models/imagenet/train_func.py", line 251, in train
    workers=8,
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1315, in graph
    "Tensor.graph is undefined when eager execution is enabled.")
AttributeError: Tensor.graph is undefined when eager execution is enabled.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2685, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 740, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.NotFoundError: Resource tpu_worker/_AnonymousVar7044/N10tensorflow22SummaryWriterInterfaceE does not exist.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2022-05-21 23:25:53.361613: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1653175553.358294014","description":"Error received from peer ipv4:10.19.39.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0","grpc_status":3}

I'll keep working to fix it in my spare time otherwise I'll have to fully commit to debugging this when my exams are over 😉

leondgarse commented 2 years ago

It's my fault added a keras.callbacks.TensorBoard callback recently, and set it enabled writing to a local path by default. Just changed it disabled by default, and updated basic training result on MNIST in kecam_test.ipynb, as MNIST is available in public gcs. Please also note for other dataset like cifar10, needs a custom data loading process from uploaded gcs path, which I'm not very familiar with...

neel04 commented 2 years ago

Thanks for the speedy reply! You're right, apparently, GCP requires one to use a bucket to train on TPUs, or use a dataset available at TFDS. swapping CIFAR-10 to the flowers dataset yields no problems. Amazing work! 🌟

leondgarse commented 2 years ago

You may open another issue or discussion if anything new. Will just close this ATM. :)

neel04 commented 2 years ago

works! I'm just fixing some stuff, and integrating WandB support. I must say, your codebase is remarkably clean and modular considering you're working with TF! pretty terrific, makes debugging very easy 👍

leondgarse commented 2 years ago

I'm rather glad heard this!

leondgarse / keras_cv_attention_models

[General Questions] Rough estimates for training time for pre-training CoAtNet? #61