Closed: neel04 closed this issue 2 years ago.
CoAtNet0 is the only one I ever tried training, using a single GPU. For the time usage, CoAtNet0 on ImageNet with one GPU is about 1 hour per epoch. For TPU, have no experience. You may try CoAtNet0 or another larger model with pretrained weights first; if CoAtNet0 yields a sound result, you may consider training a larger one on ImageNet then. Still haven't tried it myself.

Thanks for the quick response! 🤗 What was the GPU you were using?
> For TPU, have no experience.
Cool, I'll be training the model in a few weeks' time. If I manage to get it working on TPUs, or have any hiccups I'll surely share it here on a separate thread.
Have a great day!
For input_shape 160, it's a single RTX 2080 Ti; for input_shape 224, it's an RTX 8000. Looking forward to your results then. :)
I just did a pretty simple trial run on Colab w/ TPUs to see if it's working:
!pip install keras_cv_attention_models
!git clone https://github.com/leondgarse/keras_cv_attention_models.git
!cd keras_cv_attention_models; TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 ./train_script.py -m coatnet.CoAtNet0 --seed 0 --batch_size 128 -s CoAtNet0_160 --TPU -d 'cifar10' --disable_float16
for reproduction, and the error:
>>>> ALl args: Namespace(TPU=True, additional_model_kwargs={}, basic_save_name='CoAtNet0_160', batch_size=128, bce_threshold=0.2, cutmix_alpha=1.0, data_name='cifar10', disable_antialias=False, disable_float16=False, disable_positional_related_ops=False, distill_loss_weight=1, distill_temperature=10, enable_float16=True, epochs=-1, eval_central_crop=0.95, freeze_backbone=False, freeze_norm_layers=False, initial_epoch=0, input_shape=160, label_smoothing=0, lr_base_512=0.008, lr_cooldown_steps=5, lr_decay_on_batch=False, lr_decay_steps=100, lr_m_mul=0.5, lr_min=1e-06, lr_t_mul=2, lr_warmup=0.0001, lr_warmup_steps=5, magnitude=6, mixup_alpha=0.1, model='coatnet.CoAtNet0', num_layers=2, optimizer='LAMB', pretrained=None, random_crop_min=0.08, random_erasing_prob=0, rescale_mode='torch', resize_method='bicubic', restore_path=None, seed=0, summary=False, teacher_model=None, teacher_model_input_shape=-1, teacher_model_pretrained='imagenet', tensorboard_logs='auto', token_label_file=None, token_label_loss_weight=0.5, weight_decay=0.02)
2022-05-21 23:25:23.558107: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[TPU] All devices: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]
>>>> Set random seed: 0
2022-05-21 23:25:40.850248: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "NOT_FOUND: Error executing an HTTP request: HTTP response code 404".
>>>> init_model kwargs: {'input_shape': (160, 160, 3)}
>>>> Built model name: coatnet0
>>>> RandAugment: magnitude = 6, translate_const = 0.450000, cutout_const = 28.800000
>>>> Both mixup_alpha and cutmix_alpha provided: mixup_alpha = 0.1, cutmix_alpha = 1.0
>>>> Loss: BinaryCrossEntropyTimm, Optimizer: LAMB
>>>> basic_save_name = CoAtNet0_160
>>>> TensorBoard log path: logs/CoAtNet0_160_20220521-232552
Traceback (most recent call last):
  File "./train_script.py", line 223, in <module>
    run_training_by_args(args)
  File "./train_script.py", line 214, in run_training_by_args
    model, epochs, train_dataset, test_dataset, args.initial_epoch, lr_scheduler, args.basic_save_name, logs=args.tensorboard_logs
  File "/content/keras_cv_attention_models/keras_cv_attention_models/imagenet/train_func.py", line 251, in train
    workers=8,
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1315, in graph
    "Tensor.graph is undefined when eager execution is enabled.")
AttributeError: Tensor.graph is undefined when eager execution is enabled.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2685, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 740, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.NotFoundError: Resource tpu_worker/_AnonymousVar7044/N10tensorflow22SummaryWriterInterfaceE does not exist.
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2022-05-21 23:25:53.361613: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1653175553.358294014","description":"Error received from peer ipv4:10.19.39.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0","grpc_status":3}
I'll keep working to fix it in my spare time; otherwise, I'll have to fully commit to debugging this once my exams are over 😉
It's my fault: I added a keras.callbacks.TensorBoard callback recently, and set it to write to a local path by default. I've just changed it to be disabled by default, and updated a basic training result on MNIST in kecam_test.ipynb, since MNIST is available in a public GCS bucket. Please also note that other datasets like cifar10 need a custom data loading process from an uploaded GCS path, which I'm not very familiar with...
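For anyone hitting the same trace, here is a minimal sketch of the kind of guard that avoids it; build_callbacks and its arguments are hypothetical names, not the repo's API, and on TPU the log path would need to be a gs:// bucket:

import tensorflow as tf

# Hypothetical helper (not part of train_script.py): only attach a
# TensorBoard callback when its log path is usable. On a Colab/Cloud TPU
# the summary writer must point at a GCS path (gs://...); a local directory
# triggers errors like the SummaryWriterInterface one in the traceback above.
def build_callbacks(log_dir=None, on_tpu=False):
    callbacks = []
    if log_dir and (not on_tpu or log_dir.startswith("gs://")):
        callbacks.append(tf.keras.callbacks.TensorBoard(log_dir=log_dir))
    return callbacks

# On TPU, something like: build_callbacks("gs://my-bucket/logs", on_tpu=True)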
Thanks for the speedy reply! You're right: apparently GCP requires one to use a bucket to train on TPUs, or to use a dataset available via TFDS. Swapping CIFAR-10 for the flowers dataset yields no problems. Amazing work! 🌟
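For reference, a rough sketch of how a TFDS-hosted dataset can feed a TPU run without a personal bucket; this is not the repo's own -d data pipeline, just an illustration, and the function name and preprocessing are placeholders:

import tensorflow as tf
import tensorflow_datasets as tfds

# Load a dataset that TFDS already mirrors in its public GCS bucket, so the
# TPU workers can read it without any bucket of our own. Resolution and
# batch size mirror the trial run above; preprocessing is deliberately minimal.
def flowers_dataset(input_size=160, batch_size=128):
    ds = tfds.load("tf_flowers", split="train", as_supervised=True, try_gcs=True)

    def preprocess(image, label):
        image = tf.image.resize(image, (input_size, input_size))
        return tf.cast(image, tf.float32) / 255.0, label

    return (
        ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size, drop_remainder=True)  # TPUs want fixed batch shapes
        .prefetch(tf.data.AUTOTUNE)
    )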
You may open another issue or discussion if anything new comes up. Will just close this one for now. :)
It works! I'm just fixing some stuff and integrating WandB support. I must say, your codebase is remarkably clean and modular considering you're working with TF! Pretty terrific; it makes debugging very easy 👍
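In case it helps others, a minimal sketch of what the WandB hookup could look like with plain Keras; the project name and config values are placeholders, not the eventual integration:

import wandb
from wandb.keras import WandbCallback

# Start a run and log the basic config; the fit() call is shown as a comment
# since the model/dataset come from the surrounding training script.
wandb.init(project="coatnet-tpu", config={"model": "coatnet.CoAtNet0", "input_shape": 160})
# model.fit(train_dataset, epochs=..., callbacks=[WandbCallback(save_model=False)])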
I'm rather glad to hear this!
Hi, 👋 Thanks for such an amazing library, and for taking the time to implement so many parts of the CoAtNet paper!
In your CoAtNet README, you mentioned you use TPU accelerators. Could you provide a ballpark for how long it took you to train the biggest models, and the corresponding accelerators? I have a task for which I wish to use scaled-up models, but I'd have to pre-train on ImageNet first because of the low amount of data (<5-10M samples) and then squeeze out maximum accuracy from fine-tuning.
I assume there might've been a few bottlenecks too, perhaps data? 🤔 If you could describe your setup, it would be very helpful for my experiments!
Sorry for bothering you with minor questions, and again thank you for all your work!
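For context, here's roughly the fine-tuning flow I have in mind (assuming the library's usual pretrained=/num_classes= keyword arguments; the dataset, head size, and hyper-parameters are placeholders):

import tensorflow as tf
from keras_cv_attention_models import coatnet

# Instantiate CoAtNet0 with ImageNet weights and a new classification head,
# then fine-tune with a small learning rate on the downstream dataset.
model = coatnet.CoAtNet0(input_shape=(224, 224, 3), num_classes=10, pretrained="imagenet")
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),  # small LR for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=["accuracy"],
)
# model.fit(my_dataset, epochs=10)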