How long is inference on A100-40GB expected to take?

I assume launch_inference.sh is meant to run inference on the motorcycle image, but it has been running for over 30 minutes with no end in sight. I also noticed that it calls launch.py in --train mode. Is this intended?
Here's the log:
(base) ...:~/zeronvs$ sh launch_inference.sh 
/opt/conda/lib/python3.10/site-packages/controlnet_aux/mediapipe_face/mediapipe_face_common.py:7: UserWarning: The module 'mediapipe' is not installed. The package will have limited functionality. Please install it using the command: pip install 'mediapipe'
  warnings.warn(
Seed set to 0
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /home/rkabra_google_com/.cache/torch/hub/checkpoints/vgg16-397923af.pth
100%|███████████████████████████████████████████████████████████████████████████████████| 528M/528M [00:01<00:00, 291MB/s]
Loading model from: /opt/conda/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth
[INFO] Using 16bit Automatic Mixed Precision (AMP)
[INFO] GPU available: True (cuda), used: True
[INFO] TPU available: False, using: 0 TPU cores
[INFO] HPU available: False, using: 0 HPUs
[INFO] You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[INFO] single image dataset: load image motorcycle.png torch.Size([1, 128, 128, 3])
[INFO] single image dataset: load image motorcycle.png torch.Size([1, 128, 128, 3])
[INFO] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[INFO] 
  | Name       | Type                          | Params | Mode 
---------------------------------------------------------------------
0 | geometry   | ImplicitVolume                | 12.6 M | train
1 | material   | DiffuseWithPointLightMaterial | 0      | train
2 | background | SolidColorBackground          | 0      | train
3 | renderer   | NeRFVolumeRenderer            | 767 K  | train
4 | lpips_fn   | LPIPS                         | 14.7 M | eval 
---------------------------------------------------------------------
13.4 M    Trainable params
14.7 M    Non-trainable params
28.1 M    Total params
112.350   Total estimated model params size (MB)
[INFO] Validation results will be saved to outputs/zero123/[128, 256]_motorcycle.png_prog1000@20241019-124707/save
[INFO] Loading Zero123 ...
SDS distillation only, disabling some functionality...
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.53 M params.
Keeping EMAs of 688.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
100%|███████████████████████████████████████| 890M/890M [00:18<00:00, 51.5MiB/s]
[INFO] Loaded Zero123!
/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
Epoch 0: |                                                             | 1356/? [31:53<00:00,  0.71it/s, train/loss=77.20]
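
For reference, the last progress line can be turned into a rough time estimate. This is only a sketch: `parse_elapsed` and `remaining_seconds` are hypothetical helper names, and `assumed_max_steps` is a placeholder value, since the log does not show the configured step budget (the bar reads `1356/?`).

```python
def parse_elapsed(stamp: str) -> int:
    """Convert a tqdm-style 'MM:SS' or 'HH:MM:SS' string to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def remaining_seconds(steps_done: int, elapsed_s: float, max_steps: int) -> float:
    """Estimate remaining wall time, assuming the average rate so far holds."""
    rate = steps_done / elapsed_s  # iterations per second
    return max(max_steps - steps_done, 0) / rate


# Values read from the progress line: 1356 steps in 31:53.
steps_done = 1356
elapsed_s = parse_elapsed("31:53")
print(f"average rate: {steps_done / elapsed_s:.2f} it/s")

# ASSUMPTION: placeholder step budget, not taken from the log or the config.
assumed_max_steps = 10_000
eta_s = remaining_seconds(steps_done, elapsed_s, assumed_max_steps)
print(f"estimated remaining: {eta_s / 3600:.1f} h for {assumed_max_steps} steps")
```

The average rate computed this way agrees with the 0.71it/s shown in the bar, so the run appears to be progressing at a steady pace rather than hanging; what is unknown is how many steps it is configured to take.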
kylesargent / ZeroNVS, issue #27