hustvl / ViTMatte

[Information Fusion (Vol.103, Mar. '24)] Boosting Image Matting with Pretrained Plain Vision Transformers

Training logs? #32

Closed · 983 closed this 6 months ago

983 commented 6 months ago

First, I want to thank you for also uploading the training code. Unfortunately, many people only upload inference code, so this is nice to see.

I am currently trying to reproduce your results, and after a bit of fiddling (mostly NVIDIA-related issues), I was able to start a training run. But before I spend several days of compute, it would be encouraging to know that I have set things up correctly and that everything works as intended. To verify this, it would be very helpful if you could upload your training logs somewhere.

Currently, my losses for the Composition-1k dataset with ViTMatte-S using two V100 GPUs look like this:

[03/11 17:30:37 d2.utils.events]:  eta: 1 day, 5:32:27  iter: 19  total_loss: 1.596  unknown_l1_loss: 0.3833  known_l1_loss: 0.4219  loss_pha_laplacian: 0.2485  loss_gradient_penalty: 0.5701    time: 1.4593  last_time: 2.0376  data_time: 1.3490  last_data_time: 1.5394   lr: 3.8462e-05  max_mem: 13416M
[03/11 17:31:11 d2.utils.events]:  eta: 2 days, 4:15:32  iter: 39  total_loss: 1.271  unknown_l1_loss: 0.3471  known_l1_loss: 0.2841  loss_pha_laplacian: 0.2096  loss_gradient_penalty: 0.4413    time: 1.5887  last_time: 0.9922  data_time: 1.2107  last_data_time: 0.4938   lr: 7.8422e-05  max_mem: 13420M
[03/11 17:31:46 d2.utils.events]:  eta: 2 days, 12:11:12  iter: 59  total_loss: 1.209  unknown_l1_loss: 0.2812  known_l1_loss: 0.2462  loss_pha_laplacian: 0.1737  loss_gradient_penalty: 0.5086    time: 1.6374  last_time: 1.8238  data_time: 1.2260  last_data_time: 1.3077   lr: 0.00011838  max_mem: 13420M
[03/11 17:32:16 d2.utils.events]:  eta: 2 days, 10:54:49  iter: 79  total_loss: 1.134  unknown_l1_loss: 0.2588  known_l1_loss: 0.2344  loss_pha_laplacian: 0.155  loss_gradient_penalty: 0.4721    time: 1.6055  last_time: 2.5264  data_time: 1.0279  last_data_time: 2.0413   lr: 0.00015834  max_mem: 13420M
[03/11 17:32:46 d2.utils.events]:  eta: 2 days, 8:23:43  iter: 99  total_loss: 1.043  unknown_l1_loss: 0.238  known_l1_loss: 0.2295  loss_pha_laplacian: 0.1402  loss_gradient_penalty: 0.4422    time: 1.5824  last_time: 0.4908  data_time: 1.0057  last_data_time: 0.0002   lr: 0.0001983  max_mem: 13420M
[...]
[03/11 17:57:01 d2.utils.events]:  eta: 2 days, 8:17:43  iter: 999  total_loss: 0.4402  unknown_l1_loss: 0.08424  known_l1_loss: 0.01566  loss_pha_laplacian: 0.06101  loss_gradient_penalty: 0.2855    time: 1.6125  last_time: 2.7538  data_time: 1.3407  last_data_time: 2.2811   lr: 0.0005  max_mem: 13421M
[...]
[03/11 18:23:28 d2.utils.events]:  eta: 2 days, 9:07:33  iter: 1999  total_loss: 0.3541  unknown_l1_loss: 0.06284  known_l1_loss: 0.0044  loss_pha_laplacian: 0.04422  loss_gradient_penalty: 0.2473    time: 1.5995  last_time: 0.5045  data_time: 1.1437  last_data_time: 0.0001   lr: 0.0005  max_mem: 13421M
[...]
[03/11 18:50:22 d2.utils.events]:  eta: 2 days, 9:07:25  iter: 2999  total_loss: 0.3096  unknown_l1_loss: 0.05786  known_l1_loss: 0.001829  loss_pha_laplacian: 0.04414  loss_gradient_penalty: 0.2118    time: 1.6039  last_time: 0.4861  data_time: 1.2241  last_data_time: 0.0004   lr: 0.0005  max_mem: 13423M
[...]
[03/11 19:16:27 d2.utils.events]:  eta: 2 days, 6:47:49  iter: 3999  total_loss: 0.32  unknown_l1_loss: 0.05314  known_l1_loss: 0.001046  loss_pha_laplacian: 0.03863  loss_gradient_penalty: 0.2317    time: 1.5940  last_time: 1.9088  data_time: 1.0848  last_data_time: 1.4184   lr: 0.0005  max_mem: 13423M
[...]
[03/11 19:42:25 d2.utils.events]:  eta: 2 days, 7:34:43  iter: 4999  total_loss: 0.2972  unknown_l1_loss: 0.05066  known_l1_loss: 0.0007398  loss_pha_laplacian: 0.03699  loss_gradient_penalty: 0.204    time: 1.5868  last_time: 2.4723  data_time: 1.1195  last_data_time: 2.0001   lr: 0.0005  max_mem: 13423M
[...]
[03/11 20:09:07 d2.utils.events]:  eta: 2 days, 6:12:37  iter: 5999  total_loss: 0.2781  unknown_l1_loss: 0.04635  known_l1_loss: 0.0004997  loss_pha_laplacian: 0.0348  loss_gradient_penalty: 0.1972    time: 1.5891  last_time: 1.9816  data_time: 1.1178  last_data_time: 1.4868   lr: 0.0005  max_mem: 13423M
[...]
[03/11 20:35:20 d2.utils.events]:  eta: 2 days, 6:06:32  iter: 6999  total_loss: 0.2854  unknown_l1_loss: 0.0475  known_l1_loss: 0.0003681  loss_pha_laplacian: 0.03289  loss_gradient_penalty: 0.202    time: 1.5867  last_time: 0.5729  data_time: 0.9335  last_data_time: 0.1039   lr: 0.0005  max_mem: 13423M
[...]
[03/11 21:01:54 d2.utils.events]:  eta: 2 days, 5:54:35  iter: 7999  total_loss: 0.2869  unknown_l1_loss: 0.04732  known_l1_loss: 0.0002785  loss_pha_laplacian: 0.03428  loss_gradient_penalty: 0.2033    time: 1.5875  last_time: 2.7392  data_time: 1.1434  last_data_time: 2.2507   lr: 0.0005  max_mem: 13423M
[...]
[03/11 21:28:22 d2.utils.events]:  eta: 2 days, 3:45:42  iter: 8999  total_loss: 0.2688  unknown_l1_loss: 0.04204  known_l1_loss: 0.0001971  loss_pha_laplacian: 0.03091  loss_gradient_penalty: 0.1913    time: 1.5875  last_time: 1.4255  data_time: 1.0987  last_data_time: 0.9286   lr: 0.0005  max_mem: 13423M

(image: plot of the training losses)

983 commented 6 months ago

Training for ViTMatte-S is done. Here is the training log: https://gist.github.com/983/472b8e7a693be10f877ae85cd37325de

Results from evaluation.py:

The results are very good, so I guess things were configured correctly.
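
For context, evaluation scripts for Composition-1k usually report SAD and MSE (plus Grad and Conn). Below is a minimal sketch of the first two, assuming pred and gt are float alpha mattes in [0, 1] and trimap marks unknown pixels with 128; this is not the repository's evaluation.py, and the exact scaling and region conventions vary between implementations:

import numpy as np

# Minimal sketch of the common SAD/MSE matting metrics (not the repo's evaluation.py).
# pred, gt: float alpha mattes in [0, 1]; trimap: map where 128 marks unknown pixels.
def sad_and_mse(pred, gt, trimap):
    unknown = trimap == 128
    diff = pred - gt
    sad = np.abs(diff).sum() / 1000.0    # SAD is conventionally reported divided by 1000
    mse = (diff[unknown] ** 2).mean()    # MSE is usually computed over the unknown region only
    return sad, mse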

(image: plot of the training losses)

davislee546 commented 4 months ago

How did you plot the loss curve for the training process?

983 commented 4 months ago

I ran the training script while piping its output to a file named log.txt using tee:

CUDA_VISIBLE_DEVICES=0,1 python3 -u main.py --config-file configs/ViTMatte_S_100ep.py --num-gpus 2 2>&1 | tee log.txt

The -u flag produces unbuffered output (otherwise, output only appears sporadically), and 2>&1 redirects stderr so that potential error messages (if there are any) are captured as well.

I then extracted the numbers from log.txt using regular expressions and plotted them with matplotlib. I do not have the original plotting code anymore, but I think it was something like this:

import re
import matplotlib.pyplot as plt

with open("log.txt") as f:
    lines = f.read().strip().split("\n")

# Skip the first part of the log
index = 0
while "Starting training from iteration 0" not in lines[index]:
    index += 1
lines = lines[index + 1:]

# Skip lines we are not interested in
lines = [line for line in lines if "Saving checkpoint to" not in line]
lines = [line for line in lines if "does not have enough unknown area for crop" not in line]
lines = [line for line in lines if "Overall training speed" not in line]
lines = [line for line in lines if "Total training time" not in line]

# Parse lines into dict
pattern = re.compile(r"^\[(\d+/\d+) (\d+:\d+:\d+) d2\.utils\.events\]: +eta: (.*?) iter: (\d+) +total_loss: (.*?) unknown_l1_loss: (.*?) known_l1_loss: (.*?) loss_pha_laplacian: (.*?) loss_gradient_penalty: (.*?) +time: (.*?) +last_time: (.*?) +data_time: (.*?) +last_data_time: (.*?) +lr: (.*?) +max_mem: (.*?)M$")
names = ["day", "time", "eta", "iter", "total_loss", "unknown_l1_loss", "known_l1_loss", "loss_pha_laplacian", "loss_gradient_penalty", "time", "last_time", "data_time", "last_data_time", "lr", "max_mem"]
results = {name: [] for name in names}
for line in lines:
    match = pattern.match(line)
    if match is None:
        continue  # ignore any remaining lines that do not report losses
    for name, group in zip(names, match.groups()):
        results[name].append(group)

# Plot the results
plt.figure(figsize=(10, 20))
results["iter"] = [int(x) for x in results["iter"]]
for name in ["total_loss", "unknown_l1_loss", "known_l1_loss", "loss_pha_laplacian", "loss_gradient_penalty"]:
    results[name] = [float(x) for x in results[name]]
    plt.semilogy(results["iter"], results[name], label=name)
plt.legend()
plt.savefig("loss.png", dpi=300)
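
The per-iteration losses are quite noisy, so the curves can be easier to read after smoothing. Here is a minimal sketch using a moving average, reusing the results dict from the script above (moving_average is a hypothetical helper, not part of the original code):

import numpy as np

# Hypothetical helper: smooth a noisy series with a simple moving average.
def moving_average(values, window=50):
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

smoothed = moving_average(results["total_loss"])
plt.figure()
plt.semilogy(results["iter"][: len(smoothed)], smoothed, label="total_loss (smoothed)")
plt.legend()
plt.savefig("loss_smoothed.png", dpi=300)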

davislee546 commented 4 months ago

Is it possible to read the loss data from the log.txt file after training has completed and plot a loss graph?