Training for ViTMatte-S is done. Here is the training log: https://gist.github.com/983/472b8e7a693be10f877ae85cd37325de
Results from `evaluation.py`:
The results are very good, so I guess things were configured correctly.
How do you display the loss change curve during the training process?
I ran the training script while simultaneously piping the output to a file named `log.txt` using `tee`:

```
CUDA_VISIBLE_DEVICES=0,1 python3 -u main.py --config-file configs/ViTMatte_S_100ep.py --num-gpus 2 2>&1 | tee log.txt
```

The `-u` is to get unbuffered output (otherwise, output only appears sporadically) and the `2>&1` is to also capture potential error messages (if there were any).
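If you prefer not to remember the `-u` flag, setting `PYTHONUNBUFFERED=1` in the environment should have the same effect:

```
CUDA_VISIBLE_DEVICES=0,1 PYTHONUNBUFFERED=1 python3 main.py --config-file configs/ViTMatte_S_100ep.py --num-gpus 2 2>&1 | tee log.txt
```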
I then extracted the numbers from `log.txt` using regular expressions and plotted them using matplotlib. I do not have the original plotting code anymore, but I think it was something like this:
```python
import re
import matplotlib.pyplot as plt

with open("log.txt") as f:
    lines = f.read().strip().split("\n")

# Skip the first part of the log
index = 0
while "Starting training from iteration 0" not in lines[index]:
    index += 1
lines = lines[index + 1:]

# Skip lines we are not interested in
lines = [line for line in lines if "Saving checkpoint to" not in line]
lines = [line for line in lines if "does not have enough unknown area for crop" not in line]
lines = [line for line in lines if "Overall training speed" not in line]
lines = [line for line in lines if "Total training time" not in line]

# Parse the event lines into a dict of columns
pattern = re.compile(r"^\[(\d+/\d+) (\d+:\d+:\d+) d2.utils.events\]: +eta: (.*?) iter: (\d+) +total_loss: (.*?) unknown_l1_loss: (.*?) known_l1_loss: (.*?) loss_pha_laplacian: (.*?) loss_gradient_penalty: (.*?) +time: (.*?) +last_time: (.*?) +data_time: (.*?) +last_data_time: (.*?) +lr: (.*?) +max_mem: (.*?)M$")
names = ["day", "time", "eta", "iter", "total_loss", "unknown_l1_loss", "known_l1_loss", "loss_pha_laplacian", "loss_gradient_penalty", "time", "last_time", "data_time", "last_data_time", "lr", "max_mem"]
results = {name: [] for name in names}
for line in lines:
    match = pattern.match(line)
    if match is None:
        continue  # Skip any remaining lines that do not look like event lines
    for name, group in zip(names, match.groups()):
        results[name].append(group)

# Plot the loss curves over iterations on a logarithmic y-axis
plt.figure(figsize=(10, 20))
results["iter"] = [int(x) for x in results["iter"]]
for name in ["total_loss", "unknown_l1_loss", "known_l1_loss", "loss_pha_laplacian", "loss_gradient_penalty"]:
    results[name] = [float(x) for x in results[name]]
    plt.semilogy(results["iter"], results[name], label=name)
plt.legend()
plt.savefig("loss.png", dpi=300)
```
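Alternatively, since the log lines come from detectron2 (`d2.utils.events`), the training run has probably also written a `metrics.json` file (one JSON object per line) into the output directory, which is easier to parse than the text log. Here is a sketch, assuming the default detectron2 `JSONWriter` was active and that the loss keys match the ones in the log; the `output_of_train/metrics.json` path is a placeholder you would need to adjust:

```python
import json
import matplotlib.pyplot as plt

# Placeholder path: point this at the metrics.json in your output directory
with open("output_of_train/metrics.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

loss_names = ["total_loss", "unknown_l1_loss", "known_l1_loss",
              "loss_pha_laplacian", "loss_gradient_penalty"]

plt.figure(figsize=(10, 20))
for name in loss_names:
    # Not every record contains every key, so keep only the
    # (iteration, value) pairs where this loss is actually present
    pairs = [(r["iteration"], r[name]) for r in records if name in r and "iteration" in r]
    if pairs:
        iters, values = zip(*pairs)
        plt.semilogy(iters, values, label=name)
plt.legend()
plt.savefig("loss.png", dpi=300)
```

A nice side effect is that missing or reordered keys do not break the parsing, unlike with the regular expression above.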
Is it possible to read the loss data from a `log.txt` file after training has completed and plot a loss curve from it?
First, I want to thank you for also uploading the training code. Unfortunately, many people only upload inference code, so this is nice to see.
I am currently trying to reproduce your results, and after a bit of fiddling (mostly NVIDIA-related issues), I could successfully start a training run. But before I spend several days of compute, it would be encouraging to know that I have set everything up correctly and that it works as intended. To verify this, it would be very helpful if you could upload your training logs somewhere.
Currently, my losses for the Composition-1k dataset with ViTMatte-S using two V100 GPUs look like this: