ConsciousML opened this issue 3 years ago
If you were able to reproduce our results by running the provided model file, this issue is something you have to debug on your side. I would start with the data loading code, check whether your sample is actually getting loaded, and then work my way up the pipeline. If you run the Visdom server you should see your sample being visualized.
OK, thanks for the response.
I see a command line example to train on custom temporal data:
```
python src/train.py with \
    deformable \
    tracking \
    mot17 \
    full_res \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/custom_dataset_train_deformable \
    mot_path=data/custom_dataset \
    train_split=train \
    val_split=val \
    epochs=20
```
Quoting your paper:

> Simulate MOT from single images. The encoder-decoder multi-level attention mechanism requires substantial amounts of training data. Hence, we follow a similar approach as in [56] and simulate MOT data from the CrowdHuman [43] person detection dataset. The adjacent training frames t−1 and t are generated by applying random spatial augmentations to a single image. To simulate high frame rates as in MOT17 [28], we only randomly resize and crop of up to 5% with respect to the original image size.
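The idea quoted above (generate a "previous" frame t−1 by cropping at most 5% of the image away and shifting the boxes into the crop) can be sketched in a few lines. This is a hypothetical illustration, not TrackFormer's actual implementation; `simulate_prev_frame` and its signature are invented here:

```python
import random

def simulate_prev_frame(width, height, boxes, max_frac=0.05, rng=None):
    # Hypothetical sketch: simulate an adjacent frame t-1 from a single
    # image by a small random crop (at most max_frac of the image size),
    # then shift/clip the [x, y, w, h] boxes into the crop's coordinates.
    rng = rng or random.Random()
    dx = rng.uniform(0, max_frac) * width
    dy = rng.uniform(0, max_frac) * height
    crop_w, crop_h = width - dx, height - dy
    x0 = rng.uniform(0, dx)  # crop origin stays within the removed margin
    y0 = rng.uniform(0, dy)
    new_boxes = []
    for x, y, w, h in boxes:
        nx0, ny0 = max(x - x0, 0.0), max(y - y0, 0.0)
        nx1 = min(x + w - x0, crop_w)
        ny1 = min(y + h - y0, crop_h)
        if nx1 > nx0 and ny1 > ny0:  # drop boxes fully cropped out
            new_boxes.append([nx0, ny0, nx1 - nx0, ny1 - ny0])
    return (x0, y0, crop_w, crop_h), new_boxes
```

Because the shift is small, nearly all boxes survive into the second frame, which is what makes the pair look like two consecutive frames of a high-frame-rate video.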
What I want to do is exactly the same thing as training on CrowdHuman, but with my own data. Can you confirm that I'm using the right command line options, please?
Also, is `[xtl, ytl, w, h]` the right format, as I described earlier?
I've been debugging for two days and would greatly appreciate your help ;)
Thanks!
Edit: I made sure that the COCO-annotated file is well formed; I checked that every box lies inside the image. Unfortunately I can't see anything using Visdom.
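Checking that every box lies inside its image can be automated over the COCO dict itself. `check_coco_boxes` below is a hypothetical validation helper written for this sketch, assuming standard COCO `images`/`annotations` fields:

```python
def check_coco_boxes(coco):
    # Hypothetical validation helper: return the ids of annotations whose
    # [x, y, w, h] box has non-positive size or extends outside its image.
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    bad = []
    for ann in coco["annotations"]:
        img_w, img_h = sizes[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0 or x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            bad.append(ann["id"])
    return bad
```

Running this on the train.json before training makes it easy to rule out out-of-bounds boxes as the cause of downstream asserts.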
Since you are loading the `mot17` config, your command is correct if you want to train on a multi-object tracking dataset. But since this is not the case for you, you need to change a few things. For example:
```
python src/train.py with \
    deformable \
    tracking \
    crowdhuman \
    full_res \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/custom_dataset_train_deformable \
    crowdhuman_path=data/custom_dataset \
    train_split=train \
    val_split=val \
    epochs=20
```
Your removal of the `tracking` option was wrong, since you do want to do tracking. But you don't want to do it on a tracking (`mot`) dataset; you want a static (CrowdHuman-like) dataset, which then simulates tracking frame pairs.
What do you mean by "can't see anything" in Visdom?
After trying this, I hit an assert here: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/crowdhuman.py#L12
I fixed it by changing `train_split=train` to `crowdhuman_train_split=train`.
Then I noticed that by leaving `mot_path` at the default `data/MOT17`, it was searching for the `mot17` data here:
https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/mot.py#L161
I guess I also had to set `mot_path=data/custom_dataset` (as I don't want to train on `mot17`), or set `train_split` to `None`?
https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/mot.py#L190
It seems I now have a CUDA out of memory error. I will try on a bigger GPU.
Thanks again! You were very helpful.
I got rid of the `full_res` option and was able to perform 50 iterations.
```
python src/train.py with \
    deformable \
    tracking \
    crowdhuman \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/boxy_train_deformable \
    crowdhuman_path=/media/bigbro/Data/Dataset/Test/SmallCOCO \
    crowdhuman_train_split=train \
    mot_path=/media/bigbro/Data/Dataset/Test/SmallCOCO \
    train_split=None \
    val_split=val \
    epochs=50
```
But then I faced the same error here: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/util/box_ops.py#L51
I guess it is my input data. After converting to the COCO format, I cropped every bounding box from the image without error. I don't know what else to try. Do you have any clue how this can happen and how I can debug it?
By the way, is it possible to perform validation on static data (a CrowdHuman-like dataset)?
Yes, it is possible to perform validation on static data, but not for tracking metrics.
Try to use the Visdom visualization to see whether your data is formatted correctly and to get an overview of your losses etc. Besides that, it is hard to say where and why the error occurs if you don't provide the full stack trace. Does the assert error happen during validation, tracking evaluation, or somewhere else?
This is the stack trace:
```
Traceback (most recent call last):
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "src/train.py", line 336, in <module>
    train(args)
  File "src/train.py", line 263, in train
    visualizers['train'], args)
  File "src/trackformer/engine.py", line 106, in train_one_epoch
    outputs, targets, *_ = model(samples, targets)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "src/trackformer/models/detr_tracking.py", line 45, in forward
    prev_indices = self._matcher(prev_outputs_without_aux, prev_targets)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "src/trackformer/models/matcher.py", line 94, in forward
    box_cxcywh_to_xyxy(tgt_bbox))
  File "src/trackformer/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
```
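For context: the assert that fires here checks that each box's bottom-right corner is not above or left of its top-left corner. The subtlety is that any comparison involving NaN is False, so NaN predictions trip this assert even when no box is geometrically inverted. A minimal pure-Python sketch of the conversion and the check (mirroring the shape of `box_cxcywh_to_xyxy` and the assert in `box_ops.py`, not the repo's exact code):

```python
def box_cxcywh_to_xyxy(box):
    # DETR-style conversion: (cx, cy, w, h) -> (x0, y0, x1, y1)
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

def is_valid_xyxy(b):
    # Mirrors (boxes[:, 2:] >= boxes[:, :2]).all(): x1 >= x0 and y1 >= y0.
    # Any comparison with NaN is False, so NaN boxes fail this check too.
    return b[2] >= b[0] and b[3] >= b[1]
```

For example, `is_valid_xyxy(box_cxcywh_to_xyxy([float("nan")] * 4))` is `False`, which is why NaN model outputs surface as this AssertionError rather than a clearer message.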
I will try to replicate the training with CrowdHuman and MOT17 to see if I notice a difference with my ground truth.
Does this error occur immediately or after a few iterations? The former would be very strange indeed. I think this is related to your input data. You should debug the data flow and check whether everything works as when training on CrowdHuman. For example, is this code after the if statement executed properly:
Unfortunately, it occurs after a few iterations. OK, I will try to debug the code in this file.
I discovered that in Visdom, my training images seem cut off. Apparently it is not the case when I train with CrowdHuman: visdom.pdf
I'm trying to train from a top-view perspective, and it seems the image augmentation produces a weird result.
Edit: It is related to my input data, because it worked out of the box with CrowdHuman and MOT17. But I can't see any difference in the boxes; I made a script to visualize them, and it crops CrowdHuman and my own data in the same fashion.
The only difference is that I didn't restrict the data to at least 2 annotations per frame. Do you think this can have an impact?
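That restriction is easy to apply and test in isolation. A sketch, using a hypothetical `filter_min_anns` helper over a standard COCO dict (invented for this illustration, not part of the repo):

```python
from collections import Counter

def filter_min_anns(coco, min_anns=2):
    # Hypothetical pre-processing helper: drop images (and their
    # annotations) that carry fewer than min_anns annotations.
    counts = Counter(a["image_id"] for a in coco["annotations"])
    keep = {img["id"] for img in coco["images"] if counts[img["id"]] >= min_anns}
    return {**coco,
            "images": [i for i in coco["images"] if i["id"] in keep],
            "annotations": [a for a in coco["annotations"] if a["image_id"] in keep]}
```

Filtering this way before training rules out near-empty frames as a source of degenerate matching targets.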
> You should debug the data flow and check if everything works as when training on CrowdHuman. For example is this code after the if statement executed properly:
I debugged and didn't notice anything weird. My guess is that, because the RandomSizeCrop is more aggressive on my data, maybe at some point it crops out the detection?
I will check without the RandomSizeCrop.
Edit: Same error without the crop.
If it works for our CrowdHuman/MOT, you have to figure out what the difference to your custom dataset is. You said the boxes in `assert (boxes1[:, 2:] >= boxes1[:, :2]).all()` contain NaN values. This sounds like a problem that comes from training on your data, not directly from the data itself. Does the loss for a particular iteration go up a lot? What happens if you run a training and set the learning rates to zero? Does the error still occur?
I noticed that when I hit the assert, `out_bbox` as well as `out_probs` contain NaNs. I suppose these are the predictions of the model.
https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/matcher.py#L71
Same thing when I set the lr to 0.
I will go up the call stack to see where the NaNs first appear.
I noticed that the NaNs first appear in the `query_embed` variable:
https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/deformable_detr.py#L167
Do you have any clue where this is updated?
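One generic way to locate where NaNs first appear is to register forward hooks on every module and record which ones emit NaN outputs. This assumes PyTorch; `register_nan_hooks` is an invented debugging helper, not part of the repo:

```python
import torch
from torch import nn

def register_nan_hooks(model):
    # Invented debugging helper: attach a forward hook to every module
    # and record the class names of modules whose outputs contain NaNs.
    offenders = []

    def hook(module, inputs, output):
        outs = output if isinstance(output, (tuple, list)) else (output,)
        if any(torch.is_tensor(o) and torch.isnan(o).any() for o in outs):
            offenders.append(type(module).__name__)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    return offenders, handles

# Usage sketch on a toy model: feed NaNs and see which modules propagate them.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
offenders, handles = register_nan_hooks(model)
model(torch.full((1, 4), float("nan")))
for h in handles:
    h.remove()
```

The first entry in `offenders` points at the earliest module in the forward pass that produced NaNs, which narrows the search much faster than stepping through manually. PyTorch's built-in `torch.autograd.set_detect_anomaly(True)` is a complementary option for NaNs arising in the backward pass.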
But you said NaNs appear even if you set all the learning rates to zero? This would contradict the query embeddings becoming NaN. Double-check whether this still happens if you set all rates to zero or comment out the optimizer step.
You were right! I might have forgotten to set one of the learning rates to 0. When I comment out the optimizer step, it successfully runs a whole epoch.
Hi @timmeinhardt,
I'm still stuck and can't fix the problem. Any idea what to do?
How do the loss values develop until the network produces NaN output values? If, for example, the bounding box loss goes up, there might be something wrong there.
The overall loss before the crash is:

```
7.164515495300293
4.680170059204102
4.274537086486816
5.886708736419678
4.5304131507873535
3.8330774307250977
7.964834690093994
3.0583243370056152
4.674091339111328
3.301724672317505
3.608858346939087
2.8133819103240967
2.8961634635925293
2.8605575561523438
3.1516401767730713
3.2725605964660645
2.9872331619262695
1.8360798358917236
2.4671669006347656
```
I logged this value: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/engine.py#L120
It looks pretty normal to me.
Our code logs loss values in the command line or via Visdom. You should use these outputs to debug your results. But yes, that overall loss doesn't look pathological.
Unfortunately, because the training crashes very early, the losses do not appear in Visdom.
You could set `vis_and_log_interval=1` to see results after each iteration.
Nice, that worked.
But yeah, it looks like the losses are not the issue. I have no idea what to do ;(
Maybe there is something to be seen in the Visdom logging of the network outputs? Have you tried deactivating all data augmentations?
Except that there are a lot of region proposals, it seems fine. The ground truths are well placed.
I just tried deactivating all augmentations and I had the same results. Maybe it comes from the augmentation that creates the "fake" previous frame.
I have the following questions: where does `query_embed` come from, and how is it different from `features`?

Edit: `query_embed` is updated in the `DETR` class, right?

Man, did you finally solve the problem? Eager to know how if so.
Hi,
Thanks for the great paper!
I'm trying to train on a custom dataset (of one sample, attached to this issue); the data is not temporal. I run the following command:
As I train on static data (not temporal), I don't use the 'tracking' option. Upon running the code, I hit an assert inside the `generalized_box_iou` function: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/util/box_ops.py#L51
In fact, the `boxes1` and `boxes2` are full of NaNs.
Following the COCO style mentioned in TRAIN.md, I followed the suggested structure; here is a sample of my train.json file:
After investigation, I noticed that the COCO annotation format uses box coordinates `[xtl, ytl, w, h]`, where:

- `xtl`: the top-left x
- `ytl`: the top-left y
- `w`: width
- `h`: height

Am I doing anything wrong?
Best, Axel
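On the format question above: COCO's `[xtl, ytl, w, h]` boxes are typically converted to normalized `[cx, cy, w, h]` before a DETR-style matcher sees them. A sketch of that conversion for sanity-checking annotations (`coco_to_cxcywh_norm` is a hypothetical helper, not the repo's own conversion code):

```python
def coco_to_cxcywh_norm(bbox, img_w, img_h):
    # Hypothetical helper: COCO [x_top_left, y_top_left, w, h] ->
    # normalized [cx, cy, w, h], the format DETR-style models consume.
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]
```

For example, a 30x40 box at (10, 20) in a 100x200 image maps to `[0.25, 0.2, 0.3, 0.2]`; values outside [0, 1] after this conversion indicate a box that leaves the image.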