timmeinhardt / trackformer

Implementation of "TrackFormer: Multi-Object Tracking with Transformers". [Conference on Computer Vision and Pattern Recognition (CVPR), 2022]
https://arxiv.org/abs/2101.02702
Apache License 2.0

Training Custom Dataset #6

Open ConsciousML opened 3 years ago

ConsciousML commented 3 years ago

Hi,

Thanks for the great paper!

I'm trying to train on a custom dataset (a single sample, attached to this issue); the data is not temporal. I run the following command:

python src/train.py with \
    deformable \
    mot17 \
    resume=models/crowdhuman_train_val_deformable/checkpoint.pth \
    output_dir=models/boxy_train_deformable \
    mot_path=my_custom_dataset \
    train_split=train \
    val_split=val \
    epochs=50 \

As I train on static (non-temporal) data, I don't use the 'tracking' option. Upon running the code I hit an assert inside the generalized_box_iou function: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/util/box_ops.py#L51

In fact, boxes1 and boxes2 are full of NaNs.
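
For reference, here is a minimal standalone sketch of why NaN boxes trip that assert (the conversion mirrors box_cxcywh_to_xyxy from the call stack, but this is not the repo's exact code):

import torch

def box_cxcywh_to_xyxy(boxes):
    # Convert [center_x, center_y, width, height] to [x_min, y_min, x_max, y_max].
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=-1)

valid = box_cxcywh_to_xyxy(torch.tensor([[0.5, 0.5, 0.2, 0.3]]))
print((valid[:, 2:] >= valid[:, :2]).all())   # tensor(True): the assert passes

nans = box_cxcywh_to_xyxy(torch.full((1, 4), float('nan')))
print((nans[:, 2:] >= nans[:, :2]).all())     # tensor(False): NaN comparisons are False, so the assert fires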

Following the COCO style mentioned in TRAIN.md, I used the suggested structure; here is a sample from my train.json file:

{
    "type": "instances",
    "images": [
        {
            "file_name": "601038faba553c0015ac7c00_00000.png",
            "id": 0
        }
    ],
    "annotations": [
        {
            "id": 0,
            "bbox": [
                464.0,
                0.0002879999999976235,
                241.000448,
                228.99974399999996
            ],
            "image_id": 0,
            "segmentation": [],
            "ignore": 0,
            "visibility": 1.0,
            "area": 55189.04089588531,
            "iscrowd": 0,
            "category_id": 0
        }
    ],
    "categories": [
        {
            "supercategory": "human",
            "name": "person",
            "id": 0
        }
    ]
}

After investigation, I noticed that the COCO annotation format uses box coordinates [xtl, ytl, w, h], where xtl and ytl are the top-left corner and w and h are the box width and height.
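
For what it's worth, this is roughly how I convert my corner-format boxes to that layout (the helper name is just illustrative, and the corner values below are back-computed from the sample above):

def corners_to_coco(x_min, y_min, x_max, y_max):
    # Convert [x_min, y_min, x_max, y_max] corners to COCO [x_top_left, y_top_left, width, height].
    return [x_min, y_min, x_max - x_min, y_max - y_min]

# Matches the bbox entry of the train.json sample above (up to float rounding):
print(corners_to_coco(464.0, 0.000288, 705.000448, 229.000032))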

Am I doing anything wrong?

Best, Axel

timmeinhardt commented 3 years ago

If you were able to reproduce our results by running the provided model file, this is something you have to debug on your side. I would start with the data loading code, see if your sample is actually getting loaded, and then work my way up the pipeline. If you run the Visdom server you should see your sample being visualized.
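
For example, something generic along these lines (the 'boxes' target key and the cxcywh layout are assumptions about what your loader yields, not necessarily our exact data structures):

import torch

def check_targets(data_loader):
    # Iterate the training loader and flag samples with empty, non-finite or degenerate boxes.
    for i, (samples, targets) in enumerate(data_loader):
        for target in targets:
            boxes = target['boxes']  # assumed key and normalized cxcywh layout
            if boxes.numel() == 0:
                print(f"sample {i}: no boxes")
            elif not torch.isfinite(boxes).all():
                print(f"sample {i}: non-finite box values")
            elif (boxes[:, 2:] <= 0).any():
                print(f"sample {i}: non-positive width/height")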

ConsciousML commented 3 years ago

OK, thanks for the response.

I see a command-line example for training on custom temporal data:

python src/train.py with \
    deformable \
    tracking \
    mot17 \
    full_res \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/custom_dataset_train_deformable \
    mot_path=data/custom_dataset \
    train_split=train \
    val_split=val \
    epochs=20 \

Quoting your paper:

Simulate MOT from single images. The encoder-decoder multi-level attention mechanism requires substantial amounts of training data. Hence, we follow a similar approach as in [56] and simulate MOT data from the CrowdHuman [43] person detection dataset. The adjacent training frames t−1 and t are generated by applying random spatial augmentations to a single image. To simulate high frame rates as in MOT17 [28], we only randomly resize and crop of up to 5% with respect to the original image size.
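
My rough understanding of that simulation, sketched with torchvision (this is not the repo's actual augmentation code, and a real version would also have to transform the box annotations accordingly):

import random
import torchvision.transforms.functional as F
from PIL import Image

def simulate_frame_pair(image: Image.Image, max_jitter: float = 0.05):
    # Create pseudo-adjacent frames (t-1, t) from one static image via small random resize/crop.
    def jittered_view(img):
        w, h = img.size
        crop_w = int(w * (1 - random.uniform(0, max_jitter)))  # crop away at most ~5 %
        crop_h = int(h * (1 - random.uniform(0, max_jitter)))
        left = random.randint(0, w - crop_w)
        top = random.randint(0, h - crop_h)
        return F.resize(F.crop(img, top, left, crop_h, crop_w), [h, w])

    return jittered_view(image), jittered_view(image)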

What I want to do is exactly the same thing as training on CrowdHuman, but with my own data. Can you confirm that I'm using the right command-line options, please?

Also, is [xtl, ytl, w, h] the right format, as I described earlier?

I've been debugging for two days and would greatly appreciate your help ;)

Thanks!

Edit: I made sure that the COCO annotation file is well formed and checked that every box lies inside the image. Unfortunately I can't see anything using Visdom.
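
This is roughly the check I ran over the annotation file (a standalone sketch; the field names follow the COCO layout of my train.json above):

import json
from PIL import Image

def validate_coco_boxes(json_path, image_dir):
    # Check that every COCO bbox has positive size and lies fully inside its image.
    with open(json_path) as f:
        coco = json.load(f)
    images = {img['id']: img for img in coco['images']}
    for ann in coco['annotations']:
        x, y, w, h = ann['bbox']
        file_name = images[ann['image_id']]['file_name']
        img_w, img_h = Image.open(f"{image_dir}/{file_name}").size
        assert w > 0 and h > 0, f"degenerate box in annotation {ann['id']}"
        assert x >= 0 and y >= 0 and x + w <= img_w and y + h <= img_h, \
            f"box outside image in annotation {ann['id']}"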

timmeinhardt commented 3 years ago

Since you are loading the mot17 config, your command is correct if you want to train on a multi-object tracking dataset. But since this is not the case for you, you need to change a few things. For example:

python src/train.py with \
    deformable \
    tracking \
    crowdhuman \
    full_res \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/custom_dataset_train_deformable \
    crowdhuman_path=data/custom_dataset \
    train_split=train \
    val_split=val \
    epochs=20 \

Removing the tracking option was wrong since you do want to do tracking. You just don't want to do it on a tracking (MOT) dataset but on a static dataset (like CrowdHuman), which then simulates tracking frame pairs.

What do you mean by "can't see anything" in Visdom?

ConsciousML commented 3 years ago

After trying this, I faced an assert here: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/crowdhuman.py#L12

I fixed it by changing train_split=train to crowdhuman_train_split=train.

Then I noticed that by leaving mot_path at the default data/MOT17, it was searching for the MOT17 data here: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/mot.py#L161

I guess I also had to set mot_path=data/custom_dataset (as I don't want to train on MOT17), or set train_split to None? https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/mot.py#L190

I now get a CUDA out-of-memory error. I will try on a bigger GPU.

Thanks again! You were very helpful.

ConsciousML commented 3 years ago

I got rid of the full_res option and I was able to perform 50 iterations.

python src/train.py with \
    deformable \
    tracking \
    crowdhuman \
    resume=models/mot17_train_deformable_private/checkpoint.pth \
    output_dir=models/boxy_train_deformable \
    crowdhuman_path=/media/bigbro/Data/Dataset/Test/SmallCOCO \
    crowdhuman_train_split=train \
    mot_path=/media/bigbro/Data/Dataset/Test/SmallCOCO \
    train_split=None \
    val_split=val \
    epochs=50 \

But then I faced the same error here: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/util/box_ops.py#L51

I guess it is my input data. After converting to the COCO format, I cropped every bounding box out of the image without error. I don't know what else to try. Do you have any clue how this can happen and how I can debug it?

ConsciousML commented 3 years ago

By the way, is it possible to perform the validation on static data (a CrowdHuman-like dataset)?

timmeinhardt commented 3 years ago

Yes, it is possible to perform validation on static data, but not for tracking metrics.

Try to use the Visdom visualization to see if your data is formatted correctly and to get an overview of your losses etc. Besides that, it is hard to say where and why the error occurs if you don't provide the full stack trace. Does the assert error happen during validation, tracking evaluation or somewhere else?

ConsciousML commented 3 years ago

This is the stack trace:

Traceback (most recent call last):
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/bigbro/.vscode/extensions/ms-python.python-2021.5.842923320/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "src/train.py", line 336, in <module>
    train(args)
  File "src/train.py", line 263, in train
    visualizers['train'], args)
  File "src/trackformer/engine.py", line 106, in train_one_epoch
    outputs, targets, *_ = model(samples, targets)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "src/trackformer/models/detr_tracking.py", line 45, in forward
    prev_indices = self._matcher(prev_outputs_without_aux, prev_targets)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bigbro/anaconda3/envs/trackformer/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "src/trackformer/models/matcher.py", line 94, in forward
    box_cxcywh_to_xyxy(tgt_bbox))
  File "src/trackformer/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

I will try to replicate the training with CrowdHuman and MOT17 to see if I notice a difference with my ground truth.

timmeinhardt commented 3 years ago

Does this error occur immediately or after a few iterations? The former would be very strange indeed. I think this is related to your input data. You should debug the data flow and check if everything works as when training on CrowdHuman. For example, is the code after this if statement executed properly:

https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/coco.py#L64

ConsciousML commented 3 years ago

Unfortunately, it occurs after a few iterations. OK, I will try to debug the code in this file.

I discovered that in Visdom my training images seem cut off. Apparently this is not the case when I train with CrowdHuman: visdom.pdf

I'm trying to train on top-view images, and it seems that the image augmentation produces a weird result.

Edit: It is related to my input data, because training works out of the box with CrowdHuman and MOT17. But I can't see any difference in the boxes; I made a script to visualize them, and it crops CrowdHuman and my own data in the same fashion.

The only difference is that I didn't restrict the data to at least 2 annotations per frame. Do you think this could have an impact?

ConsciousML commented 3 years ago

You should debug the data flow and check if everything works as when training on CrowdHuman. For example, is the code after this if statement executed properly:

https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/datasets/coco.py#L64

I debugged and didn't notice anything weird. My guess is that since RandomSizeCrop is more aggressive on my data, at some point it might crop out the detections?

I will check without the RandomSizeCrop.

Edit: Same error without the crop.

timmeinhardt commented 3 years ago

If it works for our CrowdHuman/MOT data, you have to figure out what the difference to your custom dataset is. You said the boxes in assert (boxes1[:, 2:] >= boxes1[:, :2]).all() contain NaN values. This sounds like a problem that comes from training on your data, not directly from the data itself. Does the loss for a particular iteration go up a lot? What happens if you run a training and set the learning rates to zero? Does the error still occur?

ConsciousML commented 3 years ago

I noticed that when I hit the assert, out_bbox as well as out_probs contain NaNs. I suppose these are the predictions of the model. https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/matcher.py#L71

Same thing when I set the lr to 0.

I will go up the call stack to see where the NaNs first appear.
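
In case it helps anyone else, here is the generic PyTorch trick I'm using to find the first module whose output contains NaNs (nothing TrackFormer-specific):

import torch

def register_nan_hooks(model):
    # Attach forward hooks that report modules producing NaN outputs.
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and torch.isnan(out).any():
                    print(f"NaN in output of module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

torch.autograd.set_detect_anomaly(True) is another option, but it slows training down considerably.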

ConsciousML commented 3 years ago

I noticed that the NaNs first appear in the query_embed variable: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/deformable_detr.py#L167

Do you have any clue where this is updated?

timmeinhardt commented 3 years ago

But you said NaNs appear even if you set all the learning rates to zero? This would contradict the query embeddings becoming NaN. Double-check whether this still happens if you set all rates to zero or comment out the optimizer step.
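
For example, a quick generic check right before the optimizer step (a sketch, not part of our training loop):

import torch

def grads_are_finite(model):
    # Return False if any parameter gradient contains NaN or Inf values.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in parameter: {name}")
            return False
    return True

# Placement in the training loop (pseudo-code):
# loss.backward()
# if grads_are_finite(model):
#     optimizer.step()
# optimizer.zero_grad()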

ConsciousML commented 3 years ago

You were right! I might have forgotten to set one of the learning rates to 0. When I comment out the optimizer step, it successfully runs a whole epoch.

ConsciousML commented 3 years ago

Hi @timmeinhardt,

I'm still stuck and can't fix the problem. Any idea what to do?

timmeinhardt commented 3 years ago

How do the loss values develop before the network produces NaN output values? If, for example, the bounding box loss goes up, there might be something wrong there.

ConsciousML commented 3 years ago

The overall loss before the crash is:

7.164515495300293
4.680170059204102
4.274537086486816
5.886708736419678
4.5304131507873535
3.8330774307250977
7.964834690093994
3.0583243370056152
4.674091339111328
3.301724672317505
3.608858346939087
2.8133819103240967
2.8961634635925293
2.8605575561523438
3.1516401767730713
3.2725605964660645
2.9872331619262695
1.8360798358917236
2.4671669006347656

I logged this value: https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/engine.py#L120

It looks pretty normal to me.

timmeinhardt commented 3 years ago

Our code logs loss values in the command line or via Visdom. You should use these outputs to debug your results. But yes, that overall loss doesn't look pathological.

ConsciousML commented 3 years ago

Unfortunately, because the training crashes very early, the losses do not appear in Visdom.

timmeinhardt commented 3 years ago

You could set vis_and_log_interval=1 to see results after each iteration.

ConsciousML commented 3 years ago

Nice, that worked.

[image: newplot]

But yeah, it looks like the losses are not the issue. I have no idea what to do ;(

timmeinhardt commented 3 years ago

Maybe there is something to be seen in the Visdom logging of the network outputs? Have you tried deactivating all data augmentations?

ConsciousML commented 3 years ago

Except that there are a lot of region proposals, it seems fine. The ground truths are well placed.

I just tried deactivating all augmentations and got the same results. Maybe it comes from the augmentation that creates the "fake" previous frame.

I have the following questions:

  1. Do you know how I can train without generating the previous frame, if that is possible?
  2. Where does query_embed come from, and how is it different from features?

https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/deformable_detr.py#L167

https://github.com/timmeinhardt/trackformer/blob/686fe60858e4f61497a4544b95f94487614d0741/src/trackformer/models/deformable_detr.py#L135

Edit:

  1. It looks like query_embed is updated in the DETR class, right?
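
For context, my current understanding from the DETR family of models (a simplified sketch, not TrackFormer's exact definition; the dimensions are illustrative): query_embed is a learned embedding table that does not depend on the input image, while features are produced per image by the CNN backbone.

from torch import nn

# Learned object queries: a plain parameter table created once in __init__ and
# updated by the optimizer like any other weight (which is why NaNs here with a
# zero learning rate and no optimizer step would be surprising).
num_queries, hidden_dim = 300, 256   # illustrative sizes
query_embed = nn.Embedding(num_queries, hidden_dim)

# By contrast, `features` are image-dependent activations from the backbone:
# features = backbone(images)
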
EMU1337X commented 2 years ago

Man, did you finally solve the problem? Eager to know how, if so.