jozhang97 / DETA

Detection Transformers with Assignment
Apache License 2.0
241 stars 20 forks

Sometimes fails to meet pre_nms_topk with only two classes #23

Open td-anne opened 1 year ago

td-anne commented 1 year ago

I am running DETA on a dataset with only one real class (and one N/A class; in particular, various tensors are n by 2). In some long runs, training fails with `RuntimeError: selected index k out of range` at the line below:

https://github.com/jozhang97/DETA/blob/985fa0b7afbbd86db6f907ff3a855828947ff631/models/deformable_transformer.py#L188

If I understand correctly, this should only fail if the k requested from `topk` (here `pre_nms_topk`, which is 1000) exceeds the number of available elements; specifically, I believe this can only happen if the length of `lvl_mask` is less than 1000. (Perhaps my data augmentation has produced an unreasonably tiny image? I thought they were all rescaled.) I don't fully understand where in the code we are when this occurs, but would it be harmful to trim the k supplied to `topk` down to the available length?
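For what it's worth, the trimming I have in mind would look something like this (`safe_topk` is a hypothetical helper, not something in the DETA codebase):

```python
import torch

def safe_topk(scores: torch.Tensor, k: int, dim: int = -1):
    """Clamp k to the number of available elements along `dim` so that
    torch.topk never raises 'selected index k out of range'."""
    k = min(k, scores.shape[dim])
    return torch.topk(scores, k, dim=dim)

# Example: only 188 candidate locations available, but pre_nms_topk = 1000.
scores = torch.randn(188)
values, indices = safe_topk(scores, 1000)
print(values.shape)  # torch.Size([188])
```

One caveat: any downstream code that assumes exactly `pre_nms_topk` results would then see a shorter tensor, so it may need padding or shape-agnostic handling.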

td-anne commented 1 year ago

In fact I think I may know what has happened. First, I have set the input image rescaling to at most 800 for the longest side (1333 overflows my GPU RAM when images need to be padded out to 1333x1333). Second, my image augmentation (using `albumentations.BBoxSafeRandomCrop`) may, rarely, produce one-pixel-wide images. If these are rescaled to produce 800x1 images, then there aren't more than 800 values in `lvl_mask`. Does this sound plausible?
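A back-of-envelope check supports this. Assuming the usual deformable-DETR feature strides of 8/16/32/64 (an assumption about this config) and ceil-division downsampling, an 800x1 image yields far fewer than 1000 candidate locations across all levels:

```python
import math

def num_locations(h, w, strides=(8, 16, 32, 64)):
    """Total multi-level feature locations for an h x w image,
    assuming ceil-division downsampling at each stride."""
    return sum(math.ceil(h / s) * math.ceil(w / s) for s in strides)

print(num_locations(800, 1))    # 100 + 50 + 25 + 13 = 188, well under 1000
print(num_locations(800, 800))  # comfortably above 1000
```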

jozhang97 commented 1 year ago

Yes, if you have fewer classes, it makes sense to have fewer predictions. It should be fine to change the class-agnostic topk. We tried a couple of values and did not find much of a difference.

Your 800x1 images could also be a problem, though there may be more proposals than you expect, since we use multi-level features.

You can also try gradient checkpointing to avoid GPU OOM.
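In case it helps, here is a minimal PyTorch sketch of activation checkpointing (the `Block` module is made up for illustration; whether DETA exposes a flag for this is a separate question). Checkpointed activations are recomputed during backward instead of being stored, trading compute for memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy stand-in for an expensive transformer sub-module."""
    def __init__(self, dim=256):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

block = Block()
x = torch.randn(4, 256, requires_grad=True)
# Activations inside `block` are recomputed in backward, not stored.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```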

td-anne commented 12 months ago

The 800x1 images are obviously not of any use, so I don't care what values get returned as long as the run doesn't crash. The checkpointing is interesting, though: could the model cope with 1920x1080 images, or does that require changing the structure somewhat? My raw inputs are all 1920x1080 and I'm looking for broken wires, which might disappear when downscaled. For the moment I'm more interested in accuracy than speed.

jozhang97 commented 12 months ago

I see, that makes sense for high resolution. We typically use larger images during pre-training, so I don't think 1920x1080 should be a problem.