libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Object detection (DETR/DefDETR) support and varying-size float tensors per image #70

Closed VovaTch closed 2 years ago

VovaTch commented 2 years ago

Hello, I couldn't find anything regarding COCO or similar datasets... I have a dataset with a varying number of bounding box detections per image, and I want to use FFCV to speed up my training (transformer-based nets are slow to train...). Is there a way to include a field with tensors/ndarrays of varying length without resorting to JSON workarounds or messing with additional padding?

Also, on a related note, do you have options for bounding box augmentations, such as bbaug?

GuillaumeLeclerc commented 2 years ago

I assume that at each training step you will need all the bounding boxes. Given that bounding boxes are small objects, I don't think it is worth resorting to variable-size storage methods (which are slower than constant-size ones). My first intuition would be to use a fixed array of dimension (K, 4), where K is the max number of bounding boxes, and keep the actual number of bounding boxes used in a particular sample in a separate IntField.
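A minimal sketch of that fixed-size approach, in pure NumPy (the MAX_BOXES value is a placeholder for your dataset's max box count; the padded array would go into the array field and the count into the separate int field):

```python
import numpy as np

MAX_BOXES = 300  # placeholder: K, the max number of boxes per image

def pad_boxes(boxes: np.ndarray, max_boxes: int = MAX_BOXES):
    """Pad an (n, 4) box array to (max_boxes, 4) and return the true count."""
    n = boxes.shape[0]
    padded = np.zeros((max_boxes, 4), dtype=np.float32)
    padded[:n] = boxes
    return padded, n  # store `padded` in the array field, `n` in the int field

boxes = np.array([[10, 20, 30, 40], [5, 5, 15, 15]], dtype=np.float32)
padded, n = pad_boxes(boxes)  # padded.shape == (300, 4), n == 2
```

At load time you would slice the real boxes back out with the stored count: `boxes = padded[:n]`.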

However, it is always best to store something as close as possible to what the network will be ingesting. Can you clarify exactly how the model uses the bounding boxes?

VovaTch commented 2 years ago

So let me be clear on what my model gets and spits out. My model takes an RGB image (clear enough how to do that from the documentation), and outputs Nx4 bounding box coordinates, NxC class logits, and an NxN matrix with values between 0 and 1. The targets are K << N bounding boxes, N object classes with most of them being "empty", and a KxK matrix, from which I use np.ix_ to extract the relevant values. Like in DETR, the K targets are matched to N bounding boxes via Hungarian matching. K is constant (300 for DefDETR), and N changes on a per-image basis.
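For readers unfamiliar with np.ix_, here is a minimal illustration of that KxK extraction step (the sizes and matched indices are made up for the example):

```python
import numpy as np

N, K = 5, 2  # hypothetical: N predictions, K matched targets
similarity = np.arange(N * N, dtype=np.float32).reshape(N, N)  # an NxN score matrix
matched = np.array([1, 3])  # prediction indices paired to the K targets

# np.ix_ builds an open mesh, selecting the KxK block of matched rows/columns
sub = similarity[np.ix_(matched, matched)]  # sub.shape == (2, 2)
```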

My vanilla loader gives me the RGB image and, separately, all the targets as a dictionary. I want to do something similar with FFCV, because training the DETR backbone is slow. The most straightforward approach is to pad the targets to match the output size; I wanted to avoid that for now, but apparently I don't have much choice (?).

GuillaumeLeclerc commented 2 years ago

I personally strongly recommend doing the padding. Constant size is usually the most performant solution (there is a reason the network itself uses a constant size).

However, if you really want to avoid that, you can use a variable-length ByteArray field and then just use view to convert it back to float. If there is a real reason for not using padding, I could find a way to make it more straightforward.
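A minimal sketch of that bytes round trip (only the NumPy view mechanics are shown; how the bytes field is declared and written is assumed):

```python
import numpy as np

def boxes_to_bytes(boxes: np.ndarray) -> np.ndarray:
    # Reinterpret an (n, 4) float32 box array as a flat uint8 byte buffer,
    # suitable for a variable-length bytes field.
    return np.ascontiguousarray(boxes, dtype=np.float32).view(np.uint8).reshape(-1)

def bytes_to_boxes(buf: np.ndarray, ncols: int = 4) -> np.ndarray:
    # Reverse the view: raw bytes back to an (n, ncols) float32 array.
    return buf.view(np.float32).reshape(-1, ncols)

boxes = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], dtype=np.float32)
restored = bytes_to_boxes(boxes_to_bytes(boxes))  # identical to `boxes`
```

Because both directions are views, no float values are copied or converted; only the dtype interpretation changes.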

VovaTch commented 2 years ago

Ok, I'll try padding.

To my other point, I do think adding support for bounding box augmentations can be highly beneficial, e.g. mirror-flipping the bounding box coordinates along with the image. Otherwise, I think the issue can be closed.

firsakov commented 2 years ago

@VovaTch Hi! How did you manage to handle the different shapes of the bbox arrays? I set the label as an NDArray field of size (max number of bboxes, 5) (class + 4 coordinates) and I'm getting this type of error:

ValueError: could not broadcast input array from shape (40,) into shape (300,)

VovaTch commented 2 years ago

Try padding with zeros; it should get the job done. In the end I didn't use FFCV for the project because it doesn't support the type of bounding box data augmentation that I need.
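The broadcast error occurs because each sample is copied into a fixed-size buffer with no broadcasting, so every label array must already have the declared field shape. A minimal zero-padding sketch, assuming the declared shape flattens to 300 values, e.g. (60, 5), while a sample has 8 boxes (8 * 5 = 40 values, matching the error message):

```python
import numpy as np

MAX_BOXES = 60  # assumption: field declared as (60, 5), i.e. 300 values flat

def pad_labels(labels: np.ndarray, max_boxes: int = MAX_BOXES) -> np.ndarray:
    """Zero-pad an (n, 5) label array to exactly (max_boxes, 5)."""
    out = np.zeros((max_boxes, 5), dtype=np.float32)
    out[: labels.shape[0]] = labels
    return out

labels = np.ones((8, 5), dtype=np.float32)  # 8 boxes -> 40 values, as in the error
padded = pad_labels(labels)                 # (60, 5) -> 300 values, matches the field
```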

FrancescoSaverioZuppichini commented 1 year ago

I have the same issue with the shape.