The KITTI dataset was taken from the VTAB benchmark, and the task is to predict how far away the closest vehicle in the photo is. You can find the preprocessing script here, and we used the following four text prompts:
[
'a photo i took of a car on my left or right side.',
'a photo i took with a car nearby.',
'a photo i took with a car in the distance.',
'a photo i took with no car.',
]
Admittedly this is a bit vague, but we included KITTI since the VTAB benchmark is one of the standard benchmarks for visual tasks.
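For reference, here is a minimal sketch of how these prompts feed into zero-shot classification (assuming the openai/CLIP package; the checkpoint and image path are illustrative):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = [
    'a photo i took of a car on my left or right side.',
    'a photo i took with a car nearby.',
    'a photo i took with a car in the distance.',
    'a photo i took with no car.',
]
text = clip.tokenize(prompts).to(device)

# "frame.png" is a placeholder for a preprocessed KITTI image.
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    pred = logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print(prompts[pred])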
Thanks for pointing me to the VTAB script. From their script:

near_objects: tf.where(x["objects"]["location"][:, 2] < 25)
far_objects: tf.where(x["objects"]["location"][:, 2] >= 25)
object_on_left: tf.where(x["objects"]["location"][:, 0] < 0)
If there are multiple vehicles in one picture, it's possible that there is one car nearby, one car on the left, and maybe one car in the distance as well. Does that mean one image can have multiple labels in this dataset? If not, could you share the logic you used to assign these 4 text prompts? Thanks again!
Their implementation has six different tasks, three of which are counting objects (all/nearby/far/on-left), whereas the vehicle distance prediction uses the following thresholds:
import numpy as np
import tensorflow as tf

def _closest_vehicle_distance_pp(x):
  """Predict the distance to the closest vehicle."""
  # Location feature contains (x, y, z) in meters w.r.t. the camera.
  vehicles = tf.where(x["objects"]["type"] < 3)  # Car, Van, Truck.
  vehicle_z = tf.gather(params=x["objects"]["location"][:, 2], indices=vehicles)
  # Append a sentinel distance of 1000m so images with no vehicle get a value.
  vehicle_z = tf.concat([vehicle_z, tf.constant([[1000.0]])], axis=0)
  dist = tf.reduce_min(vehicle_z)
  # Results in a uniform distribution over three distances, plus one class for
  # "no vehicle".
  thrs = np.array([-100.0, 8.0, 20.0, 999.0])
  label = tf.reduce_max(tf.where((thrs - dist) < 0))
  return {"image": x["image"], "label": label}
as you can find in this function. Notice the difference between "vehicle" and "object".
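As a sanity check, here is the same thresholding in plain NumPy (a sketch; the distance values are made up):

import numpy as np

thrs = np.array([-100.0, 8.0, 20.0, 999.0])

def distance_to_label(dist):
    # Largest threshold index strictly below dist, mirroring the TF code above.
    return int(np.where((thrs - dist) < 0)[0].max())

# < 8m -> 0, 8-20m -> 1, 20-999m -> 2, no vehicle (sentinel 1000m) -> 3
for d in [5.0, 12.0, 60.0, 1000.0]:
    print(d, distance_to_label(d))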
So this specific task uses 4 mutually exclusive classes, as also described in Table 2 of the VTAB paper. The different train set size in this case is likely due to a different version of the dataset. The version we used from tfds had 6347 train and 423 validation images. Hope this helps!
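For completeness, here is a sketch of how to check the split sizes for whatever tfds version you have installed (assuming the tensorflow_datasets package; sizes vary across dataset versions):

import tensorflow_datasets as tfds

builder = tfds.builder("kitti")
builder.download_and_prepare()
for name, split in builder.info.splits.items():
    print(name, split.num_examples)  # e.g. 6347 train / 423 validation for our version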
Great to know the version of the dataset you are using. Thanks for the confirmation.
I had seen two issues with VTAB related to KITTI and scratched my head about what to do...
https://github.com/google-research/task_adaptation/issues/9 https://github.com/google-research/task_adaptation/issues/18
@pj-ms Were you able to replicate the results of CLIP on KITTI distance? When I try to do so given the above help, the numbers are quite off. Can you help with this? Maybe I am missing something. I need help with loading the dataset properly and using it to test CLIP zero-shot transfer.
I'm also trying to replicate the results and getting different numbers. I'm wondering if this is an issue with data augmentation, since the images in this dataset are quite wide. Was center square cropping used for this dataset?
@Akashcodes732 did you manage to match the numbers in the paper?
@Akashcodes732 @gabrielilharco We were able to reproduce the results for ResNet-50, ResNet-101, EfficientNet-B0, and ViT-B/32.
Except for the issues from task_adaptation, the key takeaway is: don't do a center crop for KITTI, as the images are very wide. Center cropping ruins everything.
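To make "no center crop" concrete, here is a sketch of the two preprocessing options (torchvision; CLIP's published pipeline resizes the short side and then center-crops):

from PIL import Image
from torchvision.transforms import CenterCrop, Compose, Resize

# Default CLIP-style preprocessing: crops away the left/right of wide frames.
with_crop = Compose([
    Resize(224, interpolation=Image.BICUBIC),  # short side -> 224
    CenterCrop(224),
])

# No-crop alternative: squash the whole frame to 224x224, keeping all cars visible.
no_crop = Compose([
    Resize((224, 224), interpolation=Image.BICUBIC),
])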
Thanks @pj-ms! What data augmentation did you end up using?
Except for no-crop, nothing different from any other dataset
Got it, thank you @pj-ms. I'm still getting 28.9% accuracy with ViT-B/32 (the paper reports 39.4%). I'm using this augmentation:
from PIL import Image
from torchvision.transforms import Compose, Normalize, Resize, ToTensor

def _convert_image_to_rgb(image):  # same helper as in CLIP's clip.py
    return image.convert("RGB")

Compose([
    Resize((224, 224), interpolation=Image.BICUBIC),
    _convert_image_to_rgb,
    ToTensor(),
    Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])
Is this what you used? And you matched the 39.4% number?
@gabrielilharco ah, sorry, I did not read it carefully. Somehow we were able to reproduce the linear-probing results, but not the zero-shot results. For ViT-B/32, we get 10% lower.
We stopped investigating it.
Got it, thanks for the info @pj-ms!
The KITTI dataset has a lot of annotated information, as I found here: TensorFlow KITTI. It's mentioned that the task is to recognize the distance to the nearest car. However, I'm unable to locate anything with the same number of classes (4) in Table 9. Could you please share more details about what ground truth you used in the linear probe?