nianticlabs / manydepth

[CVPR 2021] Self-supervised depth estimation from short sequences

Training with Lyft #13

Closed didriksg closed 3 years ago

didriksg commented 3 years ago

Hi, I have some questions regarding training with a custom dataset.

(I noticed that my issue became a bit lengthy, so here's a TL;DR):

  1. Can I use images pointing in more than one direction to increase the number of samples in the dataset?
  2. Do I need to modify the intrinsic matrix when cropping the images?
  3. Can I use images from different cameras, with different dimensions, but pointing in the same direction?

More in-depth questions

I'm trying to use the data from the Lyft dataset. It contains images from multiple cameras, all pointing in different directions. I've mainly used the front-facing camera, but I'm not sure how good the result actually is. I've attached some samples of the original data and its corresponding disparity images:

Original image: [image]

Disp_mono: [image]

Disp_multi: [image]

Training stats after 42k batches: [images]

As you can see, the model has clearly learned the most important principles, but I still feel that these disparity images are not as good as those produced by training with the KITTI dataset.

The total number of images from the front-facing camera in the dataset is ~17,000. I guess the model would benefit from more data, which leads me to my questions:

Do you think it would be possible to use data from cameras pointing in other directions at the same time as data from the front-facing camera? I'm a bit concerned about how this will affect the pose network, as the cameras move differently relative to each other. The Lyft vehicles are equipped with cameras in the following setup:

[image: camera layout]

Another possibility that I might try is to use the backward-facing camera. Using it in reverse temporal order would simulate the car moving forward (although with different views than the forward-facing ones).
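Something like the following sketch is what I have in mind; the helper and the 'CAM_BACK' label are just placeholders for illustration, not actual ManyDepth or Lyft loader code:

# Hypothetical helper: build the per-scene frame list, reversing the order
# for the backward-facing camera so it looks like a forward-moving sequence.
def build_scene_entries(scene_id, frame_ids, camera):
    frame_ids = list(frame_ids)
    if camera == 'CAM_BACK':
        # Reversed order makes the backward camera behave like forward motion
        # as far as the pose network is concerned.
        frame_ids.reverse()
    return [(scene_id, i, camera) for i in frame_ids]

# Frames 0..4 of a backward-camera scene become 4, 3, 2, 1, 0.
entries = build_scene_entries('scene_0001', range(5), 'CAM_BACK')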

I have also tried to crop the images a bit, as the original images contain the lower part of the vehicle. In doing so, I have also changed the cx and cy parameters in the intrinsic matrix (I used Berkeley Automation's perception library: https://berkeleyautomation.github.io/perception/api/camera_intrinsics.html), but I'm not quite sure if I should change the intrinsics at all. I've done it like this:

# Imports assumed by this snippet (CameraIntrinsics comes from the Berkeley
# Automation perception package linked above).
import pathlib

import numpy as np
from perception import CameraIntrinsics

# This is defined in __init__()
self.crop_value = (4, 200, 4, 216)

# The intrinsic matrix is different for each vehicle, so each sequence contains the associated vehicle's intrinsic.
path = pathlib.Path(self.data_path + folder).parent
K = np.fromfile(f'{path}/CAM_FRONT_k_matrix.npy')
K = K.reshape(3, 3)

fx = K[0, 0] 
cx = K[0, 2]
fy = K[1, 1]
cy = K[1, 2]

# Initialize the camera intrinsic params.
cam_intrinsics = CameraIntrinsics(
    fx=fx,
    fy=fy,
    cx=cx,
    cy=cy,
    width=self.full_res_shape[0],
    height=self.full_res_shape[1],
)

# Calculate the new dimensions and center points.
cropped_width = self.full_res_shape[0] - self.crop_value[2] - self.crop_value[0]
cropped_height = self.full_res_shape[1] - self.crop_value[3] - self.crop_value[1]

# The center points are the original center points + (0.5 * the number of cropped pixels on the bottom) - (0.5 * the number of pixels cropped on the top)
crop_cj = (self.full_res_shape[0] - self.crop_value[2] + self.crop_value[0]) // 2
crop_ci = (self.full_res_shape[1] - self.crop_value[3] + self.crop_value[1]) // 2

# Generate the new cropped intrinsics.
cropped_intrinsics = cam_intrinsics.crop(
    height=cropped_height,
    width=cropped_width,
    crop_ci=crop_ci,
    crop_cj=crop_cj,
)

# Create the 4x4 version.
intrinsics = np.array([[cropped_intrinsics.fx, 0, cropped_intrinsics.cx, 0],
                       [0, cropped_intrinsics.fy, cropped_intrinsics.cy, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]]).astype(np.float32)

# Normalize fx and fy by the original dimensions and cx, cy by the cropped dimensions.
intrinsics[0, 0] /= self.full_res_shape[0]
intrinsics[1, 1] /= self.full_res_shape[1]
intrinsics[0, 2] /= cropped_width
intrinsics[1, 2] /= cropped_height

I have also noticed that some of the sequences in the Lyft dataset contain images with different dimensions: some of the images are 1224x1024 and some are 1920x1080. As long as I normalize the intrinsic matrix with the corresponding image dimensions, do you think there would be any problems with using these images simultaneously? One possibility might be to crop both image types to the same format, if that is possible (as per my other question).
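For reference, this is roughly how I picture the normalization step – just a sketch with a made-up helper name, dividing each row of K by the dimensions of the image it was calibrated for, similar to how the KITTI loader stores its normalized intrinsics:

import numpy as np

def normalized_intrinsics(K, image_width, image_height):
    # Normalize a 3x3 K by the dimensions of the image it belongs to, so
    # sequences with different resolutions share a single convention.
    K4 = np.eye(4, dtype=np.float32)
    K4[:3, :3] = K
    K4[0, :] /= image_width   # fx and cx as fractions of the width
    K4[1, :] /= image_height  # fy and cy as fractions of the height
    return K4

# e.g. normalized_intrinsics(K_a, 1224, 1024) for one sequence and
# normalized_intrinsics(K_b, 1920, 1080) for another can then both be
# rescaled to the training resolution by the dataloader.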

mdfirman commented 3 years ago

Interesting ideas! Thanks for sharing these.

I'm surprised that the disp_multi output is so blurry. I would guess that there might be a bug somewhere in how your multiple views are being preprocessed or used. It might be worth carefully checking the TensorBoard images (e.g. checking the cost volume minimums) to verify that sensible things are being done when it comes to intrinsics, extrinsics, etc.

Using multiple views

To start with, I would say that the idea of using the backward camera (in reverse) seems very sensible – a good idea! I would avoid using the side cameras though, at least to start with; as you point out, the pose network is going to have a hard job there, and those cameras are going to be seeing quite different things to what the front and back cameras are seeing.

Cropping with intrinsics

I agree that you do need to change the intrinsics when you crop – but I'm not sure I quite follow the logic you're using here:

# The center points are the original center points + (0.5 * the number of cropped pixels on the bottom) - (0.5 * the number of pixels cropped on the top)

I'm also not sure of all the conventions used in the Berkeley library.

Overall – when you crop an image like so:

cropped_image = uncropped_image[crop_top:(crop_top + crop_height), crop_left:(crop_left + crop_width)]

my understanding is that the principal point changes as follows:

cropped_cx = uncropped_cx - crop_left
cropped_cy = uncropped_cy - crop_top

and the focal lengths don't change at all. Perhaps you could check that this is happening in your code?
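In code, that update would look something like this – a plain-NumPy sketch with a made-up helper name, rather than the Berkeley library:

import numpy as np

def crop_image_and_intrinsics(image, K, crop_top, crop_left, crop_height, crop_width):
    # Crop an HxWxC image and shift the principal point of the 3x3 K accordingly.
    cropped = image[crop_top:crop_top + crop_height,
                    crop_left:crop_left + crop_width]
    K_cropped = K.copy()
    K_cropped[0, 2] -= crop_left  # cx shifts by the left crop
    K_cropped[1, 2] -= crop_top   # cy shifts by the top crop
    # fx and fy are unchanged by a crop (only resizing changes them).
    return cropped, K_cropped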

Different image sizes

In theory – yes! But in practice this introduces a lot of potential for hard-to-find bugs in intrinsics and extrinsics especially. So make sure a simple version (e.g. where you only use sequences with one single image size) works well first!

didriksg commented 3 years ago

Finished updating

Hi! I have been able to test out some of the ideas I mentioned. This has resulted in some interesting results.

Using multiple views

I have tested a bit using the backward-facing camera simultaneously with the front-facing camera. I reversed the temporal order for the backward-facing camera and treated the images as a separate "scene," consisting of between 100 and 120 frames. I have provided a GIF displaying the images in a scene:

[GIF: images from one scene]

Here, I have cropped the images so that the bottom part of the vehicle is not showing, which removes ~300 px, and I have cropped the top part by the same amount. The total number of samples in my training set is now ~35,000.

When training with this dataset, I now get some interesting-looking disparity images:

Loss values: [images]

Back-facing camera:
color_0: [image]
color_pred_1: [image]
color_pred_-1: [image]
consistency_mask: [image]
disp_mono: [image]
disp_multi: [image]

Front-facing camera:
color_0: [image]
color_pred_1: [image]
color_pred_-1: [image]
consistency_mask: [image]
disp_mono: [image]
disp_multi: [image]

These results look somewhat like the results mentioned in your paper about moving objects when using the baseline model without the consistency loss. However, from my understanding, the disparity maps generated from a single image looked OK in that case.

Cropping

I noticed that my earlier train of thought might have been a bit vague. I'm unsure whether the principal point should represent coordinates in the original image or in the cropped one. E.g., say I have a 1200x1000 image with a cx of 600 and a cy of 500, and I crop it by 300 pixels on the top and bottom, resulting in a 1200x400 image. Should my cx and cy still represent coordinates in the 1200x1000 image, leaving the values the same, or should they represent coordinates in the new image, keeping cx at 600 but moving cy to 200?
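To make that concrete with the numbers above and the cropped_cy = cy - crop_top rule from your reply (just a worked check, not loader code):

# 1200x1000 image, cx = 600, cy = 500, 300 px cropped from top and bottom.
cx, cy = 600.0, 500.0
crop_top, crop_left = 300, 0

# Convention (a): keep the principal point in original-image coordinates.
orig_cx, orig_cy = cx, cy            # 600, 500

# Convention (b): express it in the cropped 1200x400 image's coordinates.
cropped_cx = cx - crop_left          # 600 (x is untouched)
cropped_cy = cy - crop_top           # 200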

I will try to change my intrinsics calculation to match your suggestions, and retrain the network:

    def load_intrinsics(self, folder, frame_index):
        path = pathlib.Path(self.data_path + folder).parent
        cam_name = folder.split('/')[-1]

        K = np.fromfile(f'{path}/{cam_name}_k_matrix.npy')
        K = K.reshape(3, 3)

        fx = K[0, 0]
        cx = K[0, 2]
        fy = K[1, 1]
        cy = K[1, 2]

        cropped_height = self.full_res_shape[1] - self.crop_value[3] - self.crop_value[1]
        cropped_width = self.full_res_shape[0] - self.crop_value[2] - self.crop_value[0]

        # Shift the principal point for the vertical crop, following the
        # cropped_cy = cy - crop_top rule suggested above; fx and fy are left unchanged.
        intrinsics = np.array([[fx, 0, cx, 0],
                               [0, fy, cy - self.crop_value[3], 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]]).astype(np.float32)

        intrinsics[0, 0] /= self.full_res_shape[0]
        intrinsics[1, 1] /= self.full_res_shape[1]
        intrinsics[0, 2] /= cropped_width
        intrinsics[1, 2] /= cropped_height

        return intrinsics

Btw, these are the training parameters used:

mdfirman commented 3 years ago

Nice! Yes, the disp_multi results you have here look much sharper than the ones you posted before. Did you change anything else?

didriksg commented 3 years ago

The results are from training with a larger dataset (with the added backward-facing camera images) and from changing the intrinsic matrix in line with my first comments. No changes other than those.

mdfirman commented 3 years ago

Super – thanks for reporting back on this. Very interesting results.

I hope that disp_multi now gives results on a par with (or ideally better than) disp_mono.

didriksg commented 3 years ago

Do you have any thoughts on the cars (moving objects) that are predicted as being far away? I have noticed this behavior both when using the front camera only and when using the backward/forward cameras. I also noticed it on another dataset I am training on (DDAD from Toyota Research Institute), even when only using the front-facing camera.

mdfirman commented 3 years ago

Yes – this 'hole punching' behaviour is pretty common when training on monocular videos with moving objects.

This is discussed in some detail in the monodepth2 paper ('Auto-Masking Stationary Pixels' section), and, to a lesser extent, the ManyDepth paper.

Automasking in monodepth2 helps a little with these, but doesn't solve them completely. You might want to look at some more recent works, e.g. [1], if they are causing you significant bother. (Or perhaps consider a more hacky solution, e.g. using semantics with some heuristics.)

[1] Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, and Anelia Angelova. Unsupervised monocular depth learning in dynamic scenes. In CoRL, 2020
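For reference, the automasking mentioned above boils down to something like this rough PyTorch sketch of the idea from the monodepth2 paper (not the exact repository code):

import torch

def automask_min_loss(reprojection_losses, identity_reprojection_losses):
    # Both inputs: [B, num_source_views, H, W] per-pixel photometric losses.
    # Tiny noise breaks ties so flat regions don't always pick the identity loss.
    identity = identity_reprojection_losses + \
        torch.randn_like(identity_reprojection_losses) * 1e-5
    combined = torch.cat([identity, reprojection_losses], dim=1)
    min_loss, idxs = torch.min(combined, dim=1)
    # Pixels whose minimum came from a warped (non-identity) view are kept;
    # pixels that the un-warped source already explains (e.g. objects moving
    # at the camera's speed) are masked out of the loss.
    mask = (idxs >= identity.shape[1]).float()
    return min_loss, mask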

didriksg commented 3 years ago

It's not really a big problem at the moment, but it is certainly something I'll look into improving if possible! Thank you so much for your help and input! I really appreciate you taking the time to write such detailed answers! :)