michael-fonder / M4Depth

Official implementation of the network presented in the paper "Parallax Inference for Robust Temporal Monocular Depth Estimation in Unstructured Environments"
GNU Affero General Public License v3.0

Custom data #5

Closed · dimaxano closed this 3 years ago

dimaxano commented 3 years ago

Hi, @michael-fonder!

I'm trying to run an evaluation on custom data. Can you please explain the purpose of the multiplication by 4 in this line and later on?

shawnrosofsky commented 3 years ago

Hi @dimaxano,

I'm trying to do this as well.

I believe this multiplication by 4 is due to the fact that the Mid-Air dataset samples the ground-truth position and attitude 4 times faster than it captures images (100 Hz for the ground truth vs. 25 Hz for the images). Without this multiplication, the images and the ground truth would not correspond to the same timestamps. See here for details about the Mid-Air dataset.
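A minimal sketch of the resulting index arithmetic (assuming the 100 Hz / 25 Hz rates above; the variable names are illustrative only):

# Mid-Air logs ground truth at 100 Hz but images at 25 Hz, so the
# ground-truth sample aligned with image i sits at index 4 * i.
image_index = 10
gt_index = 4 * image_index  # position/attitude sample matching image 10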

For your custom data, you probably don't have to worry about this.

dimaxano commented 3 years ago

Oh, yeah, that makes sense, thank you!

One more question: am I right that it's okay to put just a zero matrix in place of the depth map when creating a custom dataset for evaluation purposes?

shawnrosofsky commented 3 years ago

Not sure. I haven't gotten that far, but if you don't have a true depth map, then you'd be doing inference, not evaluation, correct? I was planning to just use the function estimate_depth(self, rgb_im, rot, trans, focal_length) in the M4Depth model class for inference, as it does not require a known depth map. Then I would compare with known depth maps later.
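A hedged sketch of what that call could look like (the construction of the model object and the tensor shapes are assumptions, not the repo's documented API; only the method signature above comes from the class itself):

import numpy as np

model = M4Depth()  # hypothetical: instantiate however the repo expects

rgb_im = np.zeros((1, 384, 384, 3), dtype=np.float32)  # placeholder RGB frame
rot = np.zeros((3,), dtype=np.float32)    # camera-frame rotation coefficients
trans = np.zeros((3,), dtype=np.float32)  # camera-frame translation
focal_length = 384.0                      # focal length in pixels (placeholder)

depth_map = model.estimate_depth(rgb_im, rot, trans, focal_length)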

dimaxano commented 3 years ago

I mean inference, yep

Sorry, but maybe you can explain one more question? :)

Why do we multiply the rotation coefficients by 2 before writing them to the shard? And why do we put both the translation and rotation coefficients in [1,2,0] order?

shawnrosofsky commented 3 years ago

Unfortunately, this is the part I am stuck on in my implementation. I'm also unsure what the quaternion operations before this do. (It seems like they compute the difference between the positions and orientations, but I'm not sure what frame this is in, as a rotation is applied to the position difference. Though that may just be me being rusty on quaternions.)

shawnrosofsky commented 3 years ago

The reordering of the vectors seems a bit weird, but it appears to be what is expected by the functions of the M4Depth model class, and by its reproject function in particular.

I was able to write an inference script and run it on some custom data, a recording of an AirSim flight. The results look okay, but not spectacular, though the code runs pretty fast. The net seems to look for the "sky" and assigns that "sky" a very large depth. I noticed that this "sky" can sometimes be roads, houses, etc. The network probably needs to be finetuned on additional datasets so it doesn't do this.

One annoying difficulty I had was loading the checkpoints: I needed to load two checkpoints separately and manually assign the variables. Here is the code I used for this section of the inference script:

import tensorflow as tf

# Restore the two checkpoints separately and map the variables of each
# checkpoint onto the matching scope of the live graph.

# Feature extractor weights
ckpt1 = tf.compat.v1.train.latest_checkpoint(args.ckpt1)
ckpt_vars1 = sorted(name for name, _ in tf.compat.v1.train.list_variables(args.ckpt1))
vars1 = tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.GLOBAL_VARIABLES,
                                    scope='M4Depth/features')
vars1 = sorted(vars1, key=lambda v: v.name)
# Rely on identical sorting to pair checkpoint names with graph variables
var_list1 = dict(zip(ckpt_vars1, vars1))

# Upscaler weights
ckpt2 = tf.compat.v1.train.latest_checkpoint(args.ckpt2)
ckpt_vars2 = sorted(name for name, _ in tf.compat.v1.train.list_variables(args.ckpt2))
vars2 = tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.GLOBAL_VARIABLES,
                                    scope='M4Depth/upscaler')
vars2 = sorted(vars2, key=lambda v: v.name)
var_list2 = dict(zip(ckpt_vars2, vars2))

saver1 = tf.compat.v1.train.Saver(var_list=var_list1)
saver2 = tf.compat.v1.train.Saver(var_list=var_list2)
sess = tf.compat.v1.Session()
sess.run(tf.compat.v1.global_variables_initializer())

saver1.restore(sess, ckpt1)
saver2.restore(sess, ckpt2)

Here, args.ckpt1 and args.ckpt2 default to trained_weights/M4Depth-d6/M4Depth/features and trained_weights/M4Depth-d6/M4Depth/upscaler, respectively.

(Attached image: img_Camera45_0_1624998239971532000_merged, an example inference result.)

michael-fonder commented 3 years ago

Hi all,

Shawn is correct about the factor 4 here: it is indeed because the camera is sampled at 25 Hz while the IMU is sampled at 100 Hz in the Mid-Air dataset. You don't need this factor if your position is sampled at the same rate as the camera.

dimaxano asked:

One more question: am I right that it's okay to put just a zero matrix in place of the depth map when creating a custom dataset for evaluation purposes?

If you want to perform inference without modifying the code at all, then yes, that is the way to go, even if it is suboptimal. I'll try to find some time to adapt the code for inference once our paper gets published.

Why do we multiply the rotation coefficients by 2 before writing them to the shard?

This is because we use the SO(3) parametrization for rotations (and therefore rotation matrices) in our code, not quaternions. We therefore have to convert the quaternion angles to SO(3) (more details here). We do it by using the small-angle approximation, where we assume the real part of the quaternion to be close to one and any product of two imaginary components of the quaternion to be close to zero. The factor 2 appears during this conversion.
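A minimal sketch of that conversion, assuming a unit quaternion q = [w, x, y, z] close to the identity:

import numpy as np

def quat_to_rotvec_small_angle(q):
    # q = [cos(theta/2), sin(theta/2) * axis]; for small theta,
    # sin(theta/2) ~ theta/2, so the SO(3) rotation vector
    # theta * axis ~ 2 * [x, y, z] -- hence the factor 2.
    w, x, y, z = q
    return 2.0 * np.array([x, y, z])

print(quat_to_rotvec_small_angle([0.9999, 0.005, -0.002, 0.001]))
# -> [ 0.01  -0.004  0.002]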

And why do we put both the translation and rotation coefficients in [1,2,0] order?

The positioning data was recorded with the NED axis convention. The code was written to work with the standard camera axis convention (x on the horizontal axis, y on the vertical axis, and z for the depth). I reordered the axes to move from the NED convention to the camera axis convention.
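As a small illustration of this reindexing (axis conventions as described above; the numbers are placeholders):

import numpy as np

# NED body axes: x = forward (North), y = right (East), z = down.
t_ned = np.array([1.0, 0.2, -0.1])

# Camera axes: x = right, y = down, z = forward (depth).
# Taking the NED components in [1, 2, 0] order yields (right, down, forward).
t_cam = t_ned[[1, 2, 0]]
print(t_cam)  # [ 0.2 -0.1  1. ]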

michael-fonder commented 3 years ago

Hi Shawn,

One annoying difficulty I had was loading the checkpoints: I needed to load two checkpoints separately and manually assign the variables.

Thank you for your feedback. This is a leftover from some transfer-learning experiments I performed. I didn't realize this would be an issue for reusing the code when I uploaded it. I'm adding this to my list of things to improve in this repository.

I was able to write an inference script and run it on some custom data, a recording of an AirSim flight. The results look okay, but not spectacular.

I must admit that I would also have expected better results. Did you try to reproduce our results on the Mid-Air dataset with your modified code, to be sure that it's working properly? If yes, it seems the network would indeed need some finetuning. Please note that, in any case, we tried to reduce the importance of semantics for the prediction with our network, but in its current state it still relies partially on semantics to make its prediction. As your inference data is very different from our training data, it is difficult to predict how well the network will behave without any finetuning.

michael-fonder commented 3 years ago

Hi all,

It's been a long time without any activity on this thread.

@dimaxano , did we answer all your questions or do you still have some?

@shawnrosofsky, did you manage to get things working? In the meantime, I realized that something may not have been clear to you: the motion expected by the network must be expressed with respect to the camera frame of reference. In your test, the camera is tilted downward, so the transformation to get your motion from the drone body frame of reference to the camera frame will be more complex than what I have in my script. If you don't do this transformation properly, the reprojection module won't perform any meaningful operation and the network output will be garbage. If you have access to the ground-truth depth map, you can write a test script to check that the reprojection module works as expected with your motion information: reproject the RGB pictures and see whether the reprojected pictures match the reference ones.
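For illustration, a rough sketch of such a transformation for a camera pitched downward (axis conventions as in the earlier answers; the signs depend on your mounting, so verify them against your own setup):

import numpy as np

def body_to_camera(v_body, tilt):
    # Express a motion vector given in the drone body frame
    # (x = forward, y = right, z = down) in the frame of a camera
    # pitched downward by `tilt` radians about the body y axis.
    # Rows of R_cb are the camera axes written in body coordinates.
    c, s = np.cos(tilt), np.sin(tilt)
    R_cb = np.array([[0.0, 1.0, 0.0],   # cam x: right
                     [-s,  0.0, c  ],   # cam y: down, tilted
                     [c,   0.0, s  ]])  # cam z: optical axis, pitched down
    return R_cb @ v_body

# With tilt = 0 this reduces to the [1, 2, 0] reordering discussed above.
print(body_to_camera(np.array([1.0, 0.0, 0.0]), np.deg2rad(30.0)))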

dimaxano commented 3 years ago

Hi @michael-fonder

A couple more questions on data preparation:

  1. I implemented a pipeline for making inference on a single image (i.e. creating a sequence of 5 identical images with the identity transformation between them). Do you think that will worsen the depth estimation results?
  2. Also, I want to try training on custom data from AirSim. Do you maybe have code for gathering the necessary data from AirSim, like in Mid-Air?

michael-fonder commented 3 years ago

Hi @dimaxano,

I implemented a pipeline for making inference on a single image (i.e. creating a sequence of 5 identical images with the identity transformation between them). Do you think that will worsen the depth estimation results?

The network is designed to work on image sequences, using the camera motion and the difference in visual content between the images of the sequence. It won't work as expected for inference on a single picture. If you try to 'cheat' by giving it a sequence that consists of copies of the same picture, it will always predict an infinite depth, since identical frames exhibit no parallax.

That said, in the code you can set the sequence length to 1 (see --seq_len). In that case, you need to set --special_case to 1 for the network to build successfully. When you do so, the network will try to rely on semantics to predict depth, but its performance will be far worse (~50% poorer scores in our tests). So, in short, if you have access to an image sequence, try to use it ;)

Also, I want to try training on custom data from AirSim. Do you maybe have code for gathering the necessary data from AirSim, like in Mid-Air?

Yes, I have the code used to gather the data of Mid-Air, but I'm sorry to tell you that we don't share it, for the following reasons (among others). First, it was written a while ago for a customized version of AirSim, so it probably doesn't work out of the box with an up-to-date AirSim plugin. Second, in order to gather all the data we needed, we had to set up a quite convoluted pipeline involving several pieces of software. The setup and the data generation are quite complex and not user-friendly at all.

I can however help you if you have some specific questions on that topic.