magic-research / InstaDrag

Experiencing lightning fast (~1s) and accurate drag-based image editing

About training data #3

Closed: TomSuen closed this issue 1 month ago

TomSuen commented 1 month ago

Hi, glad to see such good work! I still have some questions.

I noticed the following in the paper:

> We sample 220k training samples from our internal video dataset to train our model. ... During training, we randomly sample [1, 20] points pairs. We randomly crop a square patch covering the sampled points and resize to 512 × 512.

Will the internal video dataset be released? And could you explain in detail how to construct the training set?

liewjunhao commented 1 month ago

Hi, thank you for your interest in our work. Unfortunately, we won’t be able to release the internal video dataset.

To construct the training set, we first curated videos with static camera movement. Internally, we have another project that has pre-processed the data so we can simply reuse it. However, we also found get_moved_area_mask from AnimateAnything (https://github.com/alibaba/animate-anything/blob/main/utils/common.py#L88) to work well in general. You can simply retain those videos whose "unmoved mask" area is larger than a certain threshold.
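For reference, a minimal sketch of that filtering step, assuming get_moved_area_mask accepts a stack of RGB frames and returns a binary mask whose non-zero pixels mark moved regions (the exact signature and return convention should be checked against the linked AnimateAnything code; the 0.6 threshold below is an arbitrary choice):

```python
import cv2
import numpy as np

# Assumed import path; copy get_moved_area_mask from
# https://github.com/alibaba/animate-anything/blob/main/utils/common.py#L88
from utils.common import get_moved_area_mask


def keep_static_camera_video(video_path, unmoved_ratio_th=0.6, max_frames=16):
    """Keep a video if the area NOT marked as moved is above a threshold.

    Assumes get_moved_area_mask takes a stack of RGB frames and returns a
    binary mask whose non-zero pixels mark moved regions; verify against the
    linked AnimateAnything code. unmoved_ratio_th is an arbitrary choice.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    if len(frames) < 2:
        return False

    moved_mask = get_moved_area_mask(np.stack(frames))
    unmoved_ratio = 1.0 - (moved_mask > 0).mean()
    return unmoved_ratio >= unmoved_ratio_th
```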

You can find the remaining steps in Section 4.1 of our arXiv paper.

TomSuen commented 1 month ago

Thank you for your reply. I will work through the process again; I may have more questions later.

TomSuen commented 1 month ago

> To construct the training set, we first curated videos with static camera movement.

As you mention in Section 4.1:

> We begin by curating videos with static camera movement, simulating drag-based editing where only local regions are manipulated while others remain static.

How do you determine whether a video has static camera movement? What algorithm did you use?

TomSuen commented 1 month ago

> You can find the remaining steps in Section 4.1 of our arXiv paper.

Section 4.1 also says:

> Finally, we adopt a similar approach as in Dai et al. [12] to extract a binary mask M highlighting the motion areas, indicating regions to be edited.

Does this mean that I should manually obtain the mask with labelme?

liewjunhao commented 1 month ago

To extract the binary motion mask M, we use get_moved_area_mask from AnimateAnything (https://github.com/alibaba/animate-anything/blob/main/utils/common.py#L88).

To curate videos with static camera movement, in our project, we simply reused the internally pre-processed data. The rough idea is to compute the homography between the two sampled frames and compare it with an identity transformation. Ideally, if there is no camera movement, the homography should be close to an identity transformation.
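If you want to implement that check yourself, a rough OpenCV-based sketch might look like the following. This is not the authors' implementation; the feature detector, RANSAC threshold, and corner-displacement threshold are arbitrary choices.

```python
import cv2
import numpy as np


def is_camera_static(frame_a, frame_b, max_corner_disp_px=2.0):
    """Rough static-camera test: fit a homography between two frames and
    measure how far it moves the image corners compared to the identity.

    frame_a / frame_b: grayscale uint8 frames of the same size.
    max_corner_disp_px: allowed mean corner displacement in pixels (arbitrary).
    """
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return False

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < 4:
        return False

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
    if H is None:
        return False

    # An identity homography leaves the image corners where they are,
    # so measure how far the estimated homography displaces them.
    h, w = frame_a.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H)
    mean_disp = np.linalg.norm(warped - corners, axis=-1).mean()
    return mean_disp < max_corner_disp_px
```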

For simplicity, you can also use the same get_moved_area_mask above to identify videos with static camera movement if the extracted "unmoved mask" area is larger than a certain threshold. However, we also noticed some drawbacks with this simple approach: (1) videos with black borders will be falsely detected as videos with a static camera; (2) drone videos where the pixel difference in the sky region is too small across frames will also be falsely detected.

TomSuen commented 1 month ago

> To extract the binary motion mask M, we use get_moved_area_mask from AnimateAnything (https://github.com/alibaba/animate-anything/blob/main/utils/common.py#L88).

I noticed that there are 2 parameters in get_moved_area_mask, move_th and th. Which values did you adopt for the dataset processing? Or is it enough to just generate a mask that roughly covers the moving subject?

TomSuen commented 1 month ago

Also, Section 4.1 says:

> Next, we sample N handle points P_hdl on I_src with a probability proportional to the optical flow strength, ensuring the selection of points with significant movement.

What is the range of N? Is it [1, 20]? Or do you, for example, first take 30 points and then randomly select [1, 20] points during training?

liewjunhao commented 1 month ago

> I noticed that there are 2 parameters in get_moved_area_mask, move_th and th. Which values did you adopt for the dataset processing? Or is it enough to just generate a mask that roughly covers the moving subject?

I used the default parameters. We find a coarse mask to be sufficient in general.

> What is the range of N? Is it [1, 20]? Or do you, for example, first take 30 points and then randomly select [1, 20] points during training?

We set N to 100. During training, we randomly sample [1, 20] points from these 100 points.
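A minimal sketch of that training-time subsampling step (the function and variable names here are illustrative, not taken from the codebase):

```python
import numpy as np


def sample_point_pairs(handle_pts, target_pts, low=1, high=20):
    """Pick a random number of point pairs in [low, high] for one training step.

    handle_pts, target_pts: aligned (N, 2) arrays of precomputed points
    (N = 100 in the setup described above).
    """
    k = np.random.randint(low, high + 1)   # number of pairs, inclusive of high
    idx = np.random.choice(len(handle_pts), size=k, replace=False)
    return handle_pts[idx], target_pts[idx]
```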

TomSuen commented 1 month ago

> We set N to 100. During training, we randomly sample [1, 20] points from these 100 points.

Thank you!

TomSuen commented 3 weeks ago

> Next, we sample N handle points P_hdl on I_src with a probability proportional to the optical flow strength, ensuring the selection of points with significant movement.

@liewjunhao Hi, I have new questions. If I train on a realistic dataset, can the model perform well on cartoon image drag? And how do you sample N handle points with a probability proportional to the optical flow strength? Is there any code for this on GitHub?

liewjunhao commented 3 weeks ago

> If I train on a realistic dataset, can the model perform well on cartoon image drag?

The performance on cartoon images depends on the capability of the base Stable Diffusion inpainting model, since the base UNet is frozen during training (of course, it also depends on the trainable Appearance Encoder). Alternatively, you can swap the base Stable Diffusion model with any personalized model during inference, e.g. https://civitai.com/models/3128/anything-v3-inpainting. However, frankly speaking, we have not validated this before.

> And how do you sample N handle points with a probability proportional to the optical flow strength? Is there any code for this on GitHub?

You can use np.random.choice(points, size=N, p=prob, replace=False) where prob is the normalized optical flow strength such that the sum is 1.
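Note that np.random.choice samples from a 1-D array, so in practice you sample indices into the candidate pixels rather than the 2-D points themselves. A minimal sketch, where flow is an assumed dense optical flow field (e.g. from an off-the-shelf estimator) and motion_mask is the motion mask discussed above (neither is specified in this thread):

```python
import numpy as np


def sample_handle_points(flow, motion_mask, n_points=100):
    """Sample handle points with probability proportional to optical flow strength.

    flow: (H, W, 2) dense optical flow between the two sampled frames.
    motion_mask: (H, W) binary mask of the moving region; only pixels inside
        it are treated as candidates.
    Returns an (n_points, 2) array of (x, y) handle point coordinates.
    """
    magnitude = np.linalg.norm(flow, axis=-1) * (motion_mask > 0)
    ys, xs = np.nonzero(magnitude)
    strengths = magnitude[ys, xs]
    prob = strengths / strengths.sum()   # normalize so the probabilities sum to 1

    # np.random.choice needs 1-D input, so draw indices rather than 2-D points.
    n = min(n_points, len(xs))
    idx = np.random.choice(len(xs), size=n, p=prob, replace=False)
    return np.stack([xs[idx], ys[idx]], axis=-1)
```

The corresponding target points can then be obtained by tracking these handle points to the second frame (CoTracker-2, mentioned below, is one way to do this).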

TomSuen commented 3 weeks ago

> You can use np.random.choice(points, size=N, p=prob, replace=False) where prob is the normalized optical flow strength such that the sum is 1.

Okay. In fact, I have been thinking about a question regarding the construction of the training set. If a point I sample in the 1st frame moves outside the image in the 2nd frame, this could cause unexpected errors in training, right?

So the ideal situation would be that, whether a pixel belongs to the static background or the moving foreground, it stays inside the image after moving. But I think it is difficult to find a large amount of such data.

Yujun-Shi commented 2 weeks ago

> If a point I sample in the 1st frame moves outside the image in the 2nd frame, this could cause unexpected errors in training, right?

Oh, actually, CoTracker-2 will identify a point as "invisible" if this kind of out-of-bound situation happens for the tracked point. So what you have to do is simply filter the points to ensure the handle and target points all stay within the image area.
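A minimal sketch of that filtering, assuming the usual CoTracker-2 output convention of pred_tracks with shape (B, T, N, 2) in (x, y) pixel coordinates and pred_visibility with shape (B, T, N); check the shapes against the CoTracker repo:

```python
import torch


def filter_point_pairs(pred_tracks, pred_visibility, src_t, tgt_t, height, width):
    """Keep only points that are visible and inside the image in both frames.

    pred_tracks: (B, T, N, 2) CoTracker-2 tracks in (x, y) pixel coordinates.
    pred_visibility: (B, T, N) boolean visibility flags.
    src_t, tgt_t: indices of the source and target frames.
    """
    tracks = pred_tracks[0]          # (T, N, 2)
    visible = pred_visibility[0]     # (T, N)

    def in_bounds(pts):
        return (pts[:, 0] >= 0) & (pts[:, 0] < width) & \
               (pts[:, 1] >= 0) & (pts[:, 1] < height)

    keep = (
        visible[src_t] & visible[tgt_t]
        & in_bounds(tracks[src_t]) & in_bounds(tracks[tgt_t])
    )
    handle_pts = tracks[src_t][keep]   # handle points on the source frame
    target_pts = tracks[tgt_t][keep]   # corresponding points on the target frame
    return handle_pts, target_pts
```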

TomSuen commented 2 weeks ago

> Oh, actually, CoTracker-2 will identify a point as "invisible" if this kind of out-of-bound situation happens for the tracked point. So what you have to do is simply filter the points to ensure the handle and target points all stay within the image area.

This reply is really important to me! Thx!