kkaiwwana / MVPbev

[ACM MM24 Poster] Official implementation of paper "MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability"

some questions about paper #3

Open KevenLee opened 1 week ago

KevenLee commented 1 week ago

I have several questions:

  1. The article does not explicitly encode 3D traffic participants. Given the lack of constraints, how do you ensure that generated 3D objects obey normal scene logic? For example, could a car in a perspective image appear on top of a tree or on the surface of a river?
  2. Do the generated images need further annotation? If so, how should they be annotated?
  3. The testing process allows for control over instances, but it seems that it cannot precisely control a specific instance.

kkaiwwana commented 1 week ago

  1. In fact, we do not strictly implement foreground object generation (i.e., 3D bbox -> perspective objects); you may notice that we report no object-generation metrics (e.g., IoU). Technically, generating foreground objects violates our assumption that the background is far enough away relative to the discrepancy between camera locations. Under that assumption, homography estimation can be applied to ensure consistent background generation, and that is what we highlight in our paper (see the homography sketch after this list). By comparison, foreground objects are much closer to the cameras, which is beyond the scope of our work (and we did not design a method for it). In short, foreground objects can NOT be handled in the 2D setting in which our method operates. In our main paper, Sec. 3.2, we state: "Assuming that instance-level masks can be obtained at each view with either existing methods or simple retrieval"; we explain this in more detail in supplementary material Sec. 5 (it should be available; let me know if it is not). This is a feasible solution, and notably, our foreground instance control is based exactly on that assumption: we can obtain an instance mask in the perspective view, regardless of how that mask is produced. That is why we directly segment objects' masks from the ground-truth image.

  2. No. The ideal way to use these generative methods is: i. plausible annotations are available (from some generation method); ii. images are generated along with those annotations for a downstream task. A minimal sketch of this workflow is given after this list.

  3. On the contrary, it can precisely control a specific instance. We explain the method in detail in the supplementary material. Check our implementation code (tip: only the class `AttnProcessor` is used; the other two functions are deprecated) and the demo notebook. A toy sketch of the mask-gated attention idea is also given after this list.
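
To make the far-background assumption from answer 1 concrete: when the scene is distant relative to the baseline between neighboring cameras, the translation between views can be ignored and pixels map between views through the infinite homography H = K2 · R · K1⁻¹. The sketch below is illustrative only, not this repository's code; `K1`/`K2` are assumed per-view intrinsics and `R_rel` the relative rotation between the two cameras.

```python
import numpy as np
import cv2

def infinite_homography(K1: np.ndarray, K2: np.ndarray, R_rel: np.ndarray) -> np.ndarray:
    """3x3 homography mapping view-1 pixels to view-2 pixels.

    Valid only when camera translation is negligible compared to scene depth,
    i.e. the far-background assumption: H = K2 @ R_rel @ inv(K1).
    """
    return K2 @ R_rel @ np.linalg.inv(K1)

def warp_to_neighbor(img1: np.ndarray, K1, K2, R_rel, out_size) -> np.ndarray:
    """Warp view 1 into view 2's frame; overlapping background should agree."""
    H = infinite_homography(K1, K2, R_rel)
    return cv2.warpPerspective(img1, H, out_size)  # out_size = (width, height)
```

Foreground objects break this model precisely because they are close to the cameras, which is why the answer above rules them out.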
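For answer 2, here is a hypothetical end-to-end sketch of the intended usage; `sample_bev_layout` and `mvpbev_generate` are placeholder names, not functions from this repository:

```python
from typing import Any

def sample_bev_layout() -> Any:
    """Placeholder: produce a plausible BEV semantic layout (step i)."""
    ...

def mvpbev_generate(bev_layout: Any) -> Any:
    """Placeholder: run the multi-view generator conditioned on the layout (step ii)."""
    ...

# The conditioning layout doubles as the label, so no manual annotation is needed.
layout = sample_bev_layout()
images = mvpbev_generate(layout)
training_pair = (images, layout)  # ready-made (image, annotation) pair for a downstream task
```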
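For answer 3, the repository's `AttnProcessor` and the demo notebook are authoritative; the following is only a minimal sketch of the general idea (mask-gated cross-attention in a diffusers-style attention processor), with the masking and blending details assumed rather than taken from the paper:

```python
import torch

class InstanceAttnProcessor:
    """Illustrative only: queries inside an instance mask attend to a
    per-instance prompt embedding instead of the global text embedding."""

    def __init__(self, instance_mask: torch.Tensor, instance_embeds: torch.Tensor):
        self.instance_mask = instance_mask      # (H*W,) bool, resized to the attention resolution
        self.instance_embeds = instance_embeds  # (1, seq_len, dim) instance prompt embedding

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states

        # Standard diffusers-style attention path.
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        out = torch.bmm(attn.get_attention_scores(query, key, attention_mask), value)

        if is_cross:
            # Second pass against the instance embedding, then blend the two
            # results using the flattened spatial mask of that instance.
            emb = self.instance_embeds.expand(hidden_states.shape[0], -1, -1)
            ikey = attn.head_to_batch_dim(attn.to_k(emb))
            ival = attn.head_to_batch_dim(attn.to_v(emb))
            iout = torch.bmm(attn.get_attention_scores(query, ikey, None), ival)

            m = self.instance_mask.to(out.dtype).view(1, -1, 1)  # broadcast over heads/dims
            out = m * iout + (1.0 - m) * out

        out = attn.batch_to_head_dim(out)
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout
```

Because the gate only needs a perspective-view mask, it does not matter how that mask was obtained, which matches the assumption quoted from Sec. 3.2.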