jayLEE0301 / snerl_official

Official code for "SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning" (ICML 2023)

Question about NeRF pre-training #2

Closed SteveImmanuel closed 1 year ago

SteveImmanuel commented 1 year ago

Hi, nice work on the paper.

I have some questions about the NeRF pre-training (stage 1). In your paper, you mention that the offline dataset consists of 14,400 scenes, where each scene consists of 3 images from different views. You also use 4 different environments, namely window-open-v2, soccer-v2, hammer-v2, and drawer-open-v2. Could you please elaborate on the following:

  1. Does this mean you provide 14,400 scenes for each environment, or 14,400 across all environments?
  2. The way I understand NeRF, it reconstructs a single scene from images taken from multiple views. So if you have 14,400 scenes, does the NeRF reconstruct all of them at once, sequentially, or in some other way?
  3. To generate the pre-training dataset with 14,400 scenes, do you instantiate the environment, perform random actions and policies from Metaworld (as you mentioned), capture the 3 cameras to form a single scene, and then repeat that 14,400 times? Am I understanding that correctly?

Thank you.

jayLEE0301 commented 1 year ago

Thank you for your interest in our paper.

  1. We train separate encoders for each environment: window-open-v2, soccer-v2, hammer-v2, and drawer-open-v2. We use 14,400 scenes for each environment.
  2. As you mentioned, the original NeRF is for a single static scene. However, we learn a single NeRF over the whole set of 14,400 scenes for each environment. In other words, we train a single NeRF that handles dynamic scenes in which objects and the robot move. The NeRF's task is to synthesize all 14,400 different scene configurations from a latent vector encoded by a CNN for each scene (see the first sketch below).
  3. Since the robot could not achieve the task with random actions in some environments, our method requires expert trajectories (Metaworld provides expert policies in its package), which is one of the limitations of our work. We therefore collected the dataset by adding noise to the expert policy provided by Metaworld and rolling it out repeatedly (see the second sketch below).
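
To make the conditioning in point 2 concrete, here is a minimal sketch of a latent-conditioned NeRF in PyTorch. The class names (`SceneEncoder`, `ConditionalNeRF`) and all dimensions are illustrative assumptions, not the actual modules in this repo:

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """CNN that maps one observation of a scene to a latent vector z
    (illustrative stand-in for the paper's encoder)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, img):                         # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))   # (B, latent_dim)

class ConditionalNeRF(nn.Module):
    """NeRF MLP that takes positionally encoded points *and* the scene
    latent z, so one network can render every scene configuration."""
    def __init__(self, pos_dim=63, latent_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB per point
        )

    def forward(self, x_enc, z):   # x_enc: (B, N, pos_dim), z: (B, latent_dim)
        z = z.unsqueeze(1).expand(-1, x_enc.shape[1], -1)
        return self.mlp(torch.cat([x_enc, z], dim=-1))
```

Because z changes per scene while the MLP weights are shared, the same network can represent all 14,400 configurations rather than a single static scene.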
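
And here is a hedged sketch of the data-collection loop from point 3, assuming Metaworld's scripted-policy API. The camera names, the render call, and the noise scale are assumptions that vary across Metaworld versions (newer releases use the gymnasium 5-tuple `step` API):

```python
import random
import numpy as np
import metaworld
from metaworld.policies import SawyerWindowOpenV2Policy

mt1 = metaworld.MT1('window-open-v2')
env = mt1.train_classes['window-open-v2']()
policy = SawyerWindowOpenV2Policy()
cameras = ['corner', 'corner2', 'corner3']  # 3 views per scene (assumed names)

dataset = []
for episode in range(120):                  # 120 episodes ...
    env.set_task(random.choice(mt1.train_tasks))
    obs = env.reset()
    for t in range(120):                    # ... x 120 steps = 14,400 scenes
        # Expert action plus exploration noise, clipped to the action space.
        action = policy.get_action(obs)
        action = np.clip(action + np.random.normal(0, 0.1, action.shape),
                         env.action_space.low, env.action_space.high)
        obs, reward, done, info = env.step(action)
        # One "scene" = the 3 camera views captured at this timestep.
        views = [env.render(offscreen=True, camera_name=c) for c in cameras]
        dataset.append(views)
```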
SteveImmanuel commented 1 year ago

Thank you for the clarification. One small follow-up on number 2: do the 14,400 scenes come from a single episode or from multiple episodes? By a single episode, I mean from the starting state (when the agent starts to move) until the end state (task achieved or failed).

jayLEE0301 commented 1 year ago

We got the data from 120 episodes, with 120 timesteps per episode (120 × 120 = 14,400 scenes). Thank you.

SteveImmanuel commented 1 year ago

Thank you