kami93 / CMU-DATF


Cannot reproduce Argoverse result #4

Open · TsuTikgiau opened this issue 4 years ago

TsuTikgiau commented 4 years ago

Hello! I tried to train a model on Argoverse using the default arguments in the readme (only changing the dataset from nuscenes to argoverse):

--tag='Global_Scene_CAM_NFDecoder' --model_type='Global_Scene_CAM_NFDecoder' --dataset=argoverse --batch_size=4 --num_epochs=100 --gpu_devices=3 --map_version '2.0'

The model converged well after 35 epochs and I stopped training at the 48th epoch. However, there is a big performance gap between my reproduction (minADE: 1.112, minFDE: 1.802 on the validation set) and the numbers reported in the paper (minADE: 0.806, minFDE: 1.252). I guess there is something I missed. Is my argument setting correct? Thank you!

os1a commented 4 years ago

Hi @TsuTikgiau,

I was wondering how you were able to train. I got the following error when I started training:

File "main.py", line 637, in train(args) File "main.py", line 309, in train trainer.train(args.num_epochs) File "/misc/lmbraid21/makansio/CMU-DATF/Proposed/utils.py", line 95, in train train_loss, train_qloss, train_ploss, train_ades, train_fdes = self.train_single_epoch() File "/misc/lmbraid21/makansio/CMU-DATF/Proposed/utils.py", line 277, in train_singleepoch z, mu, sigma, motionencoding, sceneencoding = self.model.infer(future_agents_traj+perterb, past_agents_traj, past_agents_traj_len, future_agent_masks, episod e_idx, decode_start_vel, decode_start_pos, num_past_agents, scene_images) File "/misc/lmbraid21/makansio/CMU-DATF/Proposed/models.py", line 422, in infer interp_locs = torch.cat((init_loc, prev_locs), dim=1) # [A X Td X 2] RuntimeError: Sizes of tensors must match except in dimension 1. Got 48 and 4 in dimension 0

I use the same training parameters as the ones you used. Would appreciate some help there.

TsuTikgiau commented 4 years ago

Hi @os1a, yes, I also faced this problem when I tried to run experiments on Argoverse. I checked the code and found that it happens because decode_start_vel and decode_start_pos contain the velocity and location of agents we don't predict. I removed them by changing the code here to

episode = (past_agents_traj, past_agents_traj_len, future_agents_traj, future_agents_traj_len, future_agent_masks, decode_start_vel[future_agent_masks], decode_start_pos[future_agent_masks], map_image, prior, scene_id)

Basically, just use future_agent_masks, which indicates which agents we want to predict, to select those agents. I remember that some other locations have the same issue in addition to this part. Change them to the correct format as well and it will work in the end.
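For anyone else hitting the same size mismatch, here is a minimal sketch of the masking idea described above (toy shapes only; the variable names follow this comment, not necessarily the exact surrounding code in the repo):

```python
import torch

# Toy example: 4 agents are observed in the scene, but only 2 are prediction targets.
future_agent_masks = torch.tensor([True, False, True, False])
decode_start_vel = torch.randn(4, 2)   # velocity of every observed agent
decode_start_pos = torch.randn(4, 2)   # position of every observed agent

# Keep only the agents we actually decode, so these tensors line up with the
# future trajectories that are passed to model.infer(...).
decode_start_vel = decode_start_vel[future_agent_masks]   # shape [2, 2]
decode_start_pos = decode_start_pos[future_agent_masks]   # shape [2, 2]
```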

os1a commented 4 years ago

Hi @TsuTikgiau

Thanks for your answer. I got another problem where agent_tgt_three_mask is not defined: https://github.com/kami93/CMU-DATF/blob/master/Proposed/utils.py#L288

Best, Osama

TsuTikgiau commented 4 years ago

Hello @os1a, according to my understanding, this is the three_mask here. @kami93 @argyroneta-aquatica, can you confirm whether my modification is correct?

os1a commented 4 years ago

Thanks. It would also be nice if they could tell us more about the meaning of three_mask and two_mask, and why they are needed.

kami93 commented 4 years ago

Hi, @TsuTikgiau

According to our research note, the options we used for that specific ablation were:

--model_type='Global_Scene_CAM_NFDecoder' --num_epochs=100 --agent_embed_dim=128 --batch_size=64 --dataset='argoverse' --map_version='2.0'

kami93 commented 4 years ago

> Thanks. It would also be nice if they could tell us more about the meaning of three_mask and two_mask, and why they are needed.

@os1a They are used for two primary purposes: 1) flow models require a fixed output shape so that inverse mappings can work properly, so we filter out samples that do not have a full 3-second future; 2) they are used to gather outputs of 2-second and 3-second length when calculating FDE2/ADE2 and FDE3/ADE3, respectively.
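To make that concrete, here is a hedged sketch of how such masks could be built and used for the metrics; the tensor shapes and the frame counts for the 2-second and 3-second horizons are assumptions for illustration, not the repository's actual values:

```python
import torch

# Hypothetical per-agent future lengths (in frames).
future_agents_traj_len = torch.tensor([30, 30, 20, 12])
TWO_SEC_FRAMES, THREE_SEC_FRAMES = 20, 30  # assumed horizon lengths

# 1) The flow decoder needs a fixed output length, so keep only agents
#    that have a full 3-second future.
three_mask = future_agents_traj_len >= THREE_SEC_FRAMES
# 2) A looser mask for agents usable in the 2-second metrics.
two_mask = future_agents_traj_len >= TWO_SEC_FRAMES

pred = torch.randn(4, THREE_SEC_FRAMES, 2)  # [agents, frames, xy]
gt = torch.randn(4, THREE_SEC_FRAMES, 2)

# ADE/FDE over the 3-second horizon, only for agents passing three_mask.
err3 = (pred[three_mask] - gt[three_mask]).norm(dim=-1)   # [A3, 30]
ade3, fde3 = err3.mean(), err3[:, -1].mean()

# ADE/FDE over the first 2 seconds, for agents passing two_mask.
err2 = (pred[two_mask, :TWO_SEC_FRAMES] - gt[two_mask, :TWO_SEC_FRAMES]).norm(dim=-1)
ade2, fde2 = err2.mean(), err2[:, -1].mean()
```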

os1a commented 4 years ago

@TsuTikgiau @kami93 During training, after the first epoch finishes, I got the following error during validation:

Traceback (most recent call last):
  File "main.py", line 643, in <module>
    train(args)
  File "main.py", line 315, in train
    trainer.train(args.num_epochs)
  File "/misc/lmbraid21/makansio/CMU-DATF/Proposed/utils.py", line 96, in train
    valid_loss, valid_qloss, valid_ploss, valid_ades, valid_fdes, scheduler_metric = self.inference()
  File "/misc/lmbraid21/makansio/CMU-DATF/Proposed/utils.py", line 469, in inference
    decode_start_vel = decode_start_vel.to(self.device)[agent_tgt_three_mask]
IndexError: The shape of the mask [16] at index 0 does not match the shape of the indexed tensor [4, 2] at index 0

Any idea how to solve it?
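For reference, that IndexError is the generic PyTorch complaint when a boolean mask is built over one set of agents (here 16) but applied to a tensor that has already been reduced to another (here 4). A toy reproduction of the shape rule, not the repository's code:

```python
import torch

decode_start_vel = torch.randn(4, 2)                     # already reduced to 4 target agents
agent_tgt_three_mask = torch.ones(16, dtype=torch.bool)  # built over all 16 agents in the batch

# Boolean indexing requires mask.shape[0] == tensor.shape[0]:
# decode_start_vel[agent_tgt_three_mask]   # -> IndexError: mask [16] vs tensor [4, 2]

# It works once the mask is defined over the same agent set as the tensor:
mask = torch.ones(4, dtype=torch.bool)
print(decode_start_vel[mask].shape)                      # torch.Size([4, 2])
```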

kami93 commented 4 years ago

@os1a Can you write down the full command arguments? I would be able to help you more effectively if I can reproduce your problem.

os1a commented 4 years ago

I started training with the following command:

python main.py --tag Global_Scene_CAM_NFDecoder --model_type Global_Scene_CAM_NFDecoder --dataset argoverse --batch_size 4 --num_epochs 100 --train_cache caches/argo_train_cache.pkl --val_cache caches/argo_val_cache.pkl --map_version 2.0

Apparently the error appears in https://github.com/kami93/CMU-DATF/blob/3a818d58ea6fa2901602b13f1745a51b049d125a/Proposed/utils.py#L466 https://github.com/kami93/CMU-DATF/blob/3a818d58ea6fa2901602b13f1745a51b049d125a/Proposed/utils.py#L467

If I replace the agent_tgt_three_mask with three_mask, then it works. Is this right?

TsuTikgiau commented 4 years ago

Hello @kami93, do you also have the arguments for AttGlobal? And would it be possible for you to share the trained weights? Thank you!

TsuTikgiau commented 4 years ago

Hello @kami93, I tried the arguments you listed for Global_Scene_CAM_NFDecoder:

--model_type='Global_Scene_CAM_NFDecoder' --num_epochs=100 --agent_embed_dim=128 --batch_size=64 --dataset='argoverse' --map_version='2.0'

Unfortunately, the performance looks similar to before: after convergence with 12 hypotheses I get Valid minADE[2/3]: 0.7793 / 1.1940 | Valid minFDE[2/3]: 1.0902 / 1.9218... Maybe I am doing something wrong. @os1a, have you made the code runnable, and do you have some results already?

os1a commented 4 years ago

@TsuTikgiau Yes, the code is running, but training is somehow slow; it would take at least 2 days to finish. Do you have any suggestions to make it faster? What configuration are you using?

TsuTikgiau commented 4 years ago

@os1a If you change the batch size from 4 to 64, training is relatively fast: Global_Scene_CAM_NFDecoder converges within 5 hours (about 50 epochs on an RTX 2080 Ti; this is also the setting they suggest), although I notice the performance is worse than with batch size 4. All the other arguments are at their default values, which are consistent with the values they gave in this issue.

For AttGlobal_Scene_CAM_NFDecoder, the running time for one epoch at batch size 64 is about 10 minutes in my case, slower than Global_Scene_CAM_NFDecoder. I haven't finished it yet, but it looks like it will converge within 10 hours (RTX Titan). Note that the batch size setting for AttGlobal in the paper is 4, with all the other arguments at their default values.

os1a commented 4 years ago

@TsuTikgiau I have noticed that the main bottleneck in the training time is data loading. Did you arrange to put the data on an SSD? The default value for num_workers is 20, and I have noticed that during training it prints for 20 iterations, then hangs for some time, then prints for another 20 iterations. That is why I assume it is a data loading issue.
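On the data loading point, a hedged sketch of the standard PyTorch DataLoader knobs that are worth experimenting with (the dataset below is a stand-in; the repo builds its own Argoverse dataset in main.py, and the exact values here are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader():
    # Stand-in dataset; in this repo it would be the Argoverse dataset built in main.py.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 2))
    return DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,            # try fewer than 20 if workers starve on a slow disk
        pin_memory=True,          # speeds up host-to-GPU copies
        persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
        prefetch_factor=2,        # batches each worker preloads
    )

if __name__ == "__main__":
    for batch_inputs, batch_targets in build_loader():
        pass  # training / validation step would go here
```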

TsuTikgiau commented 4 years ago

@os1a Oh yes, my hard disk is NVMe... In my case batch size 4 also takes about 2 days to finish, but I don't face this data loading hardware issue.

os1a commented 3 years ago

@TsuTikgiau

I have just realized that the number of candidates when training on Argoverse should be 6 (the default is 12). See Section 6 (Generalizability across datasets) of the paper.

BTW, did you manage to reproduce the results for the AttGlobal_Scene_CAM_NFDecoder?

TsuTikgiau commented 3 years ago

Hello @os1a, I cannot reproduce their results for AttGlobal with 6 candidates, and actually not even with 12 candidates...

liwangcs commented 3 years ago

Hello @kami93 @os1a @TsuTikgiau,

I encountered some data preprocessing problems when experimenting on the Argoverse dataset. The error appears on lines 284 and 316 of argoverse.py (functions _getdata and _extractdirectory):

NotADirectoryError: [Errno 20] Not a directory: './data/argoverse/train/1.csv/observation'

The structure of Argoverse data in my project is:

-data
  |- argoverse
    |- train
       |- 1.csv
       |- 2.csv
    |- val
    |- test

Can you show the expected structure of the Argoverse data? The readme file does not describe it in detail.

Thank you!

MohammadHossein-Bahari commented 3 years ago

Hello @os1a @TsuTikgiau @liwangcs @Manojbhat09, Could any of you reproduce the results? Does anyone know how to preprocess the data?

kami93 commented 3 years ago

Hello all. Thank you for your interest in our model.

Please check the update at #9 (comment)