Hello, the official download link for the project's pre-trained model is no longer valid. If you have swint-nuimages-pretrained.pth, could you share it with me at [balms123456@gmail.com]? Wishing you all the best in your research.
@GerhardArya, I am also trying this with my custom point clouds and set load_dim to 4, but I end up with the error below at the end of the first epoch. Please refer to the log:
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[9838,1],0] Exit code: 1
Let me know if you come across this error and if you have any workaround.
@kentang-mit , any comments on the above error?
@VeeranjaneyuluToka I haven't run into that error since I'm currently still trying to create the custom create_data, converter, dataset, and create_gt_database needed to get my dataset running with BEVFusion.
I haven't had the chance to try training with my custom dataset yet. Since my dataset doesn't have nuScenes tokens and so on, I can't use the nuScenes evaluation like BEVFusion does, so I have to reverse-engineer the nuScenes evaluation and customize it to my needs, since I'm trying to keep this integration as close to BEVFusion as possible.
I'm taking a slightly different approach from yours so far. My dataset doesn't have sweeps on top of samples. So, instead of removing the 5th dimension (the timestamp relative to the sample, used when the frame is a sweep), I treat each frame I have as a sample and simply set the 5th dimension to 0, which is how BEVFusion seems to treat that dimension for samples. I'm not sure if this understanding is correct, so any input from the authors would be highly appreciated.
But, I'll keep an eye out for this error if I get it as well.
@VeeranjaneyuluToka Just an update for you: I have now finished the first version of all the code I was working on and I'm currently training the LiDAR bbox detector part.
To do this, I'm basically reusing the nuScenes voxelnet configs with some changes to fit my dataset and the data it contains. I have trained 9 epochs so far at the time of writing and I don't seem to run into your issue. Metrics seem to be developing reasonably well so far.
I will keep an eye out for the issue in case it appears in the future (hopefully it won't).
Based on what I could get from your stack trace, it seems like a configuration issue? Maybe you could check whether load_dim is properly changed everywhere, or just do what I did: use load_dim=5 and fill the 5th dimension (timestamp) properly. For a sample frame it is always 0. For sweeps, calculate it as (sweep_ts / 1e6) - (sample_ts / 1e6), which is how BEVFusion seems to be doing it.
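For reference, a minimal sketch of that filling step, assuming microsecond timestamps for both the frame and its sample (the helper name is my own, not something from the repo):

```python
import numpy as np

def add_time_dim(points_xyzi, frame_ts_us, sample_ts_us):
    """Append a relative-time column to (N, 4) points (x, y, z, intensity).

    frame_ts_us / sample_ts_us are microsecond timestamps; for the sample
    frame itself they are equal, so the column is all zeros.
    """
    dt = (frame_ts_us / 1e6) - (sample_ts_us / 1e6)  # seconds relative to the sample
    time_col = np.full((points_xyzi.shape[0], 1), dt, dtype=points_xyzi.dtype)
    return np.hstack([points_xyzi, time_col])        # (N, 5), matching load_dim=5
```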
Sorry folks, I was busy with other projects recently and finally got a chance to go through the bevfusion issues today. Actually, the number of feature dimensions does not really matter for the codebase, and it is totally fine to change load_dim from 5 to 4 and adjust the input channels of the model accordingly. Timestamps, however, can be very helpful for the final performance. The important thing is that you assign different numbers to points from different times; the specific formulation of these numbers may not be that important. You can even use a relative frame index and achieve similar performance.
Best, Haotian
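To make the "relative frame index" suggestion concrete, here is a minimal sketch, assuming a few past frames are aggregated per sample; the helper and its arguments are made up for illustration and are not part of BEVFusion:

```python
import numpy as np

def add_frame_index(frames):
    """frames: list of (N_i, 4) point arrays, ordered current frame first.

    Tags each point with the (negated) index of the frame it came from,
    so points from different times get different values in the 5th column.
    """
    tagged = []
    for idx, pts in enumerate(frames):
        col = np.full((pts.shape[0], 1), float(-idx), dtype=pts.dtype)
        tagged.append(np.hstack([pts, col]))
    return np.concatenate(tagged, axis=0)
```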
@kentang-mit Thanks for the reply and no worries! :smile: One question I still have is related to yaw.
If my dataset has data and labels in the LiDAR coordinate system, the LiDAR is mounted with the x axis already pointing forward, and I'm using a copied and customized version of the nuScenes evaluation protocol that evaluates in the LiDAR coordinate system (AFAIK nuScenes evaluates in the global coordinate system), is it actually okay for me to skip the -yaw - pi/2 rad transformation that is usually applied to the yaw angles?
I'm just wondering because I have successfully trained, evaluated, and visualized the results of a voxelnet0p075 model trained on LiDAR only while skipping this yaw operation. The results are okay (0.56-0.57 mAP on the test set; I might need to tune the score threshold for TransFusion, since I seem to have some false negatives at 0.1 but also a string of false positives on empty space with the default 0.0).
But there are some inaccuracies with orientation, plus doubled/tripled detections. The orientation is generally correct (though not perfect), but there are frames where it is off by almost 90 degrees, and there are multiple detections of the same object (mostly for buses/trucks/trailers and far-away vehicles). Do you have any suggestions on what might help improve orientation accuracy?
An example:
@kentang-mit Never mind. I found that the cause of the worst of these misalignments was that I still need to change the yaw angles in my dataset's ground truth to -yaw. Once I did that, the issue was mostly solved.
Having said that, the orientation is still wrong in some edge cases (near blind spots, farther-away areas, areas with few points). Do you have any suggestions on how to improve orientation accuracy in these areas?
Another thing is that training camera + LiDAR does not seem to boost mAP at all in my case. It even seems to reduce mAP slightly.
LiDAR only:
LiDAR + Camera:
One thing to note about my dataset is that it is infrastructure data, taken from sensors mounted on a gantry rather than from a car at ground level. Could this affect the pretrained camera backbone, since it was trained on vehicle-level data (nuImages), and cause this odd LiDAR vs. camera + LiDAR result?
Also, it has 1920 training frames, 240 validation frames, and 240 test frames. I don't know if this is too small to get results comparable to yours.
Also, I would like to change the backbones in the future (PointPillars for LiDAR and maybe YOLOv8's CSPDarknet for images). Is there anything I would need to particularly look out for when trying to do this?
@GerhardArya, I still have the same issue with 4D point clouds. Would you mind sharing your config field changes here? Also, did you try to train a camera-only model?
@VeeranjaneyuluToka Yes, I did try to train a camera-only model, but the results were absolutely horrible: 0 mAP overall and for every class after 20 epochs. I'm not sure what is going on with camera-only. Have you tried camera-only before? Could you check whether my camera-only configs are correct?
Configs for camera + LiDAR: configs_cam_lidar.txt
Configs for camera only: configs_cam_only.txt
Here are the visualizations of camera and LiDAR feature maps from the camera + LiDAR training:
Camera:
LiDAR:
Note: the visualization was created quickly, so the grid is not representative of the actual data. (0, 0) is directly in the middle in my data, and the front of the LiDAR points downward in that visualization.
It seems like the camera backbone can't extract features that make sense, while the LiDAR backbone did a good job. I tried visualizing the camera-only training as well, but the result is basically the same as the camera feature map above.
@GerhardArya, thanks for the quick reply. I have been trying the camera-only model and the bbox loss tensor shows 0, as shown here: https://github.com/mit-han-lab/bevfusion/issues/371. Please have a look and let me know if you have any suggestions.
Also, TensorBoard only displays LR and momentum. Are there any config changes I need to make so that the training and validation losses also show up in TensorBoard?
I also tried to use the tools/visualize.py script to visualize my GT, but it does not show any bboxes on the image. I primarily suspect the transformations. If you look at the implementation here https://github.com/mit-han-lab/bevfusion/blob/main/mmdet3d/datasets/nuscenes_dataset.py (line 260), lidar2image is just lidar2image = camera_intrinsics @ lidar2camera_rt.T. I think somewhere it has to bring both the LiDAR and camera coordinates into the same coordinate system; I'm not sure where that is happening. Discussing more on this topic in this issue: https://github.com/mit-han-lab/bevfusion/issues/394
I will check your camera only config and get back to you. Thanks!
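On the lidar2image question above: one reading is that lidar2camera_rt already encodes the LiDAR-to-camera transform, so composing it with the intrinsics (padded to 4x4) gives a direct LiDAR-to-pixel projection and there is no separate unification step. A hedged sketch under a plain column-vector convention (the actual code in nuscenes_dataset.py uses a transposed layout, so this is illustrative only):

```python
import numpy as np

def build_lidar2image(K, lidar2camera):
    """K: 3x3 camera intrinsics, lidar2camera: 4x4 extrinsic (LiDAR -> camera)."""
    viewpad = np.eye(4)
    viewpad[:3, :3] = K
    return viewpad @ lidar2camera           # 4x4 LiDAR -> image projection

def project(points_lidar, lidar2image):
    """points_lidar: (N, 3) in the LiDAR frame; returns (M, 2) pixel coords."""
    homo = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    cam = (lidar2image @ homo.T).T          # (N, 4); column 2 is the depth
    cam = cam[cam[:, 2] > 1e-3]             # keep points in front of the camera
    return cam[:, :2] / cam[:, 2:3]         # perspective divide -> pixels
```

If GT boxes projected with such a matrix land in the right place in the image, the LiDAR and camera frames are consistent; if not, the extrinsic (rather than the visualization script) is usually the problem.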
@VeeranjaneyuluToka Hmm, unfortunately I don't quite know yet what might cause your issues. I will try training a camera-only model again later today, but right now I'm trying to make sure that my transformations are all correct, and then I'll retry camera + LiDAR finetuning on my existing LiDAR-only model.
Regarding the visualization script, I think it only uses your lidar2image, which is a projection matrix. My projection matrix was already correct (I got it from my dataset), since I could display my GT correctly using that script. Right now I'm trying to make sure that my lidar2camera and the other matrices are correct, assuming my intrinsics are already correct.
Regarding TensorBoard, unfortunately I haven't touched it so far. My feature map visualization was done by inserting code in bevfusion.py that saves the feature maps; when I'm training I simply comment out the lines calling that function. If I ever decide to use TensorBoard, I'll get back to you.
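For anyone who wants to reproduce that feature map dump, this is roughly the kind of helper meant above; where you call it from (e.g. inside the fusion model's forward) is up to you, and the function itself is just a sketch:

```python
import matplotlib
matplotlib.use("Agg")  # no display needed on a training server
import matplotlib.pyplot as plt
import torch

def save_bev_feature(feat, path):
    """feat: (C, H, W) or (1, C, H, W) BEV feature tensor from one modality."""
    if feat.dim() == 4:
        feat = feat[0]
    # Collapse channels into a single heatmap: mean of absolute activations.
    heatmap = feat.abs().mean(dim=0).detach().cpu().numpy()
    plt.imshow(heatmap, cmap="viridis")
    plt.colorbar()
    plt.savefig(path, dpi=150, bbox_inches="tight")
    plt.close()
```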
Update: after fixing/obtaining the correct transformation matrices, I managed to get the camera feature map to make a bit more sense and to look similar to the camera FOVs in the dataset.
Camera:
LiDAR:
The remaining issue now is the mAP: camera + LiDAR mAP is still lower than LiDAR-only, at 0.5533 vs. 0.5747. Any ideas on why this could be happening and what I could do to solve it?
Good! Are you trying this on a modified nuScenes or on your own dataset? What kind of corrections did you make to the transformation matrices?
I am currently experimenting with a completely new dataset generated in-house with our own sensor setup, so I have to generate all the new transformations it needs.
A vague idea: can we visualize the BEV features of both modalities and check their alignment? My gut feeling is that if the transformations and GTs are right, then it should work as expected.
I'm trying my own (infrastructure POV) dataset.
In my case, I assumed that my projection matrix and camera intrinsics were already correct, since I could already visualize my GTs correctly using BEVFusion's visualization script. I then calculated my extrinsics (transformation matrices) from there and validated them by visualizing them in Open3D and checking their locations. I then made sure to place the values correctly so that they would be as close as possible to what BEVFusion expects from the nuScenes database.
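For anyone doing the same sanity check, this is roughly what validating extrinsics with Open3D looks like: draw the LiDAR cloud plus a coordinate frame per camera pose and eyeball whether the cameras sit where they physically are. Names and the camera-to-LiDAR convention here are assumptions for illustration:

```python
import numpy as np
import open3d as o3d

def show_sensor_poses(points, camera2lidar_list):
    """points: (N, 3) LiDAR points; camera2lidar_list: list of 4x4 camera -> LiDAR transforms."""
    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(points[:, :3])

    # One big frame at the LiDAR origin, one small frame per camera pose.
    geoms = [cloud, o3d.geometry.TriangleMesh.create_coordinate_frame(size=2.0)]
    for cam2lidar in camera2lidar_list:
        frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=1.0)
        frame.transform(cam2lidar)  # place the camera frame in LiDAR coordinates
        geoms.append(frame)
    o3d.visualization.draw_geometries(geoms)
```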
I think the BEV features align pretty well and match their real-world locations in my case, based on the visualizations I posted above. It's just that, for whatever reason, the camera doesn't seem to contribute to a better mAP and actually makes mAP lower instead.
I also tried to train a camera-only model on my data again. The original cyclic LR caused the losses to go to NaN and break the training when the LR ramped up again, so I changed the LR schedule to:
lr_config:
  min_lr_ratio: 0.0001
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
This is copied from one of the TransFusion model configs. I managed to get the training to finish, and it ended with about 0.55 mAP (which is bad, since LiDAR-only or fusion can reach about 0.88 mAP during training on my data in my experience). But when I evaluated it on the test set, its mAP dropped to 0.07. When I visualized the results, there were a lot of false positives around the objects it is supposed to detect.
Currently I'm trying to change some configs for the CenterHead: increasing the score threshold from 0.1 to 0.3, increasing min_lr_ratio to 0.001, and then training again. But I'm not too optimistic. I'm utterly confused about what is going on with the camera module, since it seems to perform horribly on my data.
Maybe the cause is that the backbone was pretrained on nuImages (vehicle perspective) and I'm trying to finetune it on classes with a different naming convention (nuImages: car vs. my dataset: CAR) and a different, infrastructure perspective? I'm not sure.
Hopefully the author comes back soon and can shed some light on what could be happening...
@GerhardArya, thanks for getting back here. I have a simple question: does BEVFusion assume the LiDAR is in an FLU (Forward, Left, Up) coordinate system? I was thinking that it needs ENU (East, North, Up), since nuScenes uses it. What coordinate system is the LiDAR data you are feeding to BEVFusion in? Thanks!
@VeeranjaneyuluToka If I remember correctly, it was mentioned somewhere that it needs FLU for LiDAR. nuScenes is in ENU, but if you noticed, yaw is processed with -yaw - pi/2 rad when the pickle files are generated, essentially transforming it from ENU to FLU. At least, that's my understanding. My data is already in FLU (as far as I know), but for whatever reason it needs to be preprocessed with -yaw first, because otherwise the GT is wrong (mirrored), resulting in wrong predictions. That said, I don't think this -yaw transformation is related to my current issues with the camera backbone.
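To make the yaw handling concrete, a hedged sketch of the convention fix being discussed (not the exact converter code): nuScenes-style boxes get -yaw - pi/2 when the pickles are generated, while labels that are already in a LiDAR FLU frame but mirrored only need the sign flip.

```python
import numpy as np

def convert_yaw(yaw, already_flu=True):
    """Sketch of the yaw convention fix discussed above."""
    if already_flu:
        return -yaw               # mirrored FLU labels: sign flip only
    return -yaw - np.pi / 2       # nuScenes-style ENU -> FLU during pickle generation

def limit_period(angle):
    """Wrap angles back into (-pi, pi] if downstream code expects normalized values."""
    return angle - np.floor(angle / (2 * np.pi) + 0.5) * 2 * np.pi
```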
@GerhardArya, is this the conversion you are talking about, in nuscenes_converter.py at line 276?
@VeeranjaneyuluToka If I remember correctly, that's one of them, yes.
But I also made my own version of the nuScenes evaluation protocol used in BEVFusion. I basically read nuScenes' evaluation code, copied it, and then changed it to fit my needs. I needed to do this because my data doesn't have nuScenes' tokens, which are used for everything within nuScenes, including evaluation.
There was also something similar to this calculation in that evaluation code and in several other files, if I remember correctly. I simply searched in my IDE (Visual Studio Code) for "-rots - np.pi / 2", or even just "- np.pi / 2", and changed it to what I needed in the places/files where it makes sense for my case.
@GerhardArya, OK! A quick question on the LiDAR-only model:
Voxelization returns a zero-sized tensor for the validation data, while it works fine for training.
feats shape: torch.Size([0, 4]) coords shape: torch.Size([0, 4]) sizes shape: torch.Size([0])
Any idea on this behavior?
@VeeranjaneyuluToka Not really, because in my case it works fine for training, validation, and test.
Did you use the nuscenes_dataset.py class or did you write your own dataset class? I made my own custom dataset class with its own custom evaluation methods, my own converter script, etc. They are all based on the nuScenes equivalents that BEVFusion uses, but I changed quite a bit.
For example, my whole pipeline doesn't use tokens but uses timestamps instead. It doesn't use the custom nuScenes classes in the evaluation methods I lifted and modified from the nuScenes evaluation protocol, but uses dicts instead, plus a lot of other changes. However, the structure of the info objects in the dataset pkl file generated by my custom converter script, how the data is handled within the dataset class, etc., is similar to the nuScenes one, so overall it still functions very similarly to BEVFusion's nuScenes setup.
One thing that might be happening in your case is a bug somewhere in your dataset class when handling validation data, or something along those lines. But I can't say for sure, since I never had this problem.
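One quick way to test that hypothesis is to count how many validation points actually fall inside the configured point_cloud_range before voxelization; if that count is zero, the empty tensors are explained. A rough sketch, assuming (N, >=3) point arrays:

```python
import numpy as np

def points_in_range(points, pc_range):
    """points: (N, >=3); pc_range: [x_min, y_min, z_min, x_max, y_max, z_max]."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = (
        (x >= pc_range[0]) & (x < pc_range[3])
        & (y >= pc_range[1]) & (y < pc_range[4])
        & (z >= pc_range[2]) & (z < pc_range[5])
    )
    return int(mask.sum()), points.shape[0]

# e.g. loop over a few validation samples and print points_in_range(pts, point_cloud_range)
```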
@GerhardArya, thanks for your reply again. My approach is the same as yours, which means I have my own class for my dataset, keyed by timestamps. It's just that I am trying it with one camera and one LiDAR feed.
I am trying to change point_cloud_range but end up with the error below:
File "/home/hykeserver/anaconda3/envs/bevf_ptt19/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 33, in
I noticed you also changed it. Did you face this kind of error? Is it because of an invalid range?
@VeeranjaneyuluToka I don't quite remember how I solved it because it happened a while back, but I did get a similar error at some point. The issue for me was that I ran out of VRAM.
On another note, I solved the issue of my LiDAR-only model not reaching the performance published in the GitHub readme. The problem was that out of my 10 classes, 1 class stayed at 0 mAP no matter what I did because it had extremely little representation in the dataset. I removed that class, worked with just 9, and managed to get within about 1 mAP (63.09 mAP) of the published LiDAR-only performance on the nuScenes validation set (64.68 mAP).
The remaining issue is that fusion only increased performance by around 0.17 mAP, not the 3.84 mAP from the table. So while fusion no longer causes performance to decrease, it still doesn't really help either.
(CC: @kentang-mit )
@GerhardArya, OK! That is good to hear.
I have one more question: I noticed that you changed the point cloud range, didn't you?
Did you notice any issues when you do not change it?
I think it is important to change point_cloud_range and voxel_size based on our dataset, isn't it?
Hi @GerhardArya,
Regarding result reproduction, there are several things you can try that I have found to be helpful.
First, the GT database generation logic in our public release does not match my internal implementation. The problem lies in the origin here; changing it back to [0.5, 0.5, 0.5] will give the correct GT database. Otherwise, the cropped point cloud within each box might be wrong (see the sketch after this message).
Second, make sure you rerun tools/test.py to evaluate the results after training is finished. The mAP and NDS reported during training are lower than the normal values. Some of my colleagues reported that this could be related to the test_mode parameter in the dataset; if you set it to True during training, the reported mAP/NDS may match the separate evaluation results. I haven't tested this extensively, but it seems worth trying.
For finetuning, I would suggest you first start from the official checkpoint and the recommended training setting, because I have experimented with that setting multiple times and can guarantee that it is relatively easy to get the reported results.
Hi @VeeranjaneyuluToka,
I made a reply to your latest issue about the voxel size. Would you mind having a look at it?
Best, Haotian
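Regarding the first suggestion (the GT database origin): the origin convention decides where a box's stored (x, y, z) sits inside the box, e.g. bottom face versus geometric center, and the per-box point crop is only correct if it matches how the GT boxes were encoded. A hedged sketch of the idea, not the actual create_gt_database code:

```python
import numpy as np

def shift_box_origin(boxes, src=(0.5, 0.5, 0.0), dst=(0.5, 0.5, 0.5)):
    """boxes: (N, 7) as (x, y, z, dx, dy, dz, yaw).

    Moves the reference point of each box, e.g. from bottom-center
    (origin z = 0.0) to geometric center (origin z = 0.5). If this
    convention is wrong, the points cropped "inside" each box are
    offset by half a box height and the GT database is corrupted.
    """
    boxes = boxes.copy()
    shift = np.asarray(dst) - np.asarray(src)
    boxes[:, :3] += boxes[:, 3:6] * shift
    return boxes
```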
@kentang-mit I will try the first suggestion. It seems like that might help, since my bounding boxes also have their center at [0.5, 0.5, 0.5]. I've just started a new LiDAR-only training with the new values. I'll edit this reply later with the results.
Edit: for LiDAR-only I managed to get 64.4 mAP, and fusion managed to get 65.1 mAP (around a 0.7 mAP increase with fusion). Considering the data I have (considerably denser LiDAR than nuScenes and only 2 non-overlapping cameras), this seems to be around the best I can do for now, so I'm closing this issue.
For the second point, every mAP I reported came from running tools/test.py, so I think this is fine in my case, unless I misunderstood your suggestion.
For the third point, my latest result used basically the recommended settings. I only changed the point cloud range to what my dataset has, the post center range to fit the new point cloud range, the grid size and other parameters to fit the new point cloud range, and the sample groups of the DB sampler to better match the class distribution in my dataset. Other than those, nothing was changed.
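For anyone adapting the configs to a new range, the derived values need to stay consistent with each other. A small sketch of the arithmetic (my own helper, not part of the repo), shown with the nuScenes-style 0.075 m voxel setting:

```python
def derive_grid(point_cloud_range, voxel_size):
    """point_cloud_range: [x_min, y_min, z_min, x_max, y_max, z_max]; voxel_size: [vx, vy, vz]."""
    gx = round((point_cloud_range[3] - point_cloud_range[0]) / voxel_size[0])
    gy = round((point_cloud_range[4] - point_cloud_range[1]) / voxel_size[1])
    gz = round((point_cloud_range[5] - point_cloud_range[2]) / voxel_size[2])
    return [gx, gy, gz]

# Example with the nuScenes-style 0.075 m setting:
pc_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
voxel = [0.075, 0.075, 0.2]
print(derive_grid(pc_range, voxel))  # [1440, 1440, 40]
# post_center_range is then usually chosen slightly larger than the point cloud range.
```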
Thanks a lot for publishing the code for your great work!
I'm currently working on trying to get BEVFusion to run with a custom dataset.
I know that the nuScenes LiDAR points are in the format (x, y, z, intensity, ring_index), but it seems like BEVFusion eventually replaces ring_index with a timestamp somewhere down the road. Where does this happen?
Unfortunately, the custom dataset I want to use only has (x, y, z, intensity) for the LiDAR points. My question is: is the timestamp value crucial for BEVFusion to work properly?
I'm not sure if I have understood the code correctly so far, but it seems like it is not used anywhere. Would it be okay if I omit it and set load_dim to 4, or maybe just fill it with zeros in the converter for my custom dataset?
Thanks in advance for the help!
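For later readers, the two options discussed in this thread look roughly like this (a sketch only; load_dim/use_dim are the loader parameters in the configs): either pad an all-zero fifth column in the converter and keep load_dim=5, or drop to load_dim=4 and shrink the voxel encoder's input channels to match.

```python
import numpy as np

def pad_timestamp_column(points_xyzi):
    """Option 1: keep load_dim=5 by appending an all-zero time column to (N, 4) points."""
    zeros = np.zeros((points_xyzi.shape[0], 1), dtype=points_xyzi.dtype)
    return np.hstack([points_xyzi, zeros])

# Option 2: keep the points as (x, y, z, intensity), set load_dim/use_dim to 4 in the
# data pipeline, and reduce the voxel encoder's input channels from 5 to 4, as the
# author confirms above.
```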