electro-logic opened this issue 7 years ago
Hey @electro-logic, the project is still in the works.
I've pretty much got a version working on the CPU, but it's way too slow (10+ min / frame), so I'm trying to get something working on the GPU.
I don't have a timeline for it yet, but I'm working on it consistently, so I'll ping back here when it's working reasonably well :).
Thank you for your answer @mihaibujanca , could you share some details about how to use the CPU instead of the GPU in your project? If the CPU implementation is working well, I can help you speed it up.
Please check out the gpu_optimisation branch; the addition there is the call to getWarp().energy_data inside kinfu.cpp.
The main reason master doesn't reconstruct well is that the warp field estimation step is not called there (see equations 6 and 7 in the paper). The main problem is that I'm currently using Ceres for that optimisation, and Ceres only seems to support the CPU. The optimisation requires estimating 6 variables per warp field node; the warp field is currently initialised from the first frame, so that means about 6 * 250k = 1.5M variables to be computed per frame (and the warp field needs to grow over time).
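To make that concrete, here is a minimal Ceres sketch of the problem shape - one 6-parameter block (rotation + translation) per warp field node. The residual is a placeholder, not the actual data term from the repo:

```cpp
#include <ceres/ceres.h>
#include <array>
#include <vector>

// Placeholder residual. The real data term (eq. 6) warps canonical points
// through the node transforms and compares them to the live frame; this stub
// only reproduces the shape: 3 residuals against 6 unknowns per node.
struct DataTermStub {
    template <typename T>
    bool operator()(const T* const node, T* residual) const {
        // node[0..2]: rotation parameters, node[3..5]: translation.
        // Drive each node's translation toward (1, 1, 1) as a dummy target.
        residual[0] = node[3] - T(1.0);
        residual[1] = node[4] - T(1.0);
        residual[2] = node[5] - T(1.0);
        return true;
    }
};

int main() {
    // The real warp field starts with ~250k nodes (one per first-frame
    // vertex), i.e. ~1.5M unknowns; a small count keeps the sketch quick.
    const int num_nodes = 1000;
    std::vector<std::array<double, 6>> nodes(num_nodes);

    ceres::Problem problem;
    for (auto& node : nodes)
        problem.AddResidualBlock(
            new ceres::AutoDiffCostFunction<DataTermStub, 3, 6>(new DataTermStub),
            nullptr, node.data());

    ceres::Solver::Options options;
    options.linear_solver_type = ceres::SPARSE_NORMAL_CHOLESKY;  // the log below used SPARSE_SCHUR
    ceres::Solver::Summary summary;
    ceres::Solve(options, &problem, &summary);
}
```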
One way of speeding it up to begin with would be subsampling the first frame to create a sparser warp field, which I'll need to do anyway. But in the end, a GPU implementation of the optimisation will be necessary.
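For instance, a stride-based sketch of that subsampling (illustrative only - a voxel-grid filter would be the more principled choice, and the cv::Vec3f vertex type is an assumption):

```cpp
#include <vector>
#include <opencv2/core.hpp>

// Keep every step-th canonical vertex as a warp field node, shrinking the
// problem from ~250k nodes to roughly 250k / step.
std::vector<cv::Vec3f> subsample_nodes(const std::vector<cv::Vec3f>& vertices,
                                       size_t step)
{
    std::vector<cv::Vec3f> nodes;
    nodes.reserve(vertices.size() / step + 1);
    for (size_t i = 0; i < vertices.size(); i += step)
        nodes.push_back(vertices[i]);
    return nodes;
}
```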
I'm currently looking at http://github.com/niessner/opt and http://docs.nvidia.com/cuda/cusolver for GPU optimisation, but it will take me a while to learn how to use them.
@electro-logic Due to the optimisation being too slow, the way I tested whether or not it's working is:
I wouldn't recommend leaving that in if you run a dataset such as umbrella.
I have compiled the gpu_optimisation branch with the umbrella dataset (only frames 100 to 150) and got this output for the first frame:
Device 0: "GeForce GT 750M" 4039Mb, sm_30, 384 cores, Driver/Runtime ver.8.0/7.50
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 3.224206e+06 0.00e+00 4.70e+02 0.00e+00 0.00e+00 1.00e+04 0 5.47e+01 5.55e+01
1 5.038094e-04 3.22e+06 5.88e-03 2.36e+02 1.00e+00 3.00e+04 1 1.35e+02 1.91e+02
2 4.087742e-10 5.04e-04 6.66e-08 4.07e-03 1.00e+00 9.00e+04 1 1.18e+02 3.09e+02
3 6.638671e-11 3.42e-10 1.41e-08 1.42e-03 1.00e+00 2.70e+05 1 1.35e+02 4.44e+02
4 1.079882e-11 5.56e-11 5.19e-09 1.04e-03 1.00e+00 8.10e+05 1 1.23e+02 5.67e+02
5 1.664163e-12 9.13e-12 1.25e-09 6.85e-04 1.00e+00 2.43e+06 1 1.44e+02 7.11e+02
6 5.191461e-13 1.15e-12 2.46e-10 4.11e-04 1.00e+00 7.29e+06 1 1.65e+02 8.76e+02
7 2.560969e-13 2.63e-13 9.02e-11 4.27e-04 1.00e+00 2.19e+07 1 1.44e+02 1.02e+03
Solver Summary (v 1.13.0-eigen-(3.3.4)-lapack-suitesparse-(4.4.6)-cxsparse-(3.1.4)-openmp)
Original Reduced
Parameter blocks 45850 45850
Parameters 275100 275100
Residual blocks 96020 96020
Residuals 288060 288060
Minimizer TRUST_REGION
Sparse linear algebra library SUITE_SPARSE
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver SPARSE_SCHUR SPARSE_SCHUR
Threads 8 8
Linear solver threads 8 8
Linear solver ordering AUTOMATIC 4426,41424
Schur structure 3,6,6 d,d,d
Cost:
Initial 3.224206e+06
Final 2.560969e-13
Change 3.224206e+06
Minimizer iterations 8
Successful steps 8
Unsuccessful steps 0
Time (in seconds):
Preprocessor 0.7368
Residual evaluation 0.5212
Jacobian evaluation 375.9973
Linear solver 642.0986
Minimizer 1019.6177
Postprocessor 0.0095
Total 1020.3640
Termination: CONVERGENCE (Gradient tolerance reached. Gradient max norm: 9.021969e-11 <= 1.000000e-10)
but the GUI is frozen and I can't see anything.
@electro-logic yep, that's where I'm at right now. The final cost being low enough (2.560969e-13) and the test on handmade data suggest that it should be working correctly. I'm not sure why the GUI freezes, but I expect that if I got the optimisation working on the GPU, it would be easier to understand what's going on.
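For reference, here is a toy version of that kind of handmade-data check (not the actual test from the repo): shift points by a known offset and verify the solver recovers it with a near-zero final cost.

```cpp
#include <ceres/ceres.h>

// Toy sanity check: translate 1D points by a known offset and verify the
// solver recovers that offset, driving the cost toward zero.
struct PointResidual {
    PointResidual(double src, double dst) : src(src), dst(dst) {}
    template <typename T>
    bool operator()(const T* const t, T* r) const {
        r[0] = (T(src) + t[0]) - T(dst);  // translated source vs. target
        return true;
    }
    double src, dst;
};

int main() {
    const double truth = 0.5;  // known ground-truth translation
    double t = 0.0;            // unknown to be estimated
    ceres::Problem problem;
    for (double x = 0.0; x < 10.0; x += 1.0)
        problem.AddResidualBlock(
            new ceres::AutoDiffCostFunction<PointResidual, 1, 1>(
                new PointResidual(x, x + truth)),
            nullptr, &t);
    ceres::Solver::Summary summary;
    ceres::Solve(ceres::Solver::Options(), &problem, &summary);
    // Expect t ~= 0.5 and a final cost near zero, analogous to the
    // 2.560969e-13 final cost in the log above.
}
```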
I might try subsampling for the warp field initialisation tonight, since that's easy to do and it might be enough to give some decent results for warping.
Quick update on this - I tried uniformly subsampling for the warp field initialisation, and depending on the subsampling size it seems to work for a few (<5) frames before the optimisation fails due to loads of NaNs.
I'm going to work on porting this to the GPU, since right now it's too slow to visualise or debug properly.
I have investigated: the GUI freezes because all the work is done in a single thread. It would be better to do the work in a separate thread and keep the main thread free, so that it can process the message queue and stay responsive. This could involve using boost::thread or C++11 async features.
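For illustration, a minimal C++11 sketch of that idea (a hypothetical helper, not code from the repo; the frame step and event pump are passed in as callables so it compiles independently of the project's types):

```cpp
#include <chrono>
#include <future>

// Run a heavy per-frame step on a worker thread while the main thread keeps
// pumping the viewer's event loop, so the GUI stays responsive.
template <typename FrameStep, typename EventPump>
void run_frame_async(FrameStep step, EventPump pump_events)
{
    auto job = std::async(std::launch::async, step);
    while (job.wait_for(std::chrono::milliseconds(30)) !=
           std::future_status::ready)
        pump_events();  // e.g. viz.spinOnce(30); viz1.spinOnce(30);
    job.get();          // rethrow any exception from the worker
}
```

It could be called as run_frame_async([&]{ dynamic_fusion(depthdevice); }, [&]{ viz.spinOnce(30); viz1.spinOnce(30); }), assuming the viewers expose a cv::viz-style spinOnce.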
At this stage it is too early to do this; it's better to improve the algorithm first and handle this afterwards.
Maybe with the work on the GPU, the main thread can be free and responsive without additional work.
Anyway, a quick and dirty fix could be calling viz.spin() and viz1.spin() before the call to dynamic_fusion(depthdevice), to give the user a way to view the data; the user can then press Q to proceed to the next frame. This works for debugging purposes right now because every frame takes very long to compute.
Just out of curiosity: what OS / GPU / CUDA toolkit version are you using for development?
My setup is Ubuntu 16.04 / Nvidia GeForce 960M 4GB / CUDA 8.0. Yeah, I could spin up another thread, but it wouldn't do much, since all the user would be doing is looking at the same image / point cloud until a new frame is processed - maybe looking around the point cloud, but I can't see any obvious advantage in that at this point.
@mihaibujanca awesome project. I have also been working on the same thing for the past year. I am currently using Opt for the optimization. It improves the overall optimization time, but in my experience DynamicFusion-style subsampled graph-based optimization requires some more modification on top of Opt (currently I am doing full-mesh optimization). I also tried the VolumeDeform (https://arxiv.org/abs/1603.08161) optimization using Opt, but unfortunately that significantly increases the time for a TSDF grid of size 128x128x128. So in my opinion Opt can bring the time down to about 2 seconds per frame, but making it work in real time really requires hand-crafted Jacobian estimation.
@eshafeeqe Thanks a lot for the feedback!
I'm currently working on getting it to work with Opt, but in all fairness there are plenty of other things that need to be improved. I'm also working on getting this to be part of OpenCV, so eventually the code will need to run in real time. I thought about building VolumeDeform instead, but DynamicFusion seemed like an easier option to begin with, and VolumeDeform could be tried later.
Would love to chat about this and any tips or contributions would be more than welcome :).
I'm not that great with CUDA, so getting the optimisation to be fast is taking me a while.
I think that even if real time cannot be reached with current hardware, mainstream GPUs will improve in the future, so getting 0.5 fps today would already be a great achievement. It would still allow offline processing, where a 1-minute sequence taken at 30 fps (1800 frames) can be processed in 1 hour (1800 frames / 0.5 fps = 3600 s).
I'm a newcomer to 3D reconstruction and am also implementing my own ideas based on this project (what an awesome project, I must say). Out of curiosity, how is this project going now? Is the reconstructed video close to that in the DynamicFusion paper? Actually, I tried this project on Windows with VS2013 but get an error (bad_alloc) during the problem-solving step in WarpField::energy_data. I'm using Ceres with EigenSparse instead of SuiteSparse and am still trying to figure it out...
@KevinLee752 Thanks! Curious to know more about what you're thinking of doing.
I'm working on it a few hours every day and it's a serious priority for me - but at the same time I have other commitments and time is limited.
I am currently working on a version that uses https://github.com/niessner/Opt for the optimisation instead of Ceres, since Ceres is CPU-only. Seeing that a few people are taking an interest in this, I'll probably focus on getting the documentation up to date and making the project easier to build (there are some hardcoded paths and other issues I haven't had time to address yet).
In terms of where the project is, I'd say active development; I don't really have an ETA. The reconstruction is currently failing because of some issue in my warp field formulation, and I'm trying to write tests and solve it. It's probably still a few weeks away from being close to the paper's results (especially in terms of being a real-time system), but I'm hoping that by mid-November the reconstruction will look decent while running at a reasonable speed for offline processing.
@mihaibujanca Hi Mihai,
I recently ran this project, and the umbrella reconstruction result I obtained is very rough, not as smooth and detailed as in the paper. Is it because the CUDA reconstruction results are not as good as the CPU ones? Is the reconstruction result with Ceres smoother and more detailed?
Thank you!
@mihaibujanca The strange thing is that when I ran your ceres test fixed branch, I found that it runs faster (13s per frame), and you can watch the reconstruction go from rough to smooth in real time. Can you help me out? Thank you very much!
Hello Mihai,
I have tried your project, but I get a very badly reconstructed umbrella (bundled sample), very far from the DynamicFusion video of the paper. Is this project still a work in progress and not yet ready, or have I made some mistake?
Thank you