kallaballa / libnfporb

Implementation of a robust no-fit polygon generation in a C++ library using an orbiting approach
GNU General Public License v3.0
106 stars 32 forks

Investigate libnfporb performance #31

Closed intractabilis closed 1 year ago

intractabilis commented 2 years ago

The orbital approach is expected to be very efficient; at least that is the impression I got from reading the papers. However, in my experiments it is at least two orders of magnitude slower than CGAL's reduced convolution method. It would be nice to investigate whether the slowness is inherent to the algorithm or whether the implementation can be improved.
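
For context, the CGAL route I am comparing against computes the NFP of B around A as the Minkowski sum of A with B reflected through the origin. Here is a minimal sketch of that route (not my actual benchmark code; it assumes CGAL's 2D Minkowski Sums package and its reduced-convolution variant):

    // Minimal sketch, not the benchmark itself: NFP(A, B) as the Minkowski sum
    // of A and -B, where -B is B reflected through the origin (orientation is preserved).
    #include <CGAL/Exact_predicates_exact_constructions_kernel.h>
    #include <CGAL/Polygon_2.h>
    #include <CGAL/Polygon_with_holes_2.h>
    #include <CGAL/minkowski_sum_2.h>

    using K         = CGAL::Exact_predicates_exact_constructions_kernel;
    using Polygon   = CGAL::Polygon_2<K>;
    using PolygonWH = CGAL::Polygon_with_holes_2<K>;

    PolygonWH nfp_reduced_convolution(const Polygon& a, const Polygon& b) {
        Polygon minus_b;
        for (auto it = b.vertices_begin(); it != b.vertices_end(); ++it)
            minus_b.push_back(K::Point_2(-it->x(), -it->y()));
        // the outer boundary of the sum traces B orbiting around A
        return CGAL::minkowski_sum_by_reduced_convolution_2(a, minus_b);
    }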

I've sent the test data privately. It consists of 36 polygons numbered from 0 to 35. On the polygon pair (2, 18) libnfporb hangs indefinitely. The following pairs: (3, 14), (3, 16), (3, 17), (8, 17), (12, 14), (12, 16), (12, 17), (15, 20), (21, 24), (24, 25), and (28, 30) end up throwing an exception with a message "Unable to complete outer nfp loop: 1".

Here is the performance comparison on pairs where libnfporb is the slowest compared to CGAL:

Run on (16 X 3874.93 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.93, 0.71, 0.68
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4523011 ns      4520737 ns          155
benchmark_cgal/24/31    2876381 ns      2874814 ns          244
benchmark_cgal/24/26    1987337 ns      1986189 ns          352
benchmark_cgal/24/33    2306633 ns      2305401 ns          297
benchmark_cgal/23/25    1619837 ns      1619050 ns          430
benchmark_orb/24/27  5694047455 ns   5692760057 ns            1
benchmark_orb/24/31  3049576536 ns   3048878609 ns            1
benchmark_orb/24/26  1877847975 ns   1877413520 ns            1
benchmark_orb/24/33  2259959241 ns   2259457033 ns            1
benchmark_orb/23/25  1518002236 ns   1517642500 ns            1
kallaballa commented 2 years ago

It's due to several defects in this much-cited paper. For example, they propose to use the lowest point of polygon B as a reference point... but what if there are two or more equally low points? And that's just one example that breaks their algorithm. I also think I know why they were convinced they had a good solution while it breaks on so many of the tests I created: they used randomly generated data for testing, and randomly generated data has an extremely low chance of forming parallel edges between polygon A and B. That is what it comes down to (the first example is such a case, too): parallel edges. They just didn't consider them. Anyway, libnfporb is a collection of workarounds that make the orbiting approach work roughly as proposed. The bad news: that makes libnfporb slow. The good news: it all comes down to one function (trimVector) that consumes virtually all the CPU time in my tests. But I haven't checked your results yet.
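
To illustrate just the reference-point problem: "the lowest point of B" is not unique when several vertices share the minimal y-coordinate, so an implementation has to impose its own tie-break. A tiny sketch (illustrative only, not libnfporb's actual code):

    // Illustrative sketch, not libnfporb's code: picking "the lowest point of B"
    // needs a tie-break the paper never specifies, e.g. smallest x among all
    // vertices with the minimal y.
    #include <algorithm>
    #include <vector>

    struct Point { double x, y; };

    Point lowest_reference(const std::vector<Point>& ring) {
        return *std::min_element(ring.begin(), ring.end(),
            [](const Point& a, const Point& b) {
                if (a.y != b.y) return a.y < b.y; // primary criterion: lowest y
                return a.x < b.x;                 // tie-break left open by the paper
            });
    }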

kallaballa commented 2 years ago

Out of curiosity, why are you looking for an NFP implementation? Can you talk about it?

kallaballa commented 2 years ago

The orbital approach is expected to be very efficient; at least that is the impression I got from reading the papers. However, in my experiments it is at least two orders of magnitude slower than CGAL's reduced convolution method. [...]

Excellent work. I get the same results, and it makes sense, given that libnfporb is stable but, as I said, has many workarounds for the shortcomings of the paper mentioned above.

intractabilis commented 2 years ago

6 seconds still sounds a bit excessive. I ran a microarchitecture benchmark. Your code issues 259,000,000 floating-point instructions for polygons 24 and 27. That many FLOPs is enough to calculate the FFT of a 512x512 image 16 times.

kallaballa commented 2 years ago

I don't know what to say except that I am glad I could get it to work despite the defects in the paper. Anyway, I am very interested in making it faster, but I am not aware of any low-hanging fruit. As simple as it seems to orbit one polygon around another (well, there are also holes, interlocks, etc.), it is really, really hard to get right. :) Btw, the majority of the CPU time is spent in boost::geometry functions (intersects, overlaps, etc.).
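
The hot spots are generic predicates of roughly this shape; a trimmed-down illustration (not the library's actual hot path), assuming double coordinates:

    // Trimmed-down illustration: the orbiting code leans on generic
    // boost::geometry predicates like this one for many candidate edge pairs.
    #include <boost/geometry.hpp>
    #include <boost/geometry/geometries/point_xy.hpp>
    #include <boost/geometry/geometries/segment.hpp>

    namespace bg = boost::geometry;
    using pt  = bg::model::d2::point_xy<double>;
    using seg = bg::model::segment<pt>;

    bool edges_intersect(const seg& a, const seg& b) {
        return bg::intersects(a, b); // one of the calls dominating the profile
    }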

intractabilis commented 2 years ago

Have you tried contacting the authors for comments? Emails are in the paper. Maybe they have some secret sauce.

kallaballa commented 2 years ago

Yep. No answer.

intractabilis commented 2 years ago

Well, I have to thank you for the great amount of work you've done to show this algorithm is impractical. The authors' silence probably means they know very well about the problems with their paper. If not for your effort, I would most likely have wasted my time implementing it.

kallaballa commented 2 years ago

In my opinion there is no serious science without open source; especially in computer science, that would mean no more bullshitting. I know that there are factors in play, like software patents, that make the matter more complicated than that. But what is more important: software patents or a trustworthy science? Anyway, I'm glad that my effort is helpful. Also, I was rewarded with two citations, in a bachelor's and in a master's thesis that used libnfporb. And if that isn't ironic enough, I don't even have an A-level. Thanks for your understanding.

kallaballa commented 2 years ago

I get the same results, and it makes sense, given that libnfporb is stable but, as I said, has many workarounds for the shortcomings of the paper mentioned above.

Uhm. I just wanted to profile 24/27 using perf (via the "profile" build target of the libnfporb Makefile), but first I tested with the release build and realized that it only took 2 s for 1 iteration, which is remarkable since my machine is way less powerful than yours (as you'll see in the following report). Anyway, when I run your benchmark suite I get the following result:

Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.43, 0.51, 0.36
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27   61367284 ns     61366863 ns           10
benchmark_cgal/24/31   35310071 ns     35307597 ns           20
benchmark_cgal/24/26   24239106 ns     24235630 ns           29
benchmark_cgal/24/33   30250984 ns     30250819 ns           23
benchmark_cgal/23/25   19487795 ns     19486422 ns           36
benchmark_orb/24/27  49712943159 ns   49711742870 ns            1
benchmark_orb/24/31  31513661714 ns   31510963354 ns            1
benchmark_orb/24/26  18847734019 ns   18847256122 ns            1
benchmark_orb/24/33  22661594322 ns   22660991906 ns            1
benchmark_orb/23/25  15026137744 ns   15025408670 ns            1

49 s for 1 iteration of 24/27. At the moment I am looking through the build system of the benchmark suite, and I found the following compile command:

c++ -Infp-orb-perf-check.p -I. -I.. -I../include -I/usr/local/include -I/usr/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=gnu++20 -O0 -g -DFMT_LOCALE -DFMT_SHARED -DCGAL_USE_GMPXX=1 -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -DDEBUG -march=native -mtune=native -Wno-deprecated-enum-enum-conversion -ffunction-sections -fdata-sections -fPIC -MD -MQ nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o -MF nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o.d -o nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o -c ../nfp-orb-perf-check.cpp

This made me realize that the default build type is debug. So I did a release build, with the following result:

2022-06-05T10:49:20+02:00
Running build/nfp-orb-perf-check
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.42, 0.52, 0.47
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4891004 ns      4891003 ns          136
benchmark_cgal/24/31    3130488 ns      3128706 ns          225
benchmark_cgal/24/26    2172407 ns      2172363 ns          322
benchmark_cgal/24/33    2533278 ns      2533082 ns          276
benchmark_cgal/23/25    1806403 ns      1806375 ns          355
benchmark_orb/24/27  3823282100 ns   3823111325 ns            1
benchmark_orb/24/31  2050453801 ns   2050392766 ns            1
benchmark_orb/24/26  1274274826 ns   1274209477 ns            1
benchmark_orb/24/33  1539979505 ns   1539962289 ns            1
benchmark_orb/23/25  1037740713 ns   1037673723 ns            1

which is still slower than the 2 s I saw earlier.

So I built the benchmark program by hand:

g++ -s -pthread -fno-strict-aliasing -L/usr/local/lib -I/usr/local/include -lboost_filesystem -lboost_system -lbenchmark -lgmp -lmpfr -std=c++20 -march=native -I../../src -DNDEBUG -g0 -Ofast -flto -o nfp-orb-perf-check ../nfp-orb-perf-check.cpp
and got the following result:
2022-06-05T11:35:43+02:00
Running ./build/nfp-orb-perf-check
Run on (8 X 1601.28 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.59, 0.59, 0.60
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4862135 ns      4862111 ns          145
benchmark_cgal/24/31    3097166 ns      3097099 ns          226
benchmark_cgal/24/26    2139914 ns      2139891 ns          327
benchmark_cgal/24/33    2501373 ns      2501171 ns          279
benchmark_cgal/23/25    1745026 ns      1745017 ns          400
benchmark_orb/24/27  2869975098 ns   2869954309 ns            1
benchmark_orb/24/31  1600250962 ns   1600240966 ns            1
benchmark_orb/24/26  1104291322 ns   1104281761 ns            1
benchmark_orb/24/33  1339996282 ns   1339987384 ns            1
benchmark_orb/23/25   895057826 ns    895049078 ns            1

Better, but still not as fast as using libnfporb/examples/nfp. Anyway, I can accept that difference for the moment because it isn't that much. Could you try doing a manual build of the latest libnfporb revision like I did and redo the benchmark?

In the meantime I'll go back to the code and see if I can squeeze out an order of magnitude or two :)

kallaballa commented 2 years ago

For the record, I am using "g++ (SUSE Linux) 12.1.0".

kallaballa commented 2 years ago

I found an unnecessary call and now look at that (current HEAD):

2022-06-05T12:37:51+02:00
Running ./build/nfp-orb-perf-check
Run on (8 X 1427.49 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.52, 0.60, 0.58
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4783350 ns      4780879 ns          148
benchmark_cgal/24/31    3035116 ns      3035073 ns          231
benchmark_cgal/24/26    2101172 ns      2101139 ns          331
benchmark_cgal/24/33    2455099 ns      2454937 ns          286
benchmark_cgal/23/25    1717539 ns      1717502 ns          407
benchmark_orb/24/27   445997757 ns    445992804 ns            2
benchmark_orb/24/31   267225518 ns    267222110 ns            3
benchmark_orb/24/26   195266817 ns    195265388 ns            4
benchmark_orb/24/33   237397545 ns    237396646 ns            3
benchmark_orb/23/25   165946719 ns    165945385 ns            4
kallaballa commented 2 years ago

Also, I looked into what the Minkowski-sum approach you are using is actually doing and found out that the benchmark isn't fair at all, because it can't handle perfect fits, antennas and interlocks, which libnfporb does.

https://stackoverflow.com/questions/52350747/no-fit-polygon-problem-with-cgal-minkowski-sum-2

intractabilis commented 2 years ago

Also, I looked into what the Minkowski-sum approach you are using is actually doing and found out that the benchmark isn't fair at all, because it can't handle perfect fits, antennas and interlocks, which libnfporb does.

https://stackoverflow.com/questions/52350747/no-fit-polygon-problem-with-cgal-minkowski-sum-2

libnfporb doesn't work on the example mentioned in that post:

    // single-header include, as referenced by the assertion below
    #include "libnfporb.hpp"

    auto Aorb = libnfporb::polygon_t{
        {{0, 0}, {10, 0}, {10, 10}, {8, 10}, {8, 2}, {2, 2}, {2, 10}, {0, 10}}, {}
    };
    auto Borb = libnfporb::polygon_t{
        {{0, 0}, {6, 0}, {6, 6}, {4, 6}, {4, 2}, {2, 2}, {2, 6}, {0, 6}}, {}
    };
    auto Corb = libnfporb::generate_nfp(Aorb, Borb, true);

It stops with

../../include/libnfporb.hpp:78: void libnfporb::remove_co_linear(boost::geometry::model::polygon<point_t, false, true>::ring_type&): Assertion `r.size() > 2' failed

Even if you fix this, it doesn't work on examples that are not at all exotic. It fails (probably intermittently) on at least 12 of the polygon pairs I sent you. See my initial post.

intractabilis commented 2 years ago

I found an unnecessary call and now look at that (current HEAD): [...]

It's still a 2-orders-of-magnitude difference. However, it's an impressive 10× gain. Maybe 2 more unnecessary calls and we will be on par with CGAL? :)

intractabilis commented 2 years ago

So I built the benchmark program by hand:

I think the difference is because you are using long double, and my compilation asks the compiler to use SIMD. long double not only prevents using advanced CPU features, but also incurs extra work when the rest of the code uses SIMD.

kallaballa commented 2 years ago

Even if you fix this, it doesn't work on examples that are not at all exotic. It fails (probably intermittently) on at least 12 of the polygon pairs I sent you. See my initial post.

Let's see what I can do.

kallaballa commented 2 years ago

g++ -s -pthread -fno-strict-aliasing -L/usr/local/lib -I/usr/local/include -lboost_filesystem -lboost_system -lbenchmark -lgmp -lmpfr -std=c++20 -march=native -I../../src -DNDEBUG -g0 -Ofast -flto -o nfp-orb-perf-check ../nfp-orb-perf-check.cpp

You are right, I forgot to tune. Now I tried with -mtune=native and -mtune=intel; native made it slower, while intel made no difference.

intractabilis commented 2 years ago

You are right, I forgot to tune. Now I tried with -mtune=native and -mtune=intel; native made it slower, while intel made no difference.

mtune should be irrelevant because it doesn't change the instruction set.

kallaballa commented 2 years ago

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

intractabilis commented 2 years ago

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

Ok, then I am wrong about long double.

YggSky commented 2 years ago

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

Ok, then I am wrong about long double.

Have you tested with Boost NFP or Clipper? Clipper is faster than CGAL or Boost (currently the best one is 2-3 times faster than CGAL).

intractabilis commented 2 years ago

Have you tested with Boost NFP or Clipper? Clipper is faster than CGAL or Boost (currently the best one is 2-3 times faster than CGAL).

As far as I know, clipper doesn't work with non-convex shapes. Boost doesn't have NFP.

YggSky commented 2 years ago

Have you tested with Boost NFP or Clipper? Clipper is faster than CGAL or Boost (currently the best one is 2-3 times faster than CGAL).

As far as I know, clipper doesn't work with non-convex shapes. Boost doesn't have NFP.

I'm sure you haven't used it… Clipper can work with non-convex shapes, because it does a convex decomposition and then merges all the decomposed convex pieces with boolean operations to get the non-convex NFP.

Just test it and you will see, or you can look at the source code, which clearly answers your question of why it can deal with non-convex shapes.

(It feels like you are a newcomer when it comes to NFP.)
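
Roughly, the route I mean looks like the sketch below: decompose both polygons into convex pieces, take the pairwise convex Minkowski sums, and merge everything with boolean unions (hypothetical helper names, not Clipper's actual API):

    // Sketch of the decomposition route; the three helpers are assumed
    // stand-ins (convex decomposition, convex Minkowski sum, boolean union),
    // not real Clipper functions.
    #include <vector>

    struct Pt { double x, y; };
    using Ring = std::vector<Pt>;

    std::vector<Ring> decompose_convex(const Ring& poly);
    Ring minkowski_sum_convex(const Ring& a, const Ring& b);
    std::vector<Ring> union_all(const std::vector<Ring>& pieces);

    std::vector<Ring> nfp_by_decomposition(const Ring& a, const Ring& minus_b) {
        std::vector<Ring> partial;
        for (const auto& ca : decompose_convex(a))
            for (const auto& cb : decompose_convex(minus_b))
                partial.push_back(minkowski_sum_convex(ca, cb));
        return union_all(partial); // the merged outline is the non-convex NFP
    }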

intractabilis commented 2 years ago

Decomposition cannot be faster than reduced convolution, unless it's something very trivial. Show comparative benchmarks on non-convex polygons with a hundred or so vertices.

YggSky commented 2 years ago

Decomposition cannot be faster than reduced convolution, unless it's something very trivial. Show comparative benchmarks on non-convex polygons with a hundred or so vertices.

But it really is faster than CGAL. Probably I don't understand its actual theory, or it just deals with non-convex polygons more efficiently than CGAL.

intractabilis commented 2 years ago

Friend, I am ready to believe you, but just post your benchmarks. In my understanding, decomposition cannot be faster than reduced convolution, except in trivial cases. Present your benchmarks for non-convex polygons with a hundred or so vertices.

YggSky commented 2 years ago

Friend, I am ready to believe you, but just post your benchmarks. In my understanding, decomposition cannot be faster than reduced convolution, except in trivial cases. Present your benchmarks for non-convex polygons with a hundred or so vertices.

Could you give me your test data? Uploading it to git would be better.

intractabilis commented 2 years ago

Absolutely, I can even give you a benchmark program. Unfortunately, I cannot publish the data because of legal issues. Do you have an email?

kallaballa commented 2 years ago

Btw, I will be back for this issue; I just have other things in the pipeline. :)

YggSky commented 2 years ago

Absolutely, I can even give you a benchmark program. Unfortunately, I cannot publish the data because of legal issues. Do you have an email?

OK, could you give me the benchmark program?

intractabilis commented 2 years ago

I have an idea. I'm going to make some modifications so it reads the polygons usually used in papers on NFP, and I will publish everything here. Hang on.

intractabilis commented 2 years ago

Here: nfp-orb-perf-check.tar.gz

YggSky commented 2 years ago

Here: nfp-orb-perf-check.tar.gz

thanks

intractabilis commented 2 years ago

I created my own implementation of reduced convolution; it is 6 times faster than CGAL's.

intractabilis commented 2 years ago

I managed to make my implementation 9 times faster than CGAL.

YggSky commented 1 year ago

I managed to make my implementation 9 times faster than CGAL.

Sure. I tested with bird.poly (the data you sent me, 275 points); it takes 2 ms to finish the calculation. I think that is too slow; I plan to reduce it to 0.2 ms with reduced convolution.

kallaballa commented 1 year ago

There are far better alternatives out there by now; that is why I'm going to archive this repository.