Closed by simon-king-jena 9 years ago
There are various possible more or less obvious data formats:

1. char* or unsigned int*, hence, a list of integers, each integer corresponding to one edge of the quiver. char* would bound the total number of edges of the quiver to 256 (which might be too small), whereas unsigned int* might consume too much memory.
2. char*, each integer corresponding to an outgoing edge of a vertex. Hence, the total number of edges may be arbitrary; we only have a local bound of 128 on the number of outgoing edges---this hopefully is enough (it certainly is enough for me).
3. A compressed version of 1., each edge stored in the minimal number of bits.
4. A compressed version of 2., each outgoing edge stored in the minimal number of bits.

In 2. and 4. (and probably in 1. and 3. as well) we should explicitly store the start and end point of the path. The length should probably be stored as well.
In my applications, the following operations on paths need to be fast: concatenation, hash/comparison, testing whether a path p starts with a path q, and (less important) whether a path p contains a path q.
Concatenation is particularly easy in 1. and 2. (just concatenation of lists, which is achieved by memcpy). It is a bit more difficult in 3. and 4., where the bits of the second factor need to be shifted so that the compressed data fit together seamlessly.
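The shifting can be illustrated with a toy bit-packing sketch in pure Python (5 bits per edge is an arbitrary choice here, and pack/concat are hypothetical helpers, not the attachment's code):

```python
BITS = 5  # bits per edge; arbitrary choice (enough for fewer than 32 labels)

def pack(edges):
    """Pack a list of small integers into (bit-data, length), BITS bits each."""
    data = 0
    for e in edges:
        data = (data << BITS) | e
    return data, len(edges)

def concat(p, q):
    """Concatenate packed sequences: the first factor's bits must be
    shifted left past all of the second factor's bits."""
    (pd, pl), (qd, ql) = p, q
    return (pd << (ql * BITS)) | qd, pl + ql

assert concat(pack([1, 2, 3]), pack([4, 5])) == pack([1, 2, 3, 4, 5])
```

Note that the length is carried along separately: it disambiguates leading zero edges, which is one reason for storing the length explicitly.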
Hash and comparison are easy in all four formats, but of course best speed would be obtained in 4., simply since the amount of data that needs to be processed is smaller than in 1.-3.
In all four proposed data formats, it is easy to test whether path p starts with path q. Again, having the data stored in as little memory as possible will speed things up.
It is easy in 1. to test whether path p contains path q (as a subpath that is not necessarily an initial segment), whereas this becomes a bit more difficult in 2.-4.
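To see why containment is easiest in format 1: there a path is literally a string of bytes, and subpath search reduces to substring search (a toy sketch, not the proposed implementation):

```python
# Toy sketch of format 1: a path is a plain byte string,
# one byte per (globally numbered) edge of the quiver.
p = bytes([3, 1, 4, 1, 5, 9])   # path along edges 3, 1, 4, 1, 5, 9
q = bytes([4, 1, 5])            # candidate subpath

# startswith: compare the first len(q) bytes.
assert p.startswith(bytes([3, 1]))

# containment (subpath anywhere): substring search via find().
assert p.find(q) == 2
```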
What format would you suggest? Perhaps it makes sense to have some initial implementations of all four, and then compare directly.
I'd say:

- Some implementation without limitations (the one from the previous ticket I guess)
- Whichever of 1-4 that covers your needs (256 edges for Gröbner bases calculations this should be already quite big, right?)
- As much as possible of the code shared between all implementations to make it easy to add new ones if deemed useful.
But that's just my 2 cents ... I don't have specific experience!
Cheers,
Hi Nicolas,
Replying to @nthiery:
- Some implementation without limitations (the one from the previous ticket I guess)
An implementation which represents edges by unsigned long long can be considered "unlimited".
- Whichever of 1-4 that covers your needs (256 edges for Gröbner bases calculations this should be already quite big, right?)
Yes, but I want a general code.
- As much as possible of the code shared between all implementations to make it easy to add new ones if deemed useful.
Actually, on the level of paths, the implementations come in pairs: The compressed representations differ from the non-compressed---at the end of the day, we just have different ways of representing sequences of integers. Only when it comes to interpreting a sequence of integers as a path in a quiver will we have further differences: Does "[1,2]" mean "start with arrow 1 of the quiver and continue with arrow 2 of the quiver", or does it mean "travel along the outgoing arrow 1 of the vertex we are currently located at, and then continue with outgoing arrow 2 of the vertex we just arrived at".
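The two interpretations can be made concrete on a toy quiver (a pure-Python sketch; all names hypothetical):

```python
# A toy quiver on vertices 0, 1, 2.
arrows = [(0, 1), (0, 2), (1, 2), (1, 0)]   # global numbering of arrows
out = {0: [0, 1], 1: [2, 3], 2: []}         # outgoing arrows per vertex

def decode_global(seq):
    """[i, j] = 'arrow i of the quiver, then arrow j of the quiver'."""
    return [arrows[i] for i in seq]

def decode_local(start, seq):
    """[i, j] = 'outgoing arrow i at the current vertex, then
    outgoing arrow j at the vertex we just arrived at'."""
    path, v = [], start
    for i in seq:
        a = arrows[out[v][i]]
        path.append(a)
        v = a[1]                            # move to the arrow's endpoint
    return path

# The same path (0,1),(1,2) has different encodings in the two schemes:
assert decode_global([0, 2]) == decode_local(0, [0, 0]) == [(0, 1), (1, 2)]
```

The local scheme needs smaller numbers (hence better compression), but a given integer sequence only makes sense relative to a start vertex.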
Currently, I prefer a compressed representation by unsigned long long*. Rationale:

- The total number of edges of the quiver certainly fits into an unsigned long long (a larger quiver won't fit into the computer's memory anyway).
- Several edges can be packed into a single unsigned long long. That's quite efficient, both memory- and speed-wise.

A bit later I will post experimental code (not yet talking about quiver paths but about sequences of integers) giving some evidence on efficiency. And then we can still see how we interpret the integer sequences: Having a global or a local list of arrows?
I have attached a toy (or proof-of-concept) implementation; see attachment: path_test.pyx. In a real implementation of the compressed representation, it would be the job of a PathSemigroup to find out how many edges fit into one word. And of course, in the toy implementation I am not trying to translate an integer sequence into the string representation of an element of the path semigroup.
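The computation in question is a one-liner; a sketch (assuming 64-bit words; edges_per_word is a hypothetical helper, not part of the attachment):

```python
def edges_per_word(n_edges, word_bits=64):
    """Edges per machine word when each edge label in range(n_edges)
    is stored in the minimal number of bits."""
    bits = max(1, (n_edges - 1).bit_length())  # bits needed per edge label
    return word_bits // bits, bits

# 27 edge labels need 5 bits, so 12 edges fit into one 64-bit word.
assert edges_per_word(27) == (12, 5)
assert edges_per_word(256) == (8, 8)
```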
What implementation would you suggest to use in "reality"? Is there another implementation of "sequences of bounded integers" that you would recommend? And would you enumerate the arrows of a quiver locally (i.e., only number the outgoing edges at each vertex) or globally? The former will result in better compression, but would make it more difficult to test for paths p,q whether there are paths r,s with p==r*q*s.
Now for the timings, and sorry for the long post...
Here, I am timing concatenation, hash, comparison, and comparison of initial segments. In all examples, we use the following paths of various sizes.
p5 = Path(ZZ, 5, range(5), 1, 1)
p25 = Path(ZZ, 25, range(25), 1, 1)
p50 = Path(ZZ, 50, range(50), 1, 1)
p100 = Path(ZZ, 100, range(100), 1, 1)
p1000 = Path(ZZ, 1000, range(1000), 1, 1)
q5 = Path(ZZ, 5, range(1,6), 1, 1)
q25 = Path(ZZ, 25, range(1,26), 1, 1)
q50 = Path(ZZ, 50, range(1,51), 1, 1)
q100 = Path(ZZ, 100, range(1,101), 1, 1)
q1000 = Path(ZZ, 1000, range(1,1001), 1, 1)
r5 = Path(ZZ, 5, range(1,5)+[0], 1, 1)
r25 = Path(ZZ, 25, range(1,25)+[0], 1, 1)
r50 = Path(ZZ, 50, range(1,50)+[0], 1, 1)
r100 = Path(ZZ, 100, range(1,100)+[0], 1, 1)
r1000 = Path(ZZ, 1000, range(1,1000)+[0], 1, 1)
s5 = Path(ZZ, 5, range(5), 1, 1)
s25 = Path(ZZ, 25, range(25), 1, 1)
s50 = Path(ZZ, 50, range(50), 1, 1)
s100 = Path(ZZ, 100, range(100), 1, 1)
s1000 = Path(ZZ, 1000, range(1000), 1, 1)
We are doing the following tests.
Concatenation:
%timeit p5*q5
%timeit p5*q25
%timeit p25*q25
%timeit p25*q50
%timeit p50*q50
%timeit p50*q100
%timeit p100*q100
%timeit p100*q1000
%timeit p1000*q1000
Hash:
%timeit hash(p5)
%timeit hash(p25)
%timeit hash(p50)
%timeit hash(p100)
%timeit hash(p1000)
Comparison:
%timeit p5==s5
%timeit p25==s25
%timeit p50==s50
%timeit p100==s100
%timeit p1000==s1000
%timeit p5==r5
%timeit p25==r25
%timeit p50==r50
%timeit p100==r100
%timeit p1000==r1000
%timeit q5==r5
%timeit q25==r25
%timeit q50==r50
%timeit q100==r100
%timeit q1000==r1000
Startswith:
%timeit q1000.startswith(q100) # it does start with
%timeit q1000.startswith(r100) # it doesn't start with
Uncompressed integer lists
Here, we define Path=Path_v1 (see attachment: path_test.pyx).
Using ctypedef unsigned long block
sage: %timeit p5*q5
1000000 loops, best of 3: 739 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 801 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 810 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 885 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 959 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.02 us per loop
sage: %timeit p100*q1000
100000 loops, best of 3: 1.99 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 3.06 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 135 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 166 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 214 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 333 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.09 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.63 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 5.91 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 11 us per loop
sage: %timeit p100==s100
10000 loops, best of 3: 23 us per loop
sage: %timeit p1000==s1000
1000 loops, best of 3: 212 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 634 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 659 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 659 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 686 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 698 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.65 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 5.83 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 11.5 us per loop
sage: %timeit q100==r100
10000 loops, best of 3: 22 us per loop
sage: %timeit q1000==r1000
1000 loops, best of 3: 213 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 325 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 311 ns per loop
Using ctypedef unsigned short block
sage: %timeit p5*q5
1000000 loops, best of 3: 700 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 738 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 820 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 808 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 877 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 909 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.49 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 2.13 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 138 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 179 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 236 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 383 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.61 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.08 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 3.42 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 6.53 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 14.9 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 137 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 613 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 591 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 602 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 603 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 635 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.15 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.63 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.45 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.8 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 140 us per loop
sage: q1000.startswith(q100)
True
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 336 ns per loop
sage: q1000.startswith(r100)
False
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 325 ns per loop
Using ctypedef unsigned char block
sage: %timeit p5*q5
1000000 loops, best of 3: 679 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 725 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 760 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 772 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 761 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 790 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.1 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.31 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 133 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 176 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 234 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 374 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.53 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.08 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 3.36 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 6.3 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 12.5 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 121 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 581 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 589 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 636 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.06 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.46 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.51 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.7 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 121 us per loop
sage: q1000.startswith(q100)
True
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 324 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 315 ns per loop
Compressed integer lists
Here, we define Path=Path_v2 (see attachment: path_test.pyx), and do
sage: SizeManager.set_edge_number(27)
so that there is a total of only 27 edges (resp. 27 is the maximal number of outgoing arrows at a vertex). In particular, all integers in the sequences below are taken "mod 27". Since 27 needs 5 bits, a compressed representation using unsigned char* won't make sense. Hence, I am only testing long and short.
Using ctypedef unsigned long block
sage: %timeit p5*q5
1000000 loops, best of 3: 676 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 671 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 683 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 725 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 759 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 787 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 811 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.44 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.61 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 132 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 134 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 142 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 155 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 466 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 710 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.71 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.6 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.4 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 38 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 692 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 715 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 724 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 729 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 752 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 706 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.69 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.59 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 4.4 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 37.6 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 272 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 257 ns per loop
Using ctypedef unsigned short block
sage: %timeit p5*q5
1000000 loops, best of 3: 718 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 682 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 710 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 740 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 722 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 781 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 858 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.98 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 2.17 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 132 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 168 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 206 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 951 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 723 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.77 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.93 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 5.32 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 47.5 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 577 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 586 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 588 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 607 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 754 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.87 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.89 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 5.28 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 47.3 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 272 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 256 ns per loop
It turns out that there are bugs in the toy code. But that's no surprise...
I have fixed the problems in attachment: path_test.pyx.
It was:
The latter involves an additional bitwise "and". So, probably the timings will slightly deteriorate.
There is one other variation that I want to test.
Currently, if the number of edges fits into 5 bits, then a 64 bit word is filled with data of 12 edges, plus 4 bits of garbage. The garbage must always be zero bits, for otherwise the shift operations would pollute real data with the garbage bits. To keep the garbage zero involves additional operations.
Alternatively, one could try to put fewer arrows into one word, so that there are no garbage bits. This can be done by encoding each arrow by a number of bits that is a power of 2. Hence, in the setting above, one would encode each edge by 8 rather than 5 bits, fitting 8 arrows into one 64-bit word. It is less dense; however, the code gets simpler and should have less overhead.
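In concrete numbers, the trade-off looks like this (a sketch assuming 64-bit words; the helper name is hypothetical):

```python
def chunk_bits_pow2(bits):
    """Round a per-edge bit count up to the next power of 2."""
    size = 1
    while size < bits:
        size *= 2
    return size

bits = 5                                # 27 edge labels need 5 bits
dense = 64 // bits                      # 12 edges per word, 4 garbage bits
padded = 64 // chunk_bits_pow2(bits)    # 8 edges per word, no garbage
assert (dense, padded) == (12, 8)
assert 64 % chunk_bits_pow2(bits) == 0  # power-of-2 chunks tile the word exactly
```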
I have replaced attachment: path_test.pyx again, adding two more implementations.
Do you have further suggestions on storage of lists of bounded integers? If you haven't, then I'll try to assess the timings stated in the various comments, according to what I expect to be useful in my applications.
Here are details on the two new implementations:
In Path_v3, I use Sage's Integer class as a bit list. Sounds crazy, but it works rather well. I assume that bitshifts and comparison (and that's what we are using here) are internally written in assembler, and for us, the additional benefit is that we do not need to take care of the size of the blocks (8 bit? 32 bit? 64 bit? 128 bit?) in which we store stuff---Integer does it for us. Perhaps the timings could be improved further by using the underlying C library (GMP's mpz_t) directly.
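The idea can be sketched with a plain Python int standing in for Sage's Integer (hypothetical helpers; 5-bit chunks assumed):

```python
BITS = 5  # assumed bits per edge

def append(data, edge):
    """Append one edge to the bit list stored in an arbitrary-precision int."""
    return (data << BITS) | edge

def last(data):
    """Read the most recently appended edge."""
    return data & ((1 << BITS) - 1)     # mask out the lowest chunk

d = 1                                   # a leading 1 marks the sequence start
for e in (3, 1, 4):
    d = append(d, e)
assert last(d) == 4
assert (d >> BITS) == append(append(1, 3), 1)   # dropping the last edge
```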
In Path_v4, I provide a slightly simplified version of Path_v2: The data is compressed, but in such a way that the chunks used to store one arrow fit seamlessly into a memory block. Hence, when we use 64-bit blocks and have 27 arrows, Path_v2 would store the arrows in chunks of 5 bits (hence, it can store 12 arrows in one memory block, leaving 4 bits empty), whereas Path_v4 would store the arrows in chunks of 8 bits (hence, it can only store 8 arrows in one memory block, but since these fit neatly into one memory block, the code can be simplified).
The timings for Path_v3:
sage: %timeit p5*q5
1000000 loops, best of 3: 645 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 631 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 635 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 1.23 us per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 1.82 us per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 1.83 us per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.37 us per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.82 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.78 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 164 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 197 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 210 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 274 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 1.94 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 667 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 638 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 658 ns per loop
sage: %timeit p1000==s1000
1000000 loops, best of 3: 817 ns per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 746 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 723 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 731 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 733 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 710 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 733 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 718 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 719 ns per loop
sage: %timeit q1000==r1000
1000000 loops, best of 3: 755 ns per loop
sage: %timeit q1000.startswith(q100)
100000 loops, best of 3: 2.46 us per loop
sage: %timeit q1000.startswith(r100)
100000 loops, best of 3: 2.45 us per loop
The timings for Path_v4:
sage: %timeit p5*q5
1000000 loops, best of 3: 730 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 762 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 806 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 849 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 854 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 931 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 930 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.97 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.45 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 136 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 159 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 182 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 600 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 700 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.65 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.48 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.21 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 36.4 us per loop
sage: %timeit p5==r5
100000 loops, best of 3: 706 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 764 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 740 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 742 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 822 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 693 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.6 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.54 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 4.27 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 37.1 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 255 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 243 ns per loop
Hmmm. Perhaps I should have a look at sage/misc/bitset.pxi.
I couldn't directly use sage/misc/bitset, but I could learn from it: I have improved the Path_v4 implementation (here, the arrows are encoded by chunks of memory whose size is a power of 2). This was possible by using memcmp in some places (not in the method startswith(), where a loop over the data is faster than memcmp) and, in particular, by using bitshift operations to replace multiplications and integer divisions (this is only possible with Path_v4, since here the factors are powers of 2 and can thus correspond to shifts).
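A generic sketch of the kind of replacement meant here (not the attachment's actual code): with 8-bit chunks in 64-bit blocks, locating arrow i needs a multiplication, a division and a modulus, all by powers of 2, so they become shifts and masks:

```python
CHUNK = 8  # bits per arrow in Path_v4-style storage (a power of 2)

def locate(i):
    """Word index and bit offset of arrow i, via multiply/divide/modulus..."""
    return (i * CHUNK) // 64, (i * CHUNK) % 64

def locate_shift(i):
    """...and the same via shifts and masks: *8 is <<3, //64 is >>6, %64 is &63."""
    bit = i << 3
    return bit >> 6, bit & 63

assert all(locate(i) == locate_shift(i) for i in range(1000))
```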
New timings with Path=Path_v4 and SizeManager.set_edge_number(27):
sage: %timeit p5*q5
1000000 loops, best of 3: 684 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 726 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 798 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 862 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 827 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 924 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 915 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.91 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.42 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 158 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 171 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 194 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 594 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 498 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 547 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 631 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 712 ns per loop
sage: %timeit p1000==s1000
100000 loops, best of 3: 2.42 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 499 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 514 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 549 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 526 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 545 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 506 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 544 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 589 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 704 ns per loop
sage: %timeit q1000==r1000
100000 loops, best of 3: 2.48 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 263 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 232 ns per loop
PS: I chose 27 edges, since this should be particularly bad for Path_v4. Indeed, in Path_v2, 12 arrows can be put into one uint64_t, whereas in Path_v4 only 8 arrows fit into one uint64_t. So, if Path_v4 turns out to be faster than Path_v2 with 27 edges, then it is likely to be generally faster, I think.
From my side, this is the last part of the proof-of-concept implementations, before assessing the benchmark tests and choosing an implementation for the elements of path semigroups...
In Path_v5, I am using the GMP library directly, thus avoiding some overhead compared with Path_v3. The timings I get (with Path=Path_v5 and SizeManager.set_edge_number(27)):
sage: %timeit p5*q5
1000000 loops, best of 3: 720 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 814 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 823 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 808 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 828 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 890 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 866 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.18 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.38 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 159 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 171 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 198 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 266 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 1.93 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 466 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 509 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 495 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 485 ns per loop
sage: %timeit p1000==s1000
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 451 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 468 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 463 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 468 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 482 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 448 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 462 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 448 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 457 ns per loop
sage: %timeit q1000==r1000
1000000 loops, best of 3: 470 ns per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 673 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 657 ns per loop
Hmmmmmmm... Well, it looks like the last one is the best, even though it is a bit slow for hashing. If the size of a block is a power of two it should be quite good, don't you think?
This being said, perhaps it should be implemented like the bitset stuff, in an algebra-free file, to store sequences of integers as you said.
How do you define your hashing function and your comparison function, by the way ?
Nathann
Replying to @nathanncohen:
Hmmmmmmm... Well, looks like the last one is the best, even though it is a bit slow for hashing.
What do you mean by "the last one"? Using GMP's mpz_t directly, which is Path_v5 in the attachment? Or the improved "semicompressed" version of uint64_t* (storing chunks of size 2^n rather than in the smallest possible chunk size), which is Path_v4?
If the size of a block is a power of two it should be quite good, don't you think?
Should be. My own summary of the timings above would be this. We have 6 benchmarks, and in the following list I am giving for each benchmark a small, a medium and a large example. In the end, + or - indicate whether the respective implementation did well or poorly in the three examples.
Concatenation:
5x5 50x50 1000x1000
Uncompressed long : 739 ns 885 ns 3.06 us - -
Uncompressed short: 700 ns 808 ns 2.13 us
Uncompressed char : 679 ns 772 ns 1.31 us +
Compressed long : 676 ns 759 ns 1.61 us
Compressed short: 718 ns 722 ns 1.98 us +
Semicompr. long : 684 ns 827 ns 1.42 us
Integer : 645 ns 1.82 us 1.78 us +-
GMP mpz_t: 723 ns 799 ns 1.41 us
Hash:
5 100 1000
Uncompressed long : 135 ns 333 ns 2.09 us +
Uncompressed short: 138 ns 383 ns 2.61 us +--
Uncompressed char : 133 ns 374 ns 2.53 us +
Compressed long : 132 ns 155 ns 466 ns +++
Compressed short: 132 ns 206 ns 951 ns +
Semicompr. long : 148 ns 194 ns 594 ns
Integer : 164 ns 274 ns 1.94 us -
GMP mpz_t: 159 ns 266 ns 1.93 us
Comparison (equal paths, p==s):
5 100 1000
Uncompressed long : 1.63 us 23 us 212 us ---
Uncompressed short: 1.08 us 14.9 us 137 us
Uncompressed char : 1.08 us 12.5 us 121 us
Compressed long : 710 ns 4.4 us 38 us
Compressed short: 723 ns 5.3 us 47.5 us
Semicompr. long : 498 ns 712 ns 2.4 us
Integer : 667 ns 658 ns 817 ns
GMP mpz_t: 466 ns 485 ns 693 ns +++
Comparison (paths differing in the first item, p==r):
5 100 1000
Uncompressed long : 634 ns 686 ns 698 ns
Uncompressed short: 613 ns 603 ns 635 ns
Uncompressed char : 581 ns 599 ns 636 ns
Compressed long : 692 ns 729 ns 752 ns --
Compressed short: 577 ns 588 ns 607 ns
Semicompr. long : 499 ns 526 ns 545 ns
Integer : 746 ns 731 ns 733 ns --
GMP mpz_t: 451 ns 468 ns 482 ns +++
Comparison (paths differing in the last item, q==r):
5 100 1000
Uncompressed long : 1.65 us 22 us 213 us ---
Uncompressed short: 1.15 us 12.8 us 140 us
Uncompressed char : 1.06 us 12.7 us 121 us
Compressed long : 706 ns 4.4 us 37.6 us
Compressed short: 754 ns 5.3 us 47.3 us
Semicompr. long : 506 ns 704 ns 2.5 us
Integer : 710 ns 719 ns 755 ns
GMP mpz_t: 448 ns 457 ns 470 ns +++
Startswith:
yes no
Uncompressed long : 325 ns 311 ns
Uncompressed short: 336 ns 325 ns
Uncompressed char : 324 ns 315 ns
Compressed long : 272 ns 257 ns +
Compressed short: 272 ns 256 ns +
Semicompr. long : 263 ns 232 ns ++
Integer : 2.46 us 2.45 us --
GMP mpz_t: 673 ns 657 ns
So, I don't think there is a clear winner in the competition.
This being said, perhaps it should be implemented like the bitset stuff, in an algebra-free file, to store sequences of integers as you said.
Yes, it would make sense.
How do you define your hashing function and your comparison function, by the way ?
Depends on the implementation. Of course, when I use `Integer` or `mpz_t`, I am using the hash function provided by GMP. When I implement a `block*`, where `block` is either `char`, `uint32_t` or `uint64_t`, I am using some hash function for lists that I found on the internet a while ago (I don't recall where). It is as follows:
```cython
def __hash__(self):
    cdef block h
    cdef block* ptr
    h = <block>(self.start^(self.end<<16)^(self._len))
    ptr = <block *>self.data
    cdef size_t i
    h ^= FNV_offset
    for i from 0<=i<self._len:
        h ^= ptr[i]
        h *= FNV_prime
    if h==-1:
        return -2
    return h
```
where `FNV_prime` and `FNV_offset` depend on the architecture. I am surprised that this is faster than GMP's hash function.
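The loop above follows the FNV-1a pattern: XOR in the next item, then multiply by the FNV prime. A pure-Python sketch with the standard 64-bit FNV constants might look as follows (the `start`/`end`/`_len` mixing from the Cython version is omitted, and the mask emulates unsigned 64-bit overflow):

```python
# Pure-Python sketch of the FNV-1a-style list hash quoted above.
# The 64-bit FNV offset basis and prime are the standard published
# constants; the Cython version picks them per architecture.
FNV_OFFSET = 14695981039346656037
FNV_PRIME = 1099511628211
MASK = (1 << 64) - 1  # emulate 64-bit unsigned wrap-around

def fnv1a_hash(items):
    h = FNV_OFFSET
    for x in items:
        h ^= x
        h = (h * FNV_PRIME) & MASK
    # CPython reserves -1 as an error marker for __hash__; with the
    # unsigned mask h can never be -1 here, but the signed Cython
    # version needs this guard.
    return -2 if h == -1 else h
```

Since the multiply happens after each XOR, the hash is sensitive to the order of the items, which is what one wants for sequences.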
Description changed:

```diff
---
+++
@@ -1,4 +1,4 @@
 #12630 has introduced tools to compute with quiver algebras and quiver representations. It is all Python code, paths being internally stored by lists of labelled edges of digraphs.
-The aim of this ticket is to provide a more efficient implementation of the (partial) semigroup that is formed by the paths of a quiver, multiplied by concatenation.
+The aim of this ticket is to provide an efficient implementation of lists of bounded integers, with concatenation. This can be used to implement paths of a quiver more efficiently, but can also have different applications (say, elements of free groups, or other applications in sage-combinat).
```
As Nathann has mentioned, it would make sense to provide a general implementation of sequences of bounded integers (independent of quiver paths), and then quiver paths can inherit from it later. That's why I changed the ticket description.
Changed keywords from quiver path algebra efficiency to sequence bounded integer
Changed dependencies from #12630 to none
Attachment: path_test.pyx.gz
A proof of concept
For completeness, I have added `Path_v6`, which uses Python's `tuple` under the hood (which is an obvious way to implement paths, too).
Result:
sage: %timeit p5*q5
1000000 loops, best of 3: 439 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 542 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 612 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 869 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 1.1 us per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 1.39 us per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.41 us per loop
sage: %timeit p100*q1000
100000 loops, best of 3: 5.04 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 11 us per loop
sage: %timeit hash(p5)
1000000 loops, best of 3: 245 ns per loop
sage: %timeit hash(p25)
1000000 loops, best of 3: 532 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 884 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 1.63 us per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 14.6 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 885 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.71 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.63 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.53 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 39.3 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 785 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 829 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 799 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 791 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 822 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.48 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.72 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.63 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.5 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 116 us per loop
sage: %timeit q1000.startswith(q100)
100000 loops, best of 3: 4.7 us per loop
sage: %timeit q1000.startswith(r100)
100000 loops, best of 3: 4.72 us per loop
The comparison with the other implementations becomes this:
5x5 50x50 1000x1000
Uncompressed long : 739 ns 885 ns 3.06 us - -
Uncompressed short: 700 ns 808 ns 2.13 us
Uncompressed char : 679 ns 772 ns 1.31 us +
Compressed long : 676 ns 759 ns 1.61 us
Compressed short: 718 ns 722 ns 1.98 us +
Semicompr. long : 684 ns 827 ns 1.42 us
Integer : 645 ns 1.82 us 1.78 us -
GMP mpz_t: 723 ns 799 ns 1.41 us
tuple : 439 ns 1.1 us 11 us + -
5 100 1000
Uncompressed long : 135 ns 333 ns 2.09 us +
Uncompressed short: 138 ns 383 ns 2.61 us +--
Uncompressed char : 133 ns 374 ns 2.53 us +
Compressed long : 132 ns 155 ns 466 ns +++
Compressed short: 132 ns 206 ns 951 ns +
Semicompr. long : 148 ns 194 ns 594 ns
Integer : 164 ns 274 ns 1.94 us
GMP mpz_t: 159 ns 266 ns 1.93 us
tuple : 245 ns 1.63 us 14.6 us ---
5 100 1000
Uncompressed long : 1.63 us 23 us 212 us ---
Uncompressed short: 1.08 us 14.9 us 137 us
Uncompressed char : 1.08 us 12.5 us 121 us
Compressed long : 710 ns 4.4 us 38 us
Compressed short: 723 ns 5.3 us 47.5 us
Semicompr. long : 498 ns 712 ns 2.4 us
Integer : 667 ns 658 ns 817 ns
GMP mpz_t: 466 ns 485 ns 693 ns +++
tuple : 885 ns 4.5 us 39.3 us
5 100 1000
Uncompressed long : 634 ns 686 ns 698 ns
Uncompressed short: 613 ns 603 ns 635 ns
Uncompressed char : 581 ns 599 ns 636 ns
Compressed long : 692 ns 729 ns 752 ns
Compressed short: 577 ns 588 ns 607 ns
Semicompr. long : 499 ns 526 ns 545 ns
Integer : 746 ns 731 ns 733 ns
GMP mpz_t: 451 ns 468 ns 482 ns +++
tuple : 785 ns 791 ns 822 ns ---
5 100 1000
Uncompressed long : 1.65 us 22 us 213 us ---
Uncompressed short: 1.15 us 12.8 us 140 us
Uncompressed char : 1.06 us 12.7 us 121 us
Compressed long : 706 ns 4.4 us 37.6 us
Compressed short: 754 ns 5.3 us 47.3 us
Semicompr. long : 506 ns 704 ns 2.5 us
Integer : 710 ns 719 ns 755 ns
GMP mpz_t: 448 ns 457 ns 470 ns +++
tuple : 1.48 us 12.5 us 116 us
yes no
Uncompressed long : 325 ns 311 ns
Uncompressed short: 336 ns 325 ns
Uncompressed char : 324 ns 315 ns
Compressed long : 272 ns 257 ns +
Compressed short: 272 ns 256 ns +
Semicompr. long : 263 ns 232 ns ++
Integer : 2.46 us 2.45 us
GMP mpz_t: 673 ns 657 ns
tuple : 4.7 us 4.72 us --
So, using tuples is best for concatenation of short paths, but that's the only case where it is not too slow.
Well, given the outcomes I guess it mostly depends on your own practical applications ?... :-)
I also wonder (because I hate them) how much time is spent dealing with parents. But all your implementations have to go through that at the moment so it does not change your comparisons.
Well, I am back to work at long long last. With a computer AND a screen. And in Brussels, which counts for the good mood :-P
Have fuuuuuuuuuuuuuun !
Nathann
Hi Nathann,
Replying to @nathanncohen:
I also wonder (because I hate them) how much time is spent dealing with parents.
Well, at some point parents have to come into play, namely when I want to implement not just sequences of bounded integers, but elements of the path semigroup of a quiver.
The strategic question is: Should the data defining a path be stored in an attribute of that path (the attribute would be an instance of `BoundedIntegerSequence`), or should the path itself be an instance of `BoundedIntegerSequence`? In the former case, it would be perfectly fine to have sequences of bounded integers parent-less; but I worry about the overhead of accessing the attribute. In the latter case, `BoundedIntegerSequence` would already have to be a sub-class of `sage.structure.element.Element` (and have a parent), since Cython does not allow multiple inheritance.
I tend to the latter solution also for a different reason: Instances of `BoundedIntegerSequence` have to know the bound for their elements, because the data storage depends on that bound. For efficiency (see attachment), several other constants are derived from that bound and used in the algorithms.
Should each instance of `BoundedIntegerSequence` know about all these bounds? Or should there be a common parent for all sequences of integers that are bounded by a number B?
I tend to the latter. If you assign several ints (the afore-mentioned constants) to an instance of `BoundedIntegerSequence`, it will likely be more time-consuming (during initialisation) than storing just a single datum (namely the parent).
That said, looking up the constants via the parent would be more time-consuming than looking up constants that are stored directly in the element.
Difficult....
Hellooooo Simon !
Hmmmm... All this is important to settle before implementing those BoundedIntegerSequences objects somewhere indeed.
What I thought you intended was to write a very low-level data structure somewhere in the misc/ folder. The kind of stuff that one does not use by mistake, the kind of stuff that one does not use without reading the manual first. So I thought it would not exactly check its input, blindly trusting the developer and being as efficient as possible.
Then, I thought you would have higher-level algebraic stuff with parents, colors and blinking lights that would check its input indeed, and be what we want a user-friendly object to be.
But I actually have only one question: I don't know how you intend to use all this in the end, but I got the impression that several functions may have to generate a LOT of paths in order to compute some small data, and throw all the paths away once the computations have ended. Do you want to create high-level objects in those functions or not ?
If you don't, then perhaps you would be better with only low-level objects that you know how to use, which would not have to waste time (if time is actually wasted) on high-level problems.
And so I thought that your high-level objects would have a variable which would be this low-level object, etc, etc...
What do you have in mind ?
Nathann
Hi Nathann,
Replying to @nathanncohen:
What I thought you intended was to write a very low-level data structure somewhere in the misc/ folder.
I somehow do.
The kind of stuff that one does not use by mistake,
Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).
Then, I thought you would have a higher-level algebraic stuff with parents colors and blinking lights that would check their input indeed, and be what we want a user-friendly object to be.
Again ideally: Blinking colours should only be added when one really implements paths in quivers.
But I actually have only one question: I don't know how you intend to use all this in the end, but I got the impression that several functions may have to generate a LOT of paths in order to compute some small data, and throw all the paths away once the computations have ended. Do you want do create high-level objects in those functions or not ?
Good point. So, you mean that these functions should actually not create fully-fledged quiver paths, but should just operate on instances of `BoundedIntegerSequence`, and only in the end interpret them as paths, when the battle smoke has vanished.
And so I thought that you high-level objects would have a variable which would be this low-level object, etc, etc...
Yes, probably that's better.
So, perhaps I should really do `cdef class BoundedIntegerSequence(tuple)` rather than `cdef class BoundedIntegerSequence(MonoidElement)`.
But then, what to do with all these internally used constants that are needed even for the low-level functionality of `BoundedIntegerSequence`? Would you prefer to compute and store these constants 10000 times when creating 10000 sequences of integers bounded by `B`? Or would you prefer to compute and store these constants only once, namely in a parent that hosts all sequences of integers with a fixed bound?
Or do you have a third solution for storing/accessing these constants efficiently?
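To make the trade-off concrete, here is a hypothetical pure-Python sketch of the "constants live in a shared parent" design (all names, such as `BoundedIntParent` and `bits_per_item`, are mine, not from the ticket): the constants derived from a bound B are computed once, and every sequence stores only a reference to the shared object.

```python
# Hypothetical sketch: one shared "parent" per bound B caches the
# constants derived from B, so each sequence stores a single reference
# instead of recomputing or copying several ints.
_parent_cache = {}

class BoundedIntParent:
    def __init__(self, bound):
        self.bound = bound
        # number of bits needed to store one item < bound
        self.bits_per_item = max(1, (bound - 1).bit_length())
        # mask selecting one item
        self.item_mask = (1 << self.bits_per_item) - 1
        # how many items fit into one 64-bit block
        self.items_per_block = 64 // self.bits_per_item

def get_parent(bound):
    # the constants are computed only once per bound
    if bound not in _parent_cache:
        _parent_cache[bound] = BoundedIntParent(bound)
    return _parent_cache[bound]

class BoundedIntSequence:
    def __init__(self, bound, items):
        self.parent = get_parent(bound)  # the single stored datum
        self.items = list(items)
```

Creating 10000 sequences with the same bound then touches the derived constants only once, at the price of one extra indirection (`self.parent.item_mask`) on each use.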
Helloooooo !!
Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).
Why do you want it to inherit from tuple ? I do not know the effect of such a thing, what it brings and what should be feared.
Good point. So, you mean that these functions should actually not create fully-fledged quiver paths, but should just operate on instances of `BoundedIntegerSequence`, and only in the end interpret them as paths, when the battle smoke has vanished.
Well, yep. Only create the high-level stuff with possible overhead when you need it. And you end up knowing exactly what your code does when you are writing the code of your time-critical function.
So, perhaps I should really do `cdef class BoundedIntegerSequence(tuple)` rather than `cdef class BoundedIntegerSequence(MonoidElement)`.
Hmmm.. I still don't get why you want it to inherit from tuple. My natural move would be to implement this as a C structure, not even a object :-P
But then, what to do with all these internally used constants that are needed even for the low-level functionality of `BoundedIntegerSequence`? Would you prefer to compute and store these constants 10000 times when creating 10000 sequences of integers bounded by `B`? Or would you prefer to compute and store these constants only once, namely in a parent that hosts all sequences of integers with a fixed bound?
Well, somehow that's already what we do with graph backends. We have a Generic Backend, extended twice in Dense Graphs and Sparse Graphs. And all the constants needed by the data structures are stored in the corresponding files. You could have a `BoundedIntegerSequence` class implementing no function at all, extended by all different implementations of the data structures you want. Even in the same file.
... And we don't need parents for that :-PPPP
Nathann
Replying to @nathanncohen:
Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).
Why do you want it to inherit from tuple ?
By analogy to `Sequence_generic` (which inherits from `list`):
sage: S = Sequence([1,2,3])
sage: type(S)
<class 'sage.structure.sequence.Sequence_generic'>
sage: type(S).mro()
[sage.structure.sequence.Sequence_generic,
sage.structure.sage_object.SageObject,
list,
object]
Well, somehow that's already what we do with graph backends. We have a Generic Backend, extended twice in Dense Graphs and Sparse Graphs. And all the constants needed by the data structures are stored in the corresponding files.
No, that's a different situation. If you have a sequence `S1` of integers bounded by `B1` and a sequence `S2` of integers bounded by `B2`, then the constants for `S1` are different from the constants of `S2`. So, it is not really a "global" constant that is shared by all instances of a certain data structure (and could thus be put into a file): when you do, for example, `for x in S1`, then you'll need different constants than when you do `for x in S2`.
So, these constants are not shared by all sequences, but by all sequences that share the same bound. Thus, it seems natural (to me, at least) to have one object `O(B)` that is shared by all sequences of bound `B`, so that `O(B)` provides the constants that belong to `B`.
Sure, `O(B)` could be written in C as well. But why not make it a parent, when all we do is let each sequence have a pointer to `O(B)`?
By analogy to `Sequence_generic` (which inherits from `list`):
?...
Well. And why is that a good idea ? :-P
No, that's a different situation. If you have a sequence `S1` of integers bounded by `B1` and a sequence `S2` of integers bounded by `B2`, then the constants for `S1` are different from the constants of `S2`. So, it is not really a "global" constant that is shared by all instances of a certain data structure (and could thus be put into a file): when you do, for example, `for x in S1`, then you'll need different constants than when you do `for x in S2`.
Hmmmm... I see. Well, isn't your only parameter the number of bits you need to store each entry ? If it is, did you decide whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?
If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)
I'd say.... 4,8,16,32 ? :-)
So, these constants are not shared by all sequences, but by all sequences that share the same bound. Thus, it seems natural (to me, at least) to have one object `O(B)` that is shared by all sequences of bound `B`, so that `O(B)` provides the constants that belong to `B`.
If your data structure is not an object, you could also add this width parameter as an input of the functions you use ? I guess all `BoundedIntegerSequence`s that you use in the same function would have the same bound.
But admittedly I love to split hairs. Tell me if you think that this is going too far :-P
Sure, `O(B)` could be written in C as well. But why not make it a parent, when all we do is let each sequence have a pointer to `O(B)`?
Ahahahah. That's up to you. I just hate Sage's parents/elements. I would live happily with a C pointer I can control, but not with a framework that I do not understand. If you understand it, then you know what you are doing. I don't :-P
Nathann
Replying to @nathanncohen:
Hmmmm... I see. Well, isn't your only parameter the number of bits you need to store each entry ?
No. We have

```cython
cdef Integer NumberArrows
cdef Integer IntegerMask
cdef size_t ArrowBitSize, BigArrowBitSize
cdef block* bitmask = NULL
cdef block DataMask, BitMask
cdef size_t ArrowsPerBlock, BigArrowsPerBlock
cdef size_t ModBAPB, DivBAPB, TimesBABS, TimesBlockSize
```
For example, `ModBAPB` is used to replace `n % BigArrowsPerBlock` by a bit-wise `and`, which is noticeably faster.
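This is the usual trick for power-of-two divisors: `n % 2**k` equals `n & (2**k - 1)` and `n // 2**k` equals `n >> k`. A small Python illustration (the variable names echo `ModBAPB`/`DivBAPB` but are hypothetical):

```python
# If BigArrowsPerBlock is a power of two, say 2**k, then modulo and
# division by it reduce to a bit-wise AND and a right shift.
big_arrows_per_block = 16                  # assume 2**4 for the sketch
k = big_arrows_per_block.bit_length() - 1  # the shift, here 4
mod_mask = big_arrows_per_block - 1        # plays the role of ModBAPB

def split_index(n):
    # (block number, position within block) without any division
    return n >> k, n & mod_mask

# sanity check against plain divmod
for n in range(100):
    assert split_index(n) == divmod(n, big_arrows_per_block)
```

Precomputing the mask and shift once (e.g. in a shared parent) is what makes storing these derived constants worthwhile.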
If it is, did you decide whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?
Depending on the implementation...
If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)
Isn't duplication of code supposed to smell?
If your data structure is not an object, you could also add this width parameter as an input of the functions you use ? I guess all Bounded Sequence Integer that you use in the same function would have the same bound.
Do I understand correctly: You suggest that we should not provide a class `BoundedIntegerSequence` that can be used interactively in Sage, but you suggest that we should just provide a couple of functions that operate on C structures (say, `mpz_t` or `unsigned long*`) and can only be used in Cython code?
Then it would be better to revert this ticket to its original purpose: There would be...

- ... cdef functions operating on `mpz_t` resp. on `unsigned long*`,
- ... a Cython class `Path` using `mpz_t` resp. `unsigned long*` as a cdef attribute, using the afore-mentioned functions to implement concatenation and iteration,
- ... a subsequent ticket, implementing an F5-style algorithm to compute standard bases, operating not with `Path` but with `mpz_t*` resp. `unsigned long**` (which will be the case anyway).

But admittedly I love to split hairs. Tell me if you think that this is going too far :-P
No problem. I just want to know if people see the need to have a Cython implementation of sequences of bounded integers, for use in Python.
Ahahahah. That's up to you. I just hate Sage's parents/elements. I would live happily with a C pointer I can control, but not with a framework that I do not understand.
Making it something like a container rather than a fully fledged parent would be an option, from my perspective.
Replying to @simon-king-jena:
Making it something like a container rather than a fully fledged parent would be an option, from my perspective.
... I mean, for `BoundedIntegerSequence`. Of course, `Path` must have a parent (the path semigroup of a quiver).
Just a random thought. Some issues here are of the same nature as what was discussed around ClonableElement and its subclasses. Maybe there is stuff to share with it, especially if one ends up inheriting from Element.
I just had a look at `ClonableElement`, and one immediate thought is: Why are C-pointers allocated in `__init__`? That's clearly the purpose of `__cinit__`, in particular if said C-pointers are freed in `__dealloc__`.
Do we want to have sequences of bounded integers for use in Python? Then one possible way would be:

- `ClonableElement` (so that we can have sequences of bounded integers used in Python)
- a `Path` class, to be used in path semigroups.

Yooooooo !!!
No. We have

```cython
cdef Integer NumberArrows
cdef Integer IntegerMask
cdef size_t ArrowBitSize, BigArrowBitSize
cdef block* bitmask = NULL
cdef block DataMask, BitMask
cdef size_t ArrowsPerBlock, BigArrowsPerBlock
cdef size_t ModBAPB, DivBAPB, TimesBABS, TimesBlockSize
```
Hmmmm.. Can't they all be deduced from the type of implementation (compressed long, for instance) and the number of bits per entry then ?
For example, `ModBAPB` is used to replace `n % BigArrowsPerBlock` by a bit-wise `and`, which is noticeably faster.
Yepyep of course.
If it is, did you decide whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?
Depending on the implementation...
Okay....
If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)
Isn't duplication of code supposed to smell?
Well, if it is a template the code is only implemented once. And it will generate several data structures at compile time indeed, but each of them will have a hardcoded constant, and you only implement one.
Do I understand correctly: You suggest that we should not provide a class `BoundedIntegerSequence` that can be used interactively in Sage, but you suggest that we should just provide a couple of functions that operate on C structures (say, `mpz_t` or `unsigned long*`) and can only be used in Cython code?
That's what I had in mind, but you make decisions here, pick what you like.
It's a bit how bitsets are implemented. A C data structure, and an object wrapper on top of it. Or C Graphs, with a data structure on top of it. It is how Permutations should be implemented, too :-P
Well, you have a C object that you use when you want performance, and when you don't care you use the higher-level implementations.
Then it would be better to revert this ticket to its original purpose: There would be...

- ... cdef functions operating on `mpz_t` resp. on `unsigned long*`,
- ... a Cython class `Path` using `mpz_t` resp. `unsigned long*` as a cdef attribute, using the afore-mentioned functions to implement concatenation and iteration,
- ... a subsequent ticket, implementing an F5-style algorithm to compute standard bases, operating not with `Path` but with `mpz_t*` resp. `unsigned long**` (which will be the case anyway).
Well, I like this plan. What do you think ? Does it displease you in any way ?
No problem. I just want to know if people see the need to have a Cython implementation of sequences of bounded integers, for use in Python.
I do not see how I could use it in my code at the moment, but I like the idea of having such a data structure somewhere, ready to be used. Most importantly, if you want speed, I think that having it available at two different levels is a safe bet. You write more code, but you will know how time is spent in your time-critical functions.
Nathann
Hi Nathann,
Replying to @nathanncohen:
Hmmmm.. Can't they all be deduced from the type of implementation (compressed long, for instance) and the number of bits per entry then ?
Sure. And you want to do this same computation `10^7` times?
Isn't duplication of code supposed to smell?
Well, if it is a template the code is only implemented once. And it will generate several data structures at compile time indeed, but each of them will have a hardcoded constant, and you only implement one.
Yes. Probably my problem is that I don't think "C++", but I think "Cython". And there it is a bit clumsy (I think templating is mimicked in Cython by including .pxi files, isn't it?).
That's what I had in mind, but you make decisions here, pick what you like.
It's a bit how bitsets are implemented.
Right, that's what I have looked at, and think makes sense here. Up to the constants.
Then it would be better to revert this ticket to its original purpose: There would be...

- ... cdef functions operating on `mpz_t` resp. on `unsigned long*`,
- ... a Cython class `Path` using `mpz_t` resp. `unsigned long*` as a cdef attribute, using the afore-mentioned functions to implement concatenation and iteration,
- ... a subsequent ticket, implementing an F5-style algorithm to compute standard bases, operating not with `Path` but with `mpz_t*` resp. `unsigned long**` (which will be the case anyway).

Well, I like this plan. What do you think ? Does it displease you in any way ?
No, it is one possibility.
Sure. And you want to do this same computation `10^7` times?
I don't get it. If the width is a power of two, then there are 6 possible values, i.e. 2^1, 2^2, 2^3, 2^4, 2^5, 2^6. Otherwise you have at most 64, but really who cares in the end ?
Yes. Probably my problem is that I don't think "C++", but I think "Cython". And there it is a bit clumsy (I think templating is mimicked in Cython by including .pxi files, isn't it?).
I am thinking of this : http://stackoverflow.com/questions/4840819/c-template-specialization-with-constant-value
That's C++, and I don't see how to do it in Cython, though. The point is that you can parametrize a data structure with a constant value, and the data structure will be compiled for each value of the constant that appears in the code at compile-time (not all possible values, of course). The point is that the code is optimized while knowing the value (which can lead to optimizations), and also that you can do things that would not be possible otherwise, like `cdef int a[k]` where k is a (templated) variable.
Nathann
Replying to @nathanncohen:
Sure. And you want to do this same computation `10^7` times?

I don't get it. If the width is a power of two, then there are 6 possible values, i.e. 2^1, 2^2, 2^3, 2^4, 2^5, 2^6. Otherwise you have at most 64, but really, who cares in the end ?
Apparently I am not used to templated code...
I was somehow thinking that you suggested to keep the bitlength of the upper bound for the integers as the only parameter, which means that you would need to compute all the other things either for each operation or (when storing it as attribute of the sequences) for each sequence.
I am thinking of this : http://stackoverflow.com/questions/4840819/c-template-specialization-with-constant-value
That's C++, and I don't see how to do it in Cython, though.
Part of the question is: Do we want to use gmp as backend? Or do we want to (re-)implement operations on densely packed `long*` by ourselves? Given the timings, gmp is an option.
Anyway, I'll have a look at the stackoverflow discussion. Thank you for the pointer!
The next question then is how to access the code from Cython. I've done this with C, but not C++.
Apparently I am not used to templated code...
I was somehow thinking that you suggested to keep the bitlength of the upper bound for the integers as the only parameter, which means that you would need to compute all the other things either for each operation or (when storing it as attribute of the sequences) for each sequence.
Nonono. Somehow, you can think of templates as a way to "script" the generation of code. You write some code with a constant k which is never defined anywhere, and then you say "compile one version of this code for each k in 1, 2, 3, 4, 5, 6". You end up with several data structures implemented, each with a hardcoded value. And then you can create an object `cdef LongCompressedPath<5> my_path` in which the hardcoded bound is `2^5`. You just have many additional types, each with its own constant.
Part of the question is: Do we want to use gmp as backend? Or do we want to (re-)implement operations on densely packed `long*` by ourselves? Given the timings, gmp is an option.
Yep, you are right. Depends on how you weight the operations that you will need: it looks okay for concatenations and bad for startswith, but I expect that the most time-consuming operation will be concatenation ? You are the one who needs it :-)
Anyway, I'll have a look at the stackoverflow discussion. Thank you for the pointer!
The next question then is how to access the code from Cython. I've done this with C, but not C++.
HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing if this kind of templates can be written in pure Cython, do you mind if I ask the question on cython-devel (did you want to send one yourself) ?
Nathann
Hi Nathann,
sorry for the long silence!
Replying to @nathanncohen:
HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing if this kind of templates can be written in pure Cython, do you mind if I ask the question on cython-devel (did you want to send one yourself) ?
Did you find out how templated code can be written in Cython? If you have asked on cython-devel (so far I have only posted on cython-users) or have another pointer, then please tell me. If not, tell me also...
Cheers, Simon
Hi Nathann,
wouldn't the following be close to templates?
In Cython:
```cython
cimport cython

ctypedef fused str_or_int:
    str
    int

cpdef str_or_int times_two(str_or_int var):
    return var+var

def show_me():
    cdef str a = 'a'
    cdef int b = 3
    print 'str', times_two(a)
    print 'int', times_two(b)
```
and this results in
sage: show_me()
str aa
int 6
Or perhaps a better example, showing that the fused type is used to choose code at compile time: Put the following into a cell of the notebook
```cython
%cython
cimport cython

ctypedef fused str_or_int:
    str
    int

cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var*2
    else:
        return var+var

def test():
    cdef int a = 3
    cdef str b = 'b'
    a = times_two(a)
    b = times_two(b)
```
and look at the resulting code. The line `a = times_two(a)` becomes

```c
__pyx_v_a = __pyx_fuse_1__pyx_f_68_home_king__sage_sage_notebook_sagenb_home_admin_2_code_sage4_spyx_0_times_two(__pyx_v_a, 0);
```

whereas `b = times_two(b)` becomes

```c
__pyx_t_1 = __pyx_fuse_0__pyx_f_68_home_king__sage_sage_notebook_sagenb_home_admin_2_code_sage4_spyx_0_times_two(__pyx_v_b, 0);
```

and one can check that Cython created two different C-functions `__pyx_fuse_0...` and `__pyx_fuse_1...` for the two possible (type-dependent!) versions of `times_two`.
And here an example showing that it pays off, speed-wise:
In a .pyx file:
```cython
cimport cython

ctypedef fused str_or_int:
    str
    int

cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var+var
    else:
        return var+var

cpdef untyped_times_two(var):
    return var+var

cpdef int int_times_two(int var):
    return var+var

cpdef str str_times_two(str var):
    return var+var

def test1():
    cdef int a = 3
    cdef str b = 'b'
    a = times_two(a)
    b = times_two(b)

def test2():
    cdef int a = 3
    cdef str b = 'b'
    a = untyped_times_two(a)
    b = untyped_times_two(b)

def test3():
    cdef int a = 3
    cdef str b = 'b'
    a = int_times_two(a)
    b = str_times_two(b)
```
Attach the .pyx file to Sage, and get this:
sage: %attach /home/king/Sage/work/templates/test.pyx
Compiling /home/king/Sage/work/templates/test.pyx...
sage: %timeit test1()
10000000 loops, best of 3: 159 ns per loop
sage: %timeit test2()
1000000 loops, best of 3: 199 ns per loop
sage: %timeit test3()
10000000 loops, best of 3: 168 ns per loop
EDIT: I have replaced the original example by something fairer, where it says var+var
everywhere, but in the templated version Cython can choose the c-implementation of var+var
at compile time, whereas it can't (and hence loses time when running the code) in the untemplated version.
EDIT 2: The new example also shows that choosing a specialised implementation manually is not faster than Cython using templates.
Hellooooooooooooooooo !!!
EDIT: I have replaced the original example by something fairer, where it says
var+var
everywhere, but in the templated version Cython can choose the c-implementation of var+var
at compile time, whereas it can't (and hence loses time when running the code) in the untemplated version.
HMmmmmm... Does it make any difference ? gcc should resolve those "if" at compile-time, so if Cython creates two versions of the function with an "if" inside depending on the type, gcc should not include it in the final binary as the constants involved can be defined at compile-time.
Anyway, that's a good news ! :-D
Nathann
Yooooooooooo !!
HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing if this kind of templates can be written in pure Cython, do you mind if I ask the question on cython-devel (did you want to send one yourself) ?
Did you find out how templated code can be written in Cython? If you have asked on cython-devel (so far I have only posted on cython-users) or have another pointer, then please tell me. If not, tell me also...
Ahahah.. Well, no ... When I ask a question I do not do anything before I get an answer, soooooooooo I never did it :-P
I ask a question, then forget everything about it. When I get the answer I do what should be done :-P
Nathann
Hi Nathann,
Replying to @nathanncohen:
HMmmmmm... Does it make any difference ? gcc should resolve those "if" at compile-time, so if Cython creates two versions of the function with an "if" inside depending on the type, gcc should not include it in the final binary as the constants involved can be defined at compile-time.
Yes, it is indeed resolved at compile time. Two c-functions are created, and only the part of the code relevant for the respective code branch appears in the c-function.
In other words, I now plan to create cdef functions like this:
ctypedef fused seq_t:
    first_implementation
    second_implementation

cdef seq_t concat(seq_t x, seq_t y):
    if seq_t is first_implementation:
        <do stuff for 1st implementation>
    else:
        <do stuff for 2nd implementation>
and then, in other modules (perhaps also in src/sage/combinat/words/word_datatypes.pyx
?), one could uniformly call concat(a,b)
, provided that the type of a
and b
are known at compile time (or write concat[first_implementation](a,b)
explicitly).
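For readers less familiar with fused types, the effect of concat[first_implementation] can be imitated in plain Python by a table of specialisations keyed by type, with the crucial difference that Cython resolves the choice at compile time, so the dispatch costs nothing at runtime. A hypothetical sketch (all names invented for illustration, not code from this ticket):

```python
# Hypothetical runtime analogue of Cython's fused-type specialisation.
# Cython picks the right C function at compile time; here a dict lookup
# does the same selection at runtime, purely to illustrate the semantics.

def concat_list(x, y):
    # "first implementation": sequences stored as Python lists
    return x + y

def concat_str(x, y):
    # "second implementation": sequences stored as strings
    return x + y

# concat[first_implementation](a, b) in Cython corresponds to
# looking up one entry of this table and calling it directly:
concat = {list: concat_list, str: concat_str}

def concat_any(x, y):
    # the analogue of calling concat(a, b) with statically known types
    return concat[type(x)](x, y)
```

In the compiled Cython version there is no table at all: each call site is bound to exactly one specialisation.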
I think that such Cython code will be sufficiently nice to read, and the resulting C-code will be sufficiently fast.
Anyway, that's a good news !
:-D
Indeed!
Simon
I think that such Cython code will be sufficiently nice to read, and the resulting C-code will be sufficiently fast.
+1
Nathann
Hmmm. Meanwhile I am a bit less positive about fused types. Let seq_t
be a fused type for different implementations.
Fused types can not be used in iterators. I.e., the following is refused:
def iter_seq(seq_t S):
    for <bla>:
        <do something>
        yield x
Fused types can not be used as attributes of cdef classes. I.e., the following is refused:
cdef class BoundedIntSeq:
    cdef seq_t seq
    def __mul__(self, other):
        <do type check>
        cdef seq_t out = concat_seq(self.seq, other.seq)
        <create and return new instance of BoundedIntSeq using "out">
Hence, it seems that one needs to create two separate iterators, and it also seems that one needs code duplication. I.e., when impl1
and impl2
are two specialisations of the fused type seq_t
, then one has to have
cdef class BoundedIntSeq_impl1:
    cdef impl1 seq
    def __mul__(self, other):
        ...
        cdef impl1 out = concat_seq(self.seq, other.seq)

cdef class BoundedIntSeq_impl2:
    cdef impl2 seq
    def __mul__(self, other):
        ...
        cdef impl2 out = concat_seq(self.seq, other.seq)
This mainly amounts to cut-and-paste, but is ugly!
I think I'll ask on cython-users if there are known ideas to solve this more elegantly.
Some variation on the theme: One can create fused types from two cdef classes, and then one can work with specialisations of functions as follows:
ctypedef fused str_or_int:
    str
    int

# make the function do different things on different types,
# to better see the difference
cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var+var
    else:
        return var+'+'+var

# forward declaration of cdef classes and their fusion
cdef class Bar_int
cdef class Bar_str
cdef fused Bar:
    Bar_int
    Bar_str

# a function that can be specialised to either of the two extension classes
def twice(Bar self):
    # and inside, we specialise a function depending
    # on the type of an attribute --- at compile time!!
    return type(self)(times_two(self.data))

cdef class Bar_str:
    cdef str data
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return self.data
    # First possibility: use the "templated" function as a method
    twice_ = twice[Bar_str]
    # Second possibility: wrap the "templated" function
    cpdef Bar_str twice__(self):
        return twice(self)

# The same, with the other specialisation of <Bar>
cdef class Bar_int:
    cdef int data
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return "int(%d)"%self.data
    twice_ = twice[Bar_int]
    cpdef Bar_int twice__(self):
        return twice(self)
Time-wise, we see that wrapping the "templated" function is slower than using it as an attribute:
sage: b1 = Bar_str('abc')
sage: b1.twice_()
abc+abc
sage: b1.twice__()
abc+abc
sage: %timeit b1.twice_()
1000000 loops, best of 3: 759 ns per loop
sage: %timeit b1.twice__()
100000 loops, best of 3: 13.7 µs per loop
sage: b2 = Bar_int(5)
sage: b2.twice_()
int(10)
sage: b2.twice__()
int(10)
sage: %timeit b2.twice_()
1000000 loops, best of 3: 539 ns per loop
sage: %timeit b2.twice__()
100000 loops, best of 3: 9.2 µs per loop
Summary and further comments
We can reduce the code duplication to something like this:
<define templated function foo_ depending on a fused type Bar with specialisations Bar1, Bar2, ...>

cdef class Bar1:
    foo = foo_[Bar1]

cdef class Bar2:
    foo = foo_[Bar2]

...
I hope this level of copy-and-paste can be afforded. At least, it gives good speed.
In my tests I found the following:

- foo_ needs to be a cpdef function. Otherwise, the assignment foo = foo_ fails with the statement that foo_ can not be turned into a Python type.
- Writing foo_ = foo_ inside class Bar1 results in an error saying that Bar1 has no attribute foo_.
- twice_ and twice__ have been put into the class' __dict__. I did not succeed in turning them into the equivalent of a cdef method.

I did a lot of benchmark tests for the different implementations (using long* or char* or mpz_t to store sequences of bounded integers, using cdef functions that operate on fused types) and eventually found that using GMP (i.e., mpz_t) is fastest in all tests, by a wide margin.
Why did I not find this result before? Well, part of the reason is that GMP sometimes provides different functions to do the same job, and in the past few days I considerably improved my usage of the various possibilities. For example, mpz_t can be initialised in different ways. If one uses mpz_init, then the assignment that follows results in a re-allocation. Since the length of the concatenation of two tuples is known in advance, prescribing the number of to-be-allocated bits with mpz_init2 resulted in a speed-up of concatenation; it actually became twice as fast.
And this result is not surprising. After all, GMP has to implement the same bitshift and comparison operations for bit arrays that I was implementing in the "long*
" approach, too. But GMP probably uses implementation techniques that I can not even dream of.
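The GMP-based storage can be mimicked in pure Python, where arbitrary-precision ints play the role of mpz_t. The following sketch (class name and layout are my own, not the ticket's code) packs each entry into a fixed number of bits, so that concatenation, comparison and startswith each become a single big-integer operation:

```python
# Illustrative sketch of bit-packed bounded integer sequences,
# with Python ints standing in for GMP's mpz_t. Not the ticket's
# actual implementation; names and layout are hypothetical.

class PackedSeq:
    def __init__(self, bits, items):
        self.bits = bits              # bits per entry; entries must be < 2**bits
        self.length = len(items)
        self.data = 0
        for i, x in enumerate(items):
            # entry i occupies the bit range [i*bits, (i+1)*bits)
            self.data |= x << (i * bits)

    def concat(self, other):
        # shift the second factor past the first, then OR: one pass over the data
        out = PackedSeq(self.bits, [])
        out.length = self.length + other.length
        out.data = self.data | (other.data << (self.length * self.bits))
        return out

    def __eq__(self, other):
        # one big-integer comparison instead of an item-by-item loop
        return (self.bits, self.length, self.data) == \
               (other.bits, other.length, other.data)

    def startswith(self, other):
        # mask off the low bits and compare once
        if other.length > self.length:
            return False
        mask = (1 << (other.length * self.bits)) - 1
        return self.data & mask == other.data

    def __getitem__(self, i):
        # single-item access needs a shift proportional to the index
        return (self.data >> (i * self.bits)) & ((1 << self.bits) - 1)
```

This also makes plausible why item access and stepped slicing are the weak spots of such a representation: they cannot be expressed as one bulk operation on the packed data.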
Now, my plan is to
ctypedef struct bounded_int_tuple
that is based on mpz_t
and is the underlying C structure for tuples of bounded integersbounded_int_tuple
, for concatenation, copying, comparison and so onBoundedIntegerTuple
. This will probably not derive from Element
: I think it makes no difference from the point of view of memory efficiency whether each BoundedIntegerTuple
stores a pointer to the parent of "all tuples of integers bounded by B
" or directly stores B
as an unsigned int.In a subsequent ticket, I plan to use it for quiver paths, and if the combinat crowd cares, they can use it to make parts of sage.combinat.words a lot faster.
Comments? Requests? Suggestion of alternatives?
None from me. I keep being impressed at how much testing/researching you put into these quiver features... I really wonder what you will do with them in the end ;-)
Nathann
Branch: u/SimonKing/ticket/15820
I have pushed a branch with an initial implementation of bounded integer sequences. I took care of error handling: I think you will find that those GMP commands that may result in a memory error are wrapped in sig_on()/sig_off(), and the cdef boilerplate functions do propagate errors. In fact, in some of my tests, I was hitting Ctrl-C, and it worked.
The rest of this post is about benchmarks. I compare against Python tuples---which is of course ambitious, and you will find that tuples are not slower than bounded integer sequences in all situations. However, one should see it positively: the bounded integer sequences implemented here have features very similar to tuples, and there are operations for which they are a lot faster than tuples. It all depends on the length and the bound of the sequence. That's why I think that it is a reasonable contribution. And to make it faster (without Python classes), one can still use the cdef boilerplate functions.
Generally, it seems that long bounded integer sequences are faster than long tuples, but short tuples are faster than short sequences. For hash, bounded integer sequences are pretty good, but they suck at accessing items, or at slicing with a step different from 1.
In contrast to tuples, bounded integer sequences also provide fairly quick tests for whether a sequences starts with a given sequence, or whether a sequence contains a certain sub-sequence. This is a feature I took from strings.
Here are the tests, with different sizes and bounds. Variable names starting with "T" hold tuples, those starting with "S" hold bounded integer sequences.
sage: from sage.structure.bounded_integer_sequences import BoundedIntegerSequence
sage: L0 = [randint(0,7) for i in range(5)]
sage: L1 = [randint(0,15) for i in range(15)]
sage: L2 = [randint(0,31) for i in range(50)]
sage: L3 = [randint(0,31) for i in range(5000)]
sage: T0 = tuple(L0); T1 = tuple(L1); T2 = tuple(L2); T3 = tuple(L3)
sage: S0 = BoundedIntegerSequence(8, L0)
sage: S1 = BoundedIntegerSequence(16, L1)
sage: S2 = BoundedIntegerSequence(32, L2)
sage: S3 = BoundedIntegerSequence(32, L3)
Iteration
sage: timeit("x=[y for y in T0]", number=1000000)
1000000 loops, best of 3: 783 ns per loop
sage: timeit("x=[y for y in T1]", number=1000000)
1000000 loops, best of 3: 1.48 µs per loop
sage: timeit("x=[y for y in T2]", number=100000)
100000 loops, best of 3: 4.3 µs per loop
sage: timeit("x=[y for y in T3]", number=1000)
1000 loops, best of 3: 245 µs per loop
sage: timeit("x=[y for y in S0]", number=1000000)
1000000 loops, best of 3: 2.38 µs per loop
sage: timeit("x=[y for y in S1]", number=100000)
100000 loops, best of 3: 4.1 µs per loop
sage: timeit("x=[y for y in S2]", number=100000)
100000 loops, best of 3: 10.1 µs per loop
sage: timeit("x=[y for y in S3]", number=1000)
1000 loops, best of 3: 1.79 ms per loop
Slicing
Bounded integer sequences are immutable and hence copied by identity. But let us do slicing, dropping the last item:
sage: timeit("x=T3[:-1]", number=100000)
100000 loops, best of 3: 19.5 µs per loop
sage: timeit("x=S3[:-1]", number=100000)
100000 loops, best of 3: 3.48 µs per loop
Slicing with step!=1
is much slower, though:
sage: timeit("x=T3[:-1:2]", number=100000)
100000 loops, best of 3: 11.7 µs per loop
sage: timeit("x=S3[:-1:2]", number=1000)
1000 loops, best of 3: 2.23 ms per loop
Perhaps I should mark it "TODO"...
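A plausible explanation for the gap, sketched in Python with an int standing in for the GMP data (my own illustration, not the ticket's code): a step-1 slice is a single shift-and-mask on the packed data, while a slice with step != 1 must extract and repack every entry separately.

```python
# Why step == 1 slicing is cheap and step != 1 is not, for sequences
# packed into one big integer (illustrative sketch, hypothetical names).

def slice_contiguous(data, start, stop, bits):
    # one shift and one mask, regardless of how many entries are kept
    mask = (1 << ((stop - start) * bits)) - 1
    return (data >> (start * bits)) & mask

def slice_stepped(data, start, stop, step, bits):
    # every kept entry is extracted and re-packed individually
    out = 0
    item_mask = (1 << bits) - 1
    for j, i in enumerate(range(start, stop, step)):
        out |= ((data >> (i * bits)) & item_mask) << (j * bits)
    return out
```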
Accessing single items
Short sequences
sage: timeit("x=T0[1]", number=1000000)
1000000 loops, best of 3: 361 ns per loop
sage: timeit("x=T0[4]", number=1000000)
1000000 loops, best of 3: 373 ns per loop
sage: timeit("x=S0[1]", number=1000000)
1000000 loops, best of 3: 959 ns per loop
sage: timeit("x=S0[4]", number=1000000)
1000000 loops, best of 3: 960 ns per loop
Large sequences (time also depends on the bound for the integer sequence!)
sage: timeit("x=T3[1]", number=1000000)
1000000 loops, best of 3: 359 ns per loop
sage: timeit("x=T3[4500]", number=1000000)
1000000 loops, best of 3: 382 ns per loop
sage: timeit("x=S3[1]", number=1000000)
1000000 loops, best of 3: 1.97 µs per loop
sage: timeit("x=S3[4500]", number=1000000)
1000000 loops, best of 3: 1.49 µs per loop
Comparison
Note that comparison of bounded integer sequences works differently from comparison of tuples or lists, as detailed in the documentation.
We compare sequences that are equal but non-identical, that differ in early items, or that differ in late items.
sage: L0x = [randint(0,7) for i in range(5)]
sage: L1x = [randint(0,15) for i in range(15)]
sage: L2x = [randint(0,31) for i in range(50)]
sage: L3x = [randint(0,31) for i in range(5000)] # verified that they differ from L0,L1,L2,L3
sage: T0x = tuple(L0x); T1x = tuple(L1x); T2x = tuple(L2x); T3x = tuple(L3x)
sage: S0x = BoundedIntegerSequence(8, L0x)
sage: S1x = BoundedIntegerSequence(16, L1x)
sage: S2x = BoundedIntegerSequence(32, L2x)
sage: S3x = BoundedIntegerSequence(32, L3x)
sage: T0y = tuple(L0); T1y = tuple(L1); T2y = tuple(L2); T3y = tuple(L3)
sage: S0y = BoundedIntegerSequence(8, L0)
sage: S1y = BoundedIntegerSequence(16, L1)
sage: S2y = BoundedIntegerSequence(32, L2)
sage: S3y = BoundedIntegerSequence(32, L3)
sage: S1y==S1!=S1x, S1y is not S1
(True, True)
Early differences:
sage: timeit("T0==T0x", number=1000000)
1000000 loops, best of 3: 143 ns per loop
sage: timeit("T1==T1x", number=1000000)
1000000 loops, best of 3: 145 ns per loop
sage: timeit("T2==T2x", number=1000000)
1000000 loops, best of 3: 143 ns per loop
sage: timeit("T3==T3x", number=1000000)
1000000 loops, best of 3: 161 ns per loop
sage: timeit("S0==S0x", number=1000000)
1000000 loops, best of 3: 538 ns per loop
sage: timeit("S1==S1x", number=1000000)
1000000 loops, best of 3: 550 ns per loop
sage: timeit("S2==S2x", number=1000000)
1000000 loops, best of 3: 490 ns per loop
sage: timeit("S3==S3x", number=1000000)
1000000 loops, best of 3: 559 ns per loop
Equal sequences:
sage: timeit("T0==T0y", number=1000000)
1000000 loops, best of 3: 169 ns per loop
sage: timeit("T1==T1y", number=1000000)
1000000 loops, best of 3: 255 ns per loop
sage: timeit("T2==T2y", number=1000000)
1000000 loops, best of 3: 597 ns per loop
sage: timeit("T3==T3y", number=100000)
100000 loops, best of 3: 47.8 µs per loop
sage: timeit("S0==S0y", number=1000000)
1000000 loops, best of 3: 511 ns per loop
sage: timeit("S1==S1y", number=1000000)
1000000 loops, best of 3: 493 ns per loop
sage: timeit("S2==S2y", number=1000000)
1000000 loops, best of 3: 583 ns per loop
sage: timeit("S3==S3y", number=1000000)
1000000 loops, best of 3: 1.41 µs per loop
Late differences:
sage: T0z1 = T0+T0
sage: T0z2 = T0+T0x
sage: T1z1 = T1+T1
sage: T1z2 = T1+T1x
sage: T2z1 = T2+T2
sage: T2z2 = T2+T2x
sage: T3z1 = T3+T3
sage: T3z2 = T3+T3x
sage: timeit("T0z1==T0z2", number=100000)
100000 loops, best of 3: 206 ns per loop
sage: timeit("T1z1==T1z2", number=100000)
100000 loops, best of 3: 308 ns per loop
sage: timeit("T2z1==T2z2", number=100000)
100000 loops, best of 3: 640 ns per loop
sage: timeit("T3z1==T3z2", number=100000)
100000 loops, best of 3: 47.8 µs per loop
sage: S0z1 = S0+S0
sage: S0z2 = S0+S0x
sage: S1z1 = S1+S1
sage: S1z2 = S1+S1x
sage: S2z1 = S2+S2
sage: S2z2 = S2+S2x
sage: S3z1 = S3+S3
sage: S3z2 = S3+S3x
sage: timeit("S0z1==S0z2", number=100000)
100000 loops, best of 3: 585 ns per loop
sage: timeit("S1z1==S1z2", number=100000)
100000 loops, best of 3: 494 ns per loop
sage: timeit("S2z1==S2z2", number=100000)
100000 loops, best of 3: 555 ns per loop
sage: timeit("S3z1==S3z2", number=100000)
100000 loops, best of 3: 578 ns per loop
Hash
This is the only operation for which bounded integer sequences seem to be consistently faster than tuples:
sage: timeit("hash(T0)", number=100000)
100000 loops, best of 3: 179 ns per loop
sage: timeit("hash(T0)", number=1000000)
1000000 loops, best of 3: 177 ns per loop
sage: timeit("hash(T1)", number=1000000)
1000000 loops, best of 3: 267 ns per loop
sage: timeit("hash(T2)", number=1000000)
1000000 loops, best of 3: 584 ns per loop
sage: timeit("hash(T3)", number=100000)
100000 loops, best of 3: 45.3 µs per loop
sage: timeit("hash(S0)", number=1000000)
1000000 loops, best of 3: 145 ns per loop
sage: timeit("hash(S1)", number=1000000)
1000000 loops, best of 3: 153 ns per loop
sage: timeit("hash(S2)", number=1000000)
1000000 loops, best of 3: 194 ns per loop
sage: timeit("hash(S3)", number=1000000)
1000000 loops, best of 3: 8.97 µs per loop
Concatenation
sage: timeit("T0+T0", number=1000000)
1000000 loops, best of 3: 200 ns per loop
sage: timeit("T1+T1", number=1000000)
1000000 loops, best of 3: 311 ns per loop
sage: timeit("T2+T2", number=1000000)
1000000 loops, best of 3: 742 ns per loop
sage: timeit("T3+T3", number=10000)
10000 loops, best of 3: 40.3 µs per loop
sage: timeit("S0+S0", number=1000000)
1000000 loops, best of 3: 576 ns per loop
sage: timeit("S1+S1", number=1000000)
1000000 loops, best of 3: 590 ns per loop
sage: timeit("S2+S2", number=1000000)
1000000 loops, best of 3: 618 ns per loop
sage: timeit("S3+S3", number=1000000)
1000000 loops, best of 3: 2.17 µs per loop
Subsequences
Recall the definition of S0z1
etc. We find:
sage: S0z2.startswith(S0)
True
sage: S0z2.startswith(S0x)
False
sage: S0x in S0z2
True
sage: timeit("S0z2.startswith(S0)", number=1000000)
1000000 loops, best of 3: 239 ns per loop
sage: timeit("S0z2.startswith(S0x)", number=1000000)
1000000 loops, best of 3: 241 ns per loop
sage: timeit("S0x in S0z2", number=1000000)
1000000 loops, best of 3: 694 ns per loop
sage: timeit("S1z2.startswith(S1)", number=1000000)
1000000 loops, best of 3: 227 ns per loop
sage: timeit("S1z2.startswith(S1x)", number=1000000)
1000000 loops, best of 3: 223 ns per loop
sage: timeit("S1x in S1z2", number=1000000)
1000000 loops, best of 3: 1.08 µs per loop
sage: timeit("S2z2.startswith(S2)", number=1000000)
1000000 loops, best of 3: 247 ns per loop
sage: timeit("S2z2.startswith(S2x)", number=1000000)
1000000 loops, best of 3: 230 ns per loop
sage: timeit("S2x in S2z2", number=1000000)
1000000 loops, best of 3: 3.21 µs per loop
sage: timeit("S3z2.startswith(S3)", number=1000000)
1000000 loops, best of 3: 989 ns per loop
sage: timeit("S3z2.startswith(S3x)", number=1000000)
1000000 loops, best of 3: 218 ns per loop
sage: timeit("S3x in S3z2", number=1000)
1000 loops, best of 3: 3.57 ms per loop
I wonder if the latter could be improved. Another "TODO"...
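The cost is plausible: with bit-packed storage, a containment test has to compare the pattern against every possible offset, and each candidate requires shifting the large operand. A rough Python sketch of that search (my own rendering; the branch's code may differ):

```python
# Rough sketch of subsequence search in bit-packed storage: O(n) candidate
# offsets, each needing a big-integer shift that itself costs O(n).
# Hypothetical rendering, not the code from the branch.

def contains(hay_data, hay_len, needle_data, needle_len, bits):
    mask = (1 << (needle_len * bits)) - 1
    for offset in range(hay_len - needle_len + 1):
        # one length-proportional shift and mask per offset
        if (hay_data >> (offset * bits)) & mask == needle_data:
            return True
    return False
```

String-search techniques that skip candidate offsets (as CPython's string containment does) could presumably reduce the number of shifts, which may be what a "TODO" here would address.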
New commits:
5dc78c5 | Implement sequences of bounded integers |
#12630 has introduced tools to compute with quiver algebras and quiver representations. It is all Python code, paths being internally stored as lists of labelled edges of digraphs.
The aim of this ticket is to provide an efficient implementation of lists of bounded integers, with concatenation. This can be used to implement paths of a quiver more efficiently, but can also have different applications (say, elements of free groups, or other applications in sage-combinat).
Depends on #17195 Depends on #17196
CC: @sagetrac-JStarx @jhpalmieri @williamstein @stumpc5 @saliola @simon-king-jena @sagetrac-gmoose05 @nthiery @avirmaux @nathanncohen @hivert
Component: algebra
Keywords: sequence bounded integer
Author: Simon King, Jeroen Demeyer
Branch:
4e9e1c5
Reviewer: Jeroen Demeyer, Simon King
Issue created by migration from https://trac.sagemath.org/ticket/15820