sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Implement sequences of bounded integers #15820

Closed simon-king-jena closed 9 years ago

simon-king-jena commented 10 years ago

#12630 has introduced tools to compute with quiver algebras and quiver representations. It is all Python code, paths being internally stored by lists of labelled edges of digraphs.

The aim of this ticket is to provide an efficient implementation of lists of bounded integers, with concatenation. This can be used to implement paths of a quiver more efficiently, but can also have different applications (say, elements of free groups, or other applications in sage-combinat).

Depends on #17195
Depends on #17196

CC: @sagetrac-JStarx @jhpalmieri @williamstein @stumpc5 @saliola @simon-king-jena @sagetrac-gmoose05 @nthiery @avirmaux @nathanncohen @hivert

Component: algebra

Keywords: sequence bounded integer

Author: Simon King, Jeroen Demeyer

Branch: 4e9e1c5

Reviewer: Jeroen Demeyer, Simon King

Issue created by migration from https://trac.sagemath.org/ticket/15820

simon-king-jena commented 10 years ago
comment:1

There are various possible more or less obvious data formats:

  1. Store a path as char* or unsigned int*, hence, as a list of integers, each integer corresponding to one edge of the quiver. char* would bound the total number of edges of the quiver to 256 (which might be too small), whereas unsigned int* might consume too much memory.
  2. Store a path as char*, each integer corresponding to an outgoing edge of a vertex. Hence, the total number of edges may be arbitrary, we only have a local bound of 128 on the number of outgoing edges---this hopefully is enough (it certainly is enough for me).
  3. As in 1., but the representation is compressed: If we have less than 128 edges, then one byte can represent a pair of edges, and a single 32- or 64-bit int could store even longer sequences of edges.
  4. As in 2., but the representation is compressed: If each vertex has less than 8 outgoing edges, then a byte can represent a pair and a single 32-bit word can represent a 10-tuple of edges (if I did not miscalculate).

In 2. and 4. (and probably in 1. and 3. as well) we should explicitly store the start and end point of the path. The length should probably be stored as well.

In my applications, the following operations on paths need to be fast: concatenation, hash/comparison, testing whether a path p starts with a path q, and, less importantly, whether a path p contains a path q.

Concatenation is particularly easy in 1. and 2. (just concatenation of lists, which is achieved by memcpy). It is a bit more difficult in 3. and 4., where the bytes of the second factor need to be shifted to make the compressed data seamless.

Hash and comparison are easy in all four formats, but of course best speed would be obtained in 4., simply since the amount of data that needs to be processed is smaller than in 1.-3.

In all four proposed data formats, it is easy to test whether path p starts with path q. Again, having the data stored in as little memory as possible will speed things up.

It is easy in 1. to test whether path p contains path q (as a subpath that is not necessarily an initial segment), whereas this becomes a bit more difficult in 2.-4.

What format would you suggest? Perhaps it makes sense to have some initial implementations of all four, and then compare directly.
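
For illustration (this is not the attachment's code; all names are made up), here is what the packing in 4. amounts to in plain Python, assuming fewer than 8 outgoing edges per vertex, i.e. 3 bits per edge in a 32-bit word:

# Hypothetical sketch of format 4: pack integers < 8 (3 bits each) into a
# single 32-bit word, 10 entries per word, 2 bits left unused.
BITS_PER_ENTRY = 3
ENTRIES_PER_WORD = 32 // BITS_PER_ENTRY   # = 10
MASK = (1 << BITS_PER_ENTRY) - 1          # = 0b111

def pack(entries):
    # pack up to 10 integers in range(8) into one 32-bit word
    assert len(entries) <= ENTRIES_PER_WORD and all(0 <= e < 8 for e in entries)
    word = 0
    for i, e in enumerate(entries):
        word |= e << (i * BITS_PER_ENTRY)
    return word

def unpack(word, length):
    # recover the packed entries from a word
    return [(word >> (i * BITS_PER_ENTRY)) & MASK for i in range(length)]

assert unpack(pack([1, 2, 3, 4, 5]), 5) == [1, 2, 3, 4, 5]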

nthiery commented 10 years ago
comment:2

I'd say:

  • Some implementation without limitations (the one from the previous ticket I guess)
  • Whichever of 1-4 that covers your needs (256 edges: for Gröbner basis calculations this should already be quite big, right?)
  • As much as possible of the code shared between all implementations to make it easy to add new ones if deemed useful.

But that's just my 2 cents ... I don't have specific experience!

Cheers,

simon-king-jena commented 10 years ago
comment:3

Hi Nicolas,

Replying to @nthiery:

  • Some implementation without limitations (the one from the previous ticket I guess)

An implementation which represents edges by unsigned long long can be considered "unlimited".

  • Whichever of 1-4 that covers your needs (256 edges: for Gröbner basis calculations this should already be quite big, right?)

Yes, but I want a general code.

  • As much as possible of the code shared between all implementations to make it easy to add new ones if deemed useful.

Actually, on the level of paths, the implementations come in pairs: The compressed representations differ from the non-compressed---at the end of the day, we just have different ways of representing sequences of integers. Only when it comes to interpreting a sequence of integers as a path in a quiver will we have further differences: Does "[1,2]" mean "start with arrow 1 of the quiver and continue with arrow 2 of the quiver", or does it mean "travel along the outgoing arrow 1 of the vertex we are currently located at, and then continue with outgoing arrow 2 of the vertex we just arrived at".
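
To make the two readings of "[1,2]" concrete, here is a small made-up example in plain Python (the quiver and the helper function are purely illustrative, with arrows numbered from 0):

# Hypothetical quiver: arrows, globally numbered: 0: a: 1->2, 1: b: 2->2, 2: c: 2->2, 3: d: 2->1
global_arrows = ["a", "b", "c", "d"]
# Outgoing arrows per vertex, locally numbered, as (label, target) pairs:
outgoing = {1: [("a", 2)], 2: [("b", 2), ("c", 2), ("d", 1)]}

# Global reading: [1, 2] means "arrow 1 of the quiver, then arrow 2", i.e. b, then c.
assert [global_arrows[i] for i in [1, 2]] == ["b", "c"]

# Local reading: starting at vertex 2, [1, 2] means "outgoing arrow 1 at the
# current vertex, then outgoing arrow 2 at the vertex we arrive at".
def walk(start, indices):
    v, labels = start, []
    for i in indices:
        label, v = outgoing[v][i]
        labels.append(label)
    return labels

assert walk(2, [1, 2]) == ["c", "d"]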

Currently, I prefer a compressed representation by unsigned long long*. Rationale:

A bit later I will post experimental code (not yet talking about quiver paths but about sequences of integers) giving some evidence on efficiency. And then we can still see how we interpret the integer sequences: Having a global or a local list of arrows?

simon-king-jena commented 10 years ago
comment:4

I have attached a toy (or proof-of-concept) implementation; see attachment: path_test.pyx. In a real implementation in compressed representation, it would be the job of a PathSemigroup to find out how many edges fit into one word. And of course, in the toy implementation I am not trying to translate an integer sequence into the string representation of an element of the path semigroup.

What implementation would you suggest to use in "reality"? Is there another implementation of "sequences of bounded integers" that you would recommend? And would you enumerate the arrows of a quiver locally (i.e., only number the outgoing edges at each vertex) or globally? The former will result in better compression, but would make it more difficult to test for paths p,q whether there are paths r,s with p==r*q*s.

Now for the timings, and sorry for the long post...

Here, I am timing concatenation, hash, comparison, comparing initial segments. In all examples, we use the following paths of various sizes.

p5 = Path(ZZ, 5, range(5), 1, 1)
p25 = Path(ZZ, 25, range(25), 1, 1)
p50 = Path(ZZ, 50, range(50), 1, 1)
p100 = Path(ZZ, 100, range(100), 1, 1)
p1000 = Path(ZZ, 1000, range(1000), 1, 1)
q5 = Path(ZZ, 5, range(1,6), 1, 1)
q25 = Path(ZZ, 25, range(1,26), 1, 1)
q50 = Path(ZZ, 50, range(1,51), 1, 1)
q100 = Path(ZZ, 100, range(1,101), 1, 1)
q1000 = Path(ZZ, 1000, range(1,1001), 1, 1)
r5 = Path(ZZ, 5, range(1,5)+[0], 1, 1)
r25 = Path(ZZ, 25, range(1,25)+[0], 1, 1)
r50 = Path(ZZ, 50, range(1,50)+[0], 1, 1)
r100 = Path(ZZ, 100, range(1,100)+[0], 1, 1)
r1000 = Path(ZZ, 1000, range(1,1000)+[0], 1, 1)
s5 = Path(ZZ, 5, range(5), 1, 1)
s25 = Path(ZZ, 25, range(25), 1, 1)
s50 = Path(ZZ, 50, range(50), 1, 1)
s100 = Path(ZZ, 100, range(100), 1, 1)
s1000 = Path(ZZ, 1000, range(1000), 1, 1)

We are doing the following tests.

Concatenation:

%timeit p5*q5
%timeit p5*q25
%timeit p25*q25
%timeit p25*q50
%timeit p50*q50
%timeit p50*q100
%timeit p100*q100
%timeit p100*q1000
%timeit p1000*q1000

Hash:

%timeit hash(p5)
%timeit hash(p25)
%timeit hash(p50)
%timeit hash(p100)
%timeit hash(p1000)

Comparison:

  1. equal
%timeit p5==s5
%timeit p25==s25
%timeit p50==s50
%timeit p100==s100
%timeit p1000==s1000
  2. obviously different (first item differs)
%timeit p5==r5
%timeit p25==r25
%timeit p50==r50
%timeit p100==r100
%timeit p1000==r1000
  3. less obviously different (only the last item differs)
%timeit q5==r5
%timeit q25==r25
%timeit q50==r50
%timeit q100==r100
%timeit q1000==r1000

Startswith:

%timeit q1000.startswith(q100)  # it does start with
%timeit q1000.startswith(r100)  # it doesn't start with

Uncompressed integer lists

Here, we define Path=Path_v1 (see attachment: path_test.pyx).

Using ctypedef unsigned long block

sage: %timeit p5*q5
1000000 loops, best of 3: 739 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 801 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 810 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 885 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 959 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.02 us per loop
sage: %timeit p100*q1000
100000 loops, best of 3: 1.99 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 3.06 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 135 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 166 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 214 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 333 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.09 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.63 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 5.91 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 11 us per loop
sage: %timeit p100==s100
10000 loops, best of 3: 23 us per loop
sage: %timeit p1000==s1000
1000 loops, best of 3: 212 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 634 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 659 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 659 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 686 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 698 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.65 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 5.83 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 11.5 us per loop
sage: %timeit q100==r100
10000 loops, best of 3: 22 us per loop
sage: %timeit q1000==r1000
1000 loops, best of 3: 213 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 325 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 311 ns per loop

Using ctypedef unsigned short block

sage: %timeit p5*q5
1000000 loops, best of 3: 700 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 738 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 820 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 808 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 877 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 909 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.49 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 2.13 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 138 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 179 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 236 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 383 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.61 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.08 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 3.42 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 6.53 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 14.9 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 137 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 613 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 591 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 602 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 603 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 635 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.15 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.63 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.45 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.8 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 140 us per loop
sage: q1000.startswith(q100)
True
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 336 ns per loop
sage: q1000.startswith(r100)
False
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 325 ns per loop

Using ctypedef unsigned char block

sage: %timeit p5*q5
1000000 loops, best of 3: 679 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 725 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 760 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 772 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 761 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 790 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.1 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.31 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 133 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 176 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 234 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 374 ns per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 2.53 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 1.08 us per loop
sage: %timeit p25==s25
100000 loops, best of 3: 3.36 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 6.3 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 12.5 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 121 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 581 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 589 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 636 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.06 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.46 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.51 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.7 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 121 us per loop
sage: q1000.startswith(q100)
True
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 324 ns per loop
sage: %timeit q1000.startswith(r100) 
1000000 loops, best of 3: 315 ns per loop

Compressed integer lists

Here, we define Path=Path_v2 (see attachment: path_test.pyx), and do

sage: SizeManager.set_edge_number(27)

so that there is a total of only 27 edges (resp. 27 is the maximal number of outgoing arrows at a vertex). In particular, all integers in the sequences below are taken "mod 27". Since 27 needs 5 bits, a compressed representation using unsigned char* won't make sense. Hence, I am only testing long and short.

Using ctypedef unsigned long block

sage: %timeit p5*q5
1000000 loops, best of 3: 676 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 671 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 683 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 725 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 759 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 787 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 811 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.44 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.61 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 132 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 134 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 142 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 155 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 466 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 710 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.71 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.6 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.4 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 38 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 692 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 715 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 724 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 729 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 752 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 706 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.69 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.59 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 4.4 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 37.6 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 272 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 257 ns per loop

Using ctypedef unsigned short block

sage: %timeit p5*q5
1000000 loops, best of 3: 718 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 682 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 710 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 740 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 722 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 781 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 858 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.98 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 2.17 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 132 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 168 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 206 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 951 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 723 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.77 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.93 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 5.32 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 47.5 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 577 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 586 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 599 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 588 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 607 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 754 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.87 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.89 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 5.28 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 47.3 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 272 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 256 ns per loop
simon-king-jena commented 10 years ago
comment:5

It turns out that there are bugs in the toy code. But that's no surprise...

simon-king-jena commented 10 years ago
comment:6

I have fixed the problems in attachment: path_test.pyx.

It was:

  1. bitshift by a wrong offset, and
  2. the bits that aren't used (e.g., the last 4 bits of a uint64_t when storing 12 arrows of size 5 bits) should be kept blank.

The latter involves an additional bitwise "and". So, probably the timings will slightly deteriorate.

simon-king-jena commented 10 years ago
comment:7

There is one other variation that I want to test.

Currently, if the number of edges fits into 5 bits, then a 64-bit word is filled with the data of 12 edges, plus 4 bits of garbage. The garbage must always be zero bits, for otherwise the shift operations would pollute real data with the garbage bits. Keeping the garbage zero involves additional operations.

Alternatively, one could try to put fewer arrows into one word, so that there are no garbage bits. This could be done by encoding each arrow by a number of bits that is a power of 2. Hence, in the setting above, one would encode each edge by 8 rather than 5 bits, fitting 8 arrows into one 64-bit word. It is less dense; however, the code gets simpler and should have less overhead.
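
Here is a back-of-the-envelope comparison of the two layouts for a 64-bit word, in Python (assuming 27 arrows, hence 5 bits per arrow in the dense layout and 8 bits in the aligned one):

# Rough capacity comparison of the two layouts in a 64-bit word, assuming
# 27 distinct arrows (so each arrow needs 5 bits in the dense layout).
word_size = 64
dense_bits = 5        # smallest chunk that holds values < 27
aligned_bits = 8      # next power of 2

dense_per_word = word_size // dense_bits                  # 12 arrows per word
garbage_bits = word_size - dense_per_word * dense_bits    # 4 unused bits
aligned_per_word = word_size // aligned_bits              # 8 arrows, no garbage

assert (dense_per_word, garbage_bits, aligned_per_word) == (12, 4, 8)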

simon-king-jena commented 10 years ago
comment:8

I have replaced attachment: path_test.pyx again, adding two more implementations.

Do you have further suggestions on storage of lists of bounded integers? If you haven't, then I'll try to assess the timings stated in the various comments, according to what I expect to be useful in my applications.

Here are details on the two new implementations:

In Path_v3, I use Sage's Integer class as a bit list. Sounds crazy, but it works rather well. I assume that bitshifts and comparison (and that's what we are using here) are internally written in assembler, and for us, the additional benefit is that we do not need to take care of the size of blocks (8 bit? 32 bit? 64 bit? 128 bit?) in which we store stuff---Integer does it for us. Perhaps the timings could be improved further by using the underlying C library (GMP's mpz_t) directly.
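
The idea behind Path_v3 can be sketched in plain Python with arbitrary-precision ints (the actual encoding in the attachment may differ in details such as length handling, so this is only an illustration):

# Sketch: encode a bounded integer sequence in one big integer, assuming
# 5 bits per entry; concatenation then becomes a shift followed by an "or".
BITS = 5

def encode(entries):
    n = 0
    for i, e in enumerate(entries):
        n |= e << (i * BITS)
    return n, len(entries)

def concat(a, b):
    (na, la), (nb, lb) = a, b
    return (na | (nb << (la * BITS)), la + lb)

def decode(seq):
    n, length = seq
    return [(n >> (i * BITS)) & ((1 << BITS) - 1) for i in range(length)]

p = encode([1, 2, 3]); q = encode([4, 5])
assert decode(concat(p, q)) == [1, 2, 3, 4, 5]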

In Path_v4, I provide a slightly simplified version of Path_v2: The data is compressed, but in such a way that the chunks used to store one arrow fit seamlessly into a memory block. Hence, when we use 64-bit blocks and have 27 arrows, Path_v2 would store the arrows in chunks of 5 bits (hence, it can store 12 arrows in one memory block, leaving 4 bits empty), whereas Path_v4 would store the arrows in chunks of 8 bits (hence, it can only store 8 arrows in one memory block, but since this fits neatly into one memory block, the code can be simplified).

The timings for Path_v3:

sage: %timeit p5*q5
1000000 loops, best of 3: 645 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 631 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 635 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 1.23 us per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 1.82 us per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 1.83 us per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.37 us per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.82 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.78 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 164 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 197 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 210 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 274 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 1.94 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 667 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 638 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 658 ns per loop
sage: %timeit p1000==s1000
1000000 loops, best of 3: 817 ns per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 746 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 723 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 769 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 731 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 733 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 710 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 733 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 718 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 719 ns per loop
sage: %timeit q1000==r1000
1000000 loops, best of 3: 755 ns per loop
sage: %timeit q1000.startswith(q100)
100000 loops, best of 3: 2.46 us per loop
sage: %timeit q1000.startswith(r100)
100000 loops, best of 3: 2.45 us per loop

The timings for Path_v4:

sage: %timeit p5*q5
1000000 loops, best of 3: 730 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 762 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 806 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 849 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 854 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 931 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 930 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.97 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.45 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 136 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 159 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 182 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 600 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 700 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.65 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.48 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.21 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 36.4 us per loop
sage: %timeit p5==r5
100000 loops, best of 3: 706 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 764 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 740 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 742 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 822 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 693 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 1.6 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 2.54 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 4.27 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 37.1 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 255 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 243 ns per loop
simon-king-jena commented 10 years ago
comment:9

Hmmm. Perhaps I should have a look at sage/misc/bitset.pxi.

simon-king-jena commented 10 years ago
comment:10

I couldn't directly use sage/misc/bitset, but I could learn from it: I have improved the Path_v4 implementation (here, the arrows will be encoded by chunks of memory whose size is a power of 2). This was possible by using memcmp in some places (not in the method startswith(), where a loop over the data is faster than memcmp) and, in particular, by using bitshift operations to replace multiplications and integer divisions (this is only possible with Path_v4, since here the factors are powers of 2 and can thus correspond to shifts).

New timings with Path=Path_v4 and SizeManager.set_edge_number(27):

sage: %timeit p5*q5
1000000 loops, best of 3: 684 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 726 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 798 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 862 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 827 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 924 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 915 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.91 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.42 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 148 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 158 ns per loop
sage: %timeit hash(p50)
10000000 loops, best of 3: 171 ns per loop
sage: %timeit hash(p100)
10000000 loops, best of 3: 194 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 594 ns per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 498 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 547 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 631 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 712 ns per loop
sage: %timeit p1000==s1000
100000 loops, best of 3: 2.42 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 499 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 514 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 549 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 526 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 545 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 506 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 544 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 589 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 704 ns per loop
sage: %timeit q1000==r1000
100000 loops, best of 3: 2.48 us per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 263 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 232 ns per loop
simon-king-jena commented 10 years ago
comment:11

PS: I chose 27 edges, since this should be particularly bad for Path_v4. Indeed, in Path_v2, 12 arrows can be put into one uint64_t, whereas in Path_v4 only 8 arrows fit into one uint64_t. So, if Path_v4 turns out to be faster than Path_v2 with 27 edges, then it is likely to be generally faster, I think.

simon-king-jena commented 10 years ago
comment:12

From my side, this is the last part of the proof-of-concept implementations, before assessing the benchmark tests and choosing an implementation for the elements of path semigroups...

In Path_v5, I am using the GMP library directly, thus avoiding some overhead compared with Path_v3. The timings I get (with Path=Path_v5 and SizeManager.set_edge_number(27)):

sage: %timeit p5*q5
1000000 loops, best of 3: 720 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 814 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 823 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 808 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 828 ns per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 890 ns per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 866 ns per loop
sage: %timeit p100*q1000
1000000 loops, best of 3: 1.18 us per loop
sage: %timeit p1000*q1000
1000000 loops, best of 3: 1.38 us per loop
sage: %timeit hash(p5)
10000000 loops, best of 3: 159 ns per loop
sage: %timeit hash(p25)
10000000 loops, best of 3: 171 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 198 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 266 ns per loop
sage: %timeit hash(p1000)
1000000 loops, best of 3: 1.93 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 466 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 509 ns per loop
sage: %timeit p50==s50
1000000 loops, best of 3: 495 ns per loop
sage: %timeit p100==s100
1000000 loops, best of 3: 485 ns per loop
sage: %timeit p1000==s1000
1000000 loops, best of 3: 693 ns per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 451 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 468 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 463 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 468 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 482 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 448 ns per loop
sage: %timeit q25==r25
1000000 loops, best of 3: 462 ns per loop
sage: %timeit q50==r50
1000000 loops, best of 3: 448 ns per loop
sage: %timeit q100==r100
1000000 loops, best of 3: 457 ns per loop
sage: %timeit q1000==r1000
1000000 loops, best of 3: 470 ns per loop
sage: %timeit q1000.startswith(q100)
1000000 loops, best of 3: 673 ns per loop
sage: %timeit q1000.startswith(r100)
1000000 loops, best of 3: 657 ns per loop
6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:13

Hmmmmmmm... Well, looks like the last one is the best, even though it is a bit slow for hashing. If the size of a block is a power of two it should be quite good, don't you think ?

This being said, perhaps it should be implemented like the bitset stuff, in an algebra-free file, to store sequences of integers as you said.

How do you define your hashing function and your comparison function, by the way ?

Nathann

simon-king-jena commented 10 years ago
comment:14

Replying to @nathanncohen:

Hmmmmmmm... Well, looks like the last one is the best, even though it is a bit slow for hashing.

What do you mean by "the last one"? Using GMP's mpz_t directly, which is Path_v5 in the attachment? Or the improved "semicompressed" version of uint64_t* (storing chunks of size 2^n rather than the smallest possible chunk size), which is Path_v4?

If the size of a block is a power of two it should be quite good, don't you think ?

Should be. My own summary of the timings above would be this. We have 6 benchmarks, and in the following list I am giving for each benchmark a small, a medium and a large example. In the end, + or - indicate whether the respective implementation did well/poorly in the three examples.

  1. Concatenation
                     5x5   50x50  1000x1000
Uncompressed long : 739 ns  885 ns 3.06 us  - -
Uncompressed short: 700 ns  808 ns 2.13 us
Uncompressed char : 679 ns  772 ns 1.31 us    +
Compressed   long : 676 ns  759 ns 1.61 us
Compressed   short: 718 ns  722 ns 1.98 us   +
Semicompr.   long : 684 ns  827 ns 1.42 us
Integer           : 645 ns 1.82 us 1.78 us  +-
GMP          mpz_t: 723 ns  799 ns 1.41 us
  2. Hash
                     5      100     1000
Uncompressed long : 135 ns 333 ns  2.09 us  +
Uncompressed short: 138 ns 383 ns  2.61 us  +--
Uncompressed char : 133 ns 374 ns  2.53 us  +
Compressed   long : 132 ns 155 ns   466 ns  +++
Compressed   short: 132 ns 206 ns   951 ns  +
Semicompr.   long : 148 ns 194 ns   594 ns
Integer           : 164 ns 274 ns  1.94 us  -
GMP          mpz_t: 159 ns 266 ns  1.93 us
  3. == for equal paths
                     5       100     1000
Uncompressed long : 1.63 us 23   us  212   us ---
Uncompressed short: 1.08 us 14.9 us  137   us
Uncompressed char : 1.08 us 12.5 us  121   us
Compressed   long :  710 ns  4.4 us   38   us
Compressed   short:  723 ns  5.3 us   47.5 us
Semicompr.   long :  498 ns  712 ns    2.4 us
Integer           :  667 ns  658 ns    817 ns
GMP          mpz_t:  466 ns  485 ns    693 ns +++
  1. == for "easily" unequal paths
                     5       100     1000
Uncompressed long : 634 ns   686 ns   698 ns
Uncompressed short: 613 ns   603 ns   635 ns
Uncompressed char : 581 ns   599 ns   636 ns
Compressed   long : 692 ns   729 ns   752 ns  --
Compressed   short: 577 ns   588 ns   607 ns
Semicompr.   long : 499 ns   526 ns   545 ns
Integer           : 746 ns   731 ns   733 ns --
GMP          mpz_t: 451 ns   468 ns   482 ns +++
  1. == for "difficult" unequal paths
                     5       100      1000
Uncompressed long : 1.65 us   22 us    213 us   ---
Uncompressed short: 1.15 us   12.8 us  140 us
Uncompressed char : 1.06 us   12.7 us  121 us
Compressed   long :  706 ns    4.4 us   37.6 us
Compressed   short:  754 ns    5.3 us   47.3 us
Semicompr.   long :  506 ns    704 ns   2.5 us
Integer           :  710 ns    719 ns   755 ns
GMP          mpz_t:  448 ns    457 ns   470 ns  +++
  6. startswith (here we only have one example size, one where the answer is True, and one where the answer is False)
                     yes      no
Uncompressed long :  325 ns  311 ns
Uncompressed short:  336 ns  325 ns
Uncompressed char :  324 ns  315 ns
Compressed   long :  272 ns  257 ns  +
Compressed   short:  272 ns  256 ns  +
Semicompr.   long :  263 ns  232 ns  ++
Integer           : 2.46 us 2.45 us  --
GMP          mpz_t:  673 ns  657 ns

So, I don't think there is a clear winner in the competition.

This being said, perhaps it should be implemented like the bitset stuff, in an algebra-free file, to store sequences of integers as you said.

Yes, it would make sense.

How do you define your hashing function and your comparison function, by the way ?

Depends on the implementation. Of course, when I use Integer or mpz_t, I am using the hash function provided by GMP. When I implement a block*, where block is either char, uint32_t or uint64_t, I am using some hash function for lists that I found on the internet a while ago (I don't recall where). It is as follows:

    def __hash__(self):
        cdef block h
        cdef block* ptr
        h = <block>(self.start^(self.end<<16)^(self._len))
        ptr = <block *>self.data
        cdef size_t i
        h ^= FNV_offset
        for i from 0<=i<self._len:
            h ^= ptr[i]
            h *= FNV_prime
        if h==-1:
            return -2
        return h

where FNV_prime and FNV_offset depend on the architecture. I am surprised that this is faster than GMP's hash function.
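
For reference, here is a plain Python rendering of the same scheme with the standard 64-bit FNV constants; the FNV_offset and FNV_prime used in the attachment may be chosen differently (e.g. on 32-bit systems):

# Pure-Python rendering of the FNV-style hash above, using the standard
# 64-bit FNV constants; the attachment's constants may differ.
FNV_OFFSET = 14695981039346656037   # standard 64-bit FNV offset basis
FNV_PRIME = 1099511628211           # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1

def hash_sequence(start, end, data):
    # data is the list of packed words; start/end are the end points of the path
    h = (start ^ (end << 16) ^ len(data)) & MASK64
    h ^= FNV_OFFSET
    for word in data:
        h ^= word
        h = (h * FNV_PRIME) & MASK64
    # CPython reserves the hash value -1; the Cython code above maps it to -2
    return -2 if h == MASK64 else h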

simon-king-jena commented 10 years ago

Description changed:

--- 
+++ 
@@ -1,4 +1,4 @@
 #12630 has introduced tools to compute with quiver algebras and quiver representations. It is all Python code, paths being internally stored by lists of labelled edges of digraphs.

-The aim of this ticket is to provide a more efficient implementation of the (partial) semigroup that is formed by the paths of a quiver, multiplied by concatenation.
+The aim of this ticket is to provide an efficient implementation of lists of bounded integers, with concatenation. This can be used to implement paths of a quiver more efficiently, but can also have different applications (say, elements of free groups, or other applications in sage-combinat).
simon-king-jena commented 10 years ago
comment:16

As Nathann has mentioned, it would make sense to provide a general implementation of sequences of bounded integers (independent of quiver paths), and then quiver paths can inherit from it later. That's why I changed the ticket description.

simon-king-jena commented 10 years ago

Changed keywords from quiver path algebra efficiency to sequence bounded integer

simon-king-jena commented 10 years ago

Changed dependencies from #12630 to none

simon-king-jena commented 10 years ago

Attachment: path_test.pyx.gz

A proof of concept

simon-king-jena commented 10 years ago
comment:18

For completeness, I have added Path_v6, which uses Python's tuple under the hood (which is an obvious way to implement paths, too).

Result:

sage: %timeit p5*q5
1000000 loops, best of 3: 439 ns per loop
sage: %timeit p5*q25
1000000 loops, best of 3: 542 ns per loop
sage: %timeit p25*q25
1000000 loops, best of 3: 612 ns per loop
sage: %timeit p25*q50
1000000 loops, best of 3: 869 ns per loop
sage: %timeit p50*q50
1000000 loops, best of 3: 1.1 us per loop
sage: %timeit p50*q100
1000000 loops, best of 3: 1.39 us per loop
sage: %timeit p100*q100
1000000 loops, best of 3: 1.41 us per loop
sage: %timeit p100*q1000
100000 loops, best of 3: 5.04 us per loop
sage: %timeit p1000*q1000
100000 loops, best of 3: 11 us per loop
sage: %timeit hash(p5)
1000000 loops, best of 3: 245 ns per loop
sage: %timeit hash(p25)
1000000 loops, best of 3: 532 ns per loop
sage: %timeit hash(p50)
1000000 loops, best of 3: 884 ns per loop
sage: %timeit hash(p100)
1000000 loops, best of 3: 1.63 us per loop
sage: %timeit hash(p1000)
100000 loops, best of 3: 14.6 us per loop
sage: %timeit p5==s5
1000000 loops, best of 3: 885 ns per loop
sage: %timeit p25==s25
1000000 loops, best of 3: 1.71 us per loop
sage: %timeit p50==s50
100000 loops, best of 3: 2.63 us per loop
sage: %timeit p100==s100
100000 loops, best of 3: 4.53 us per loop
sage: %timeit p1000==s1000
10000 loops, best of 3: 39.3 us per loop
sage: %timeit p5==r5
1000000 loops, best of 3: 785 ns per loop
sage: %timeit p25==r25
1000000 loops, best of 3: 829 ns per loop
sage: %timeit p50==r50
1000000 loops, best of 3: 799 ns per loop
sage: %timeit p100==r100
1000000 loops, best of 3: 791 ns per loop
sage: %timeit p1000==r1000
1000000 loops, best of 3: 822 ns per loop
sage: %timeit q5==r5
1000000 loops, best of 3: 1.48 us per loop
sage: %timeit q25==r25
100000 loops, best of 3: 3.72 us per loop
sage: %timeit q50==r50
100000 loops, best of 3: 6.63 us per loop
sage: %timeit q100==r100
100000 loops, best of 3: 12.5 us per loop
sage: %timeit q1000==r1000
10000 loops, best of 3: 116 us per loop
sage: %timeit q1000.startswith(q100)
100000 loops, best of 3: 4.7 us per loop
sage: %timeit q1000.startswith(r100)
100000 loops, best of 3: 4.72 us per loop

The comparison with the other implementations becomes this:

  1. Concatenation
                     5x5   50x50  1000x1000
Uncompressed long : 739 ns  885 ns 3.06 us  - -
Uncompressed short: 700 ns  808 ns 2.13 us
Uncompressed char : 679 ns  772 ns 1.31 us    +
Compressed   long : 676 ns  759 ns 1.61 us
Compressed   short: 718 ns  722 ns 1.98 us   +
Semicompr.   long : 684 ns  827 ns 1.42 us
Integer           : 645 ns 1.82 us 1.78 us   -
GMP          mpz_t: 723 ns  799 ns 1.41 us
tuple             : 439 ns  1.1 us 11   us  + -
  2. Hash
                     5       100     1000
Uncompressed long : 135 ns  333 ns  2.09 us  +
Uncompressed short: 138 ns  383 ns  2.61 us  +--
Uncompressed char : 133 ns  374 ns  2.53 us  +
Compressed   long : 132 ns  155 ns   466 ns  +++
Compressed   short: 132 ns  206 ns   951 ns  +
Semicompr.   long : 148 ns  194 ns   594 ns
Integer           : 164 ns  274 ns  1.94 us  
GMP          mpz_t: 159 ns  266 ns  1.93 us
tuple             : 245 ns 1.63 us 14.6  us  ---
  3. == for equal paths
                     5       100     1000
Uncompressed long : 1.63 us 23   us  212   us ---
Uncompressed short: 1.08 us 14.9 us  137   us
Uncompressed char : 1.08 us 12.5 us  121   us
Compressed   long :  710 ns  4.4 us   38   us
Compressed   short:  723 ns  5.3 us   47.5 us
Semicompr.   long :  498 ns  712 ns    2.4 us
Integer           :  667 ns  658 ns    817 ns
GMP          mpz_t:  466 ns  485 ns    693 ns +++
tuple             :  885 ns  4.5 us   39.3 us
  1. == for "easily" unequal paths
                     5       100     1000
Uncompressed long : 634 ns   686 ns   698 ns
Uncompressed short: 613 ns   603 ns   635 ns
Uncompressed char : 581 ns   599 ns   636 ns
Compressed   long : 692 ns   729 ns   752 ns 
Compressed   short: 577 ns   588 ns   607 ns
Semicompr.   long : 499 ns   526 ns   545 ns
Integer           : 746 ns   731 ns   733 ns 
GMP          mpz_t: 451 ns   468 ns   482 ns +++
tuple             : 785 ns   791 ns   822 ns ---
  1. == for "difficult" unequal paths
                     5       100      1000
Uncompressed long : 1.65 us   22   us  213 us   ---
Uncompressed short: 1.15 us   12.8 us  140 us
Uncompressed char : 1.06 us   12.7 us  121 us
Compressed   long :  706 ns    4.4 us   37.6 us
Compressed   short:  754 ns    5.3 us   47.3 us
Semicompr.   long :  506 ns    704 ns   2.5 us
Integer           :  710 ns    719 ns   755 ns
GMP          mpz_t:  448 ns    457 ns   470 ns  +++
tuple             : 1.48 us   12.5 us  116 us
  6. startswith
                     yes      no
Uncompressed long :  325 ns  311 ns
Uncompressed short:  336 ns  325 ns
Uncompressed char :  324 ns  315 ns
Compressed   long :  272 ns  257 ns  +
Compressed   short:  272 ns  256 ns  +
Semicompr.   long :  263 ns  232 ns  ++
Integer           : 2.46 us 2.45 us  
GMP          mpz_t:  673 ns  657 ns
tuple             : 4.7  us 4.72 us  --

So, using tuples is best for concatenation of short paths, but that's the only case where it is not too slow.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:19

Well, given the outcomes I guess it mostly depends on your own practical applications ?... :-)

I also wonder (because I hate them) how much time is spent dealing with parents. But all your implementations have to go through that at the moment so it does not change your comparisons.

Well, I am back to work at long long last. With a computer AND a screen. And in Brussels, which counts for the good mood :-P

Have fuuuuuuuuuuuuuun !

Nathann

simon-king-jena commented 10 years ago
comment:20

Hi Nathann,

Replying to @nathanncohen:

I also wonder (because I hate them) how much time is spent dealing with parents.

Well, at some point parents have to come into play, namely when I want to implement not just sequences of bounded integers, but elements of the path semigroup of a quiver.

The strategic question is: Should the data defining a path be stored in an attribute of that path (the attribute would be an instance of BoundedIntegerSequence), or should the path itself be an instance of BoundedIntegerSequence? In the former case, it would be perfectly fine to have sequences of bounded integers parent-less; but I worry about the overhead of accessing the attribute. In the latter case, BoundedIntegerSequence should already be a sub-class of sage.structure.element.Element (and have a parent), since Cython does not allow multiple inheritance.

I tend to the latter solution also for a different reason: Instances of BoundedIntegerSequence have to know the bound for their elements, because the data storage depends on that bound. For efficiency (see attachment), several other constants are derived from that bound, and used in the algorithms.

Should each instance of BoundedIntegerSequence know about all these bounds? Or should there be a common parent for all sequences of integers that are bounded by a number B?

I tend to the latter. If you assign several ints (the afore-mentioned constants) to an instance of BoundedIntegerSequence, it will likely be more time-consuming (during initialisation) than storing just a single datum (namely the parent).

That said, looking up the constants via the parent would be more time-consuming than looking up constants that are stored directly in the element.

Difficult....

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:21

Hellooooo Simon !

Hmmmm... All this is important to settle before implementing those BoundedIntegerSequences objects somewhere indeed.

What I thought you intended was to write a very low-level data structure somewhere in the misc/ folder. The kind of stuff that one does not use by mistake, the kind of stuff that one does not use without reading the manual first. So I thought it would not exactly check its input, blindly trusting the developer and being as efficient as possible.

Then, I thought you would have some higher-level algebraic stuff with parents, colors and blinking lights that would check its input indeed, and be what we want a user-friendly object to be.

But I actually have only one question: I don't know how you intend to use all this in the end, but I got the impression that several functions may have to generate a LOT of paths in order to compute some small data, and throw all the paths away once the computations have ended. Do you want to create high-level objects in those functions or not ?

If you don't, then perhaps you would be better with only low-level objects that you know how to use, which would not have to waste time (if time is actually wasted) on high-level problems.

And so I thought that your high-level objects would have a variable which would be this low-level object, etc, etc...

What do you have in mind ?

Nathann

simon-king-jena commented 10 years ago
comment:22

Hi Nathann,

Replying to @nathanncohen:

What I thought you intended was to write a very low-level data structure somewhere in the misc/ folder.

I somehow do.

The kind of stuff that one does not use by mistake,

Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).

Then, I thought you would have some higher-level algebraic stuff with parents, colors and blinking lights that would check its input indeed, and be what we want a user-friendly object to be.

Again ideally: Blinking colours should only be added when one really implements paths in quivers.

But I actually have only one question: I don't know how you intend to use all this in the end, but I got the impression that several functions may have to generate a LOT of paths in order to compute some small data, and throw all the paths away once the computations have ended. Do you want to create high-level objects in those functions or not ?

Good point. So, you mean that these functions should actually not create fully-fledged quiver paths, but should just operate on instances of BoundedIntegerSequence, and only in the end interpret them as paths, when the battle smoke has vanished.

And so I thought that your high-level objects would have a variable which would be this low-level object, etc, etc...

Yes, probably that's better.

So, perhaps I should really do cdef class BoundedIntegerSequence(tuple) rather than cdef class BoundedIntegerSequence(MonoidElement).

But then, what to do with all these internally used constants that are needed even for the low-level functionality of BoundedIntegerSequence? Would you prefer to compute and store these constants 10000 times when creating 10000 sequences of integers bounded by B? Or would you prefer to compute and store these constants only once, namely in a parent that hosts all sequences of integers with a fixed bound?

Or do you have a third solution for storing/accessing these constants efficiently?

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:23

Helloooooo !!

Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).

Why do you want it to inherit from tuple ? I do not know the effect of such a thing, what it brings and what should be feared

Good point. So, you mean that these functions should actually not create fully-fledged quiver paths, but should just operate on instances of BoundedIntegerSequence, and only in the end interpret them as paths, when the battle smoke has vanished.

Well, yep. Only create the high-level stuff with possible overhead when you need it. And you end up knowing exactly what your code does when you are writing the code of your time-critical function.

So, perhaps I should really do cdef class BoundedIntegerSequence(tuple) rather than cdef class BoundedIntegerSequence(MonoidElement).

Hmmm.. I still don't get why you want it to inherit from tuple. My natural move would be to implement this as a C structure, not even an object :-P

But then, what to do with all these internally used constants that are needed even for the low-level functionality of BoundedIntegerSequence? Would you prefer to compute and store these constants 10000 times when creating 10000 sequences of integers bounded by B? Or would you prefer to compute and store these constants only once, namely in a parent that hosts all sequences of integers with a fixed bound?

Well, somehow that's already what we do with graph backends. We have a Generic Backend, extended twice in Dense Graphs and Sparse Graphs. And all the constants needed by the data structures are stored in the corresponding files. You could have a BoundedIntegerSequence class implementing no function at all, extended by all different implementations of the data structures you want. Even in the same file.

... And we don't need parents for that :-PPPP

Nathann

simon-king-jena commented 10 years ago
comment:24

Replying to @nathanncohen:

Why? It should be a replacement for tuples (thus, low-level), but it should ideally be (close to) a drop-in replacement (thus, it doesn't matter whether one uses it by mistake).

Why do you want it to inherit from tuple ?

By analogy to Sequence_generic (which inherits from list):

sage: S = Sequence([1,2,3])
sage: type(S)
<class 'sage.structure.sequence.Sequence_generic'>
sage: type(S).mro()
[sage.structure.sequence.Sequence_generic,
 sage.structure.sage_object.SageObject,
 list,
 object]

Well, somehow that's already what we do with graph backends. We have a Generic Backend, extended twice in Dense Graphs and Sparse Graphs. And all the constants needed by the data structures are stored in the corresponding files.

No, that's a different situation. If you have a sequence S1 of integers bounded by B1 and a sequence S2 of integers bounded by B2, then the constants for S1 are different from the constants for S2. So, it is not really a "global" constant that is shared by all instances of a certain data structure (and could thus be put into a file), but when you do, for example, for x in S1, then you'll need different constants than when you do for x in S2.

So, these constants are not shared by all sequences, but by all sequences that share the same bound. Thus, it seems natural (to me, at least) to have one object O(B) that is shared by all sequences of bound B, so that O(B) provides the constants that belong to B.

Sure, O(B) could be written in C as well. But why not make it a parent, when all we do is let each sequence have a pointer to O(B)?
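
A minimal sketch of what I mean, purely illustrative (the names and the particular derived constants are made up): one shared object per bound B caches the constants, and each sequence only stores a reference to it:

# Illustration only: one shared object per bound caches derived constants.
class BoundInfo(object):
    _cache = {}

    def __new__(cls, bound):
        # one shared instance per bound, so the constants are computed once
        if bound not in cls._cache:
            self = object.__new__(cls)
            self.bound = bound
            self.itembits = max(bound - 1, 1).bit_length()  # bits per entry (values < bound)
            self.mask = (1 << self.itembits) - 1
            self.per_word = 64 // self.itembits
            cls._cache[bound] = self
        return cls._cache[bound]

class BoundedSequence(object):
    def __init__(self, bound, entries):
        self.info = BoundInfo(bound)   # single reference, much like a parent
        self.entries = list(entries)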

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:25

By analogy to Sequence_generic (which inherits from list):

?...

Well. And why is that a good idea ? :-P

No, that's a different situation. If you have a sequence S1 of integers bounded by B1 and a sequence S2 of integers bounded by B2, then the constants for S1 are different from the constants for S2. So, it is not really a "global" constant that is shared by all instances of a certain data structure (and could thus be put into a file), but when you do, for example, for x in S1, then you'll need different constants than when you do for x in S2.

Hmmmm... I see. Well, isn't your only parameter the number of bits you need to store each entry ? If it is, did you decide whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?

If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)

I'd say.... 4,8,16,32 ? :-)

So, these constants are not shared by all sequences, but by all sequences that share the same bound. Thus, it seems natural (to me, at least) to have one object O(B) that is shared by all sequences of bound B, so that O(B) provides the constants that belong to B.

If your data structure is not an object, you could also add this width parameter as an input of the functions you use ? I guess all bounded integer sequences that you use in the same function would have the same bound.

But admittedly I love to split hairs. Tell me if you think that this is going too far :-P

Sure, O(B) could be written in C as well. But why not make it a parent, when all we do is let each sequence have a pointer to O(B)?

Ahahahah. That's up to you. I just hate Sage's parents/elements. I would live happy with a C pointer I can control, but not with a framework that I do not understand. If you understand it, then you know what you are doing. I don't :-P

Nathann

simon-king-jena commented 10 years ago
comment:26

Replying to @nathanncohen:

Hmmmm... I see. Well, isn't your only parameter the number of bits you need to store each entry ?

No. We have

cdef Integer NumberArrows
cdef Integer IntegerMask
cdef size_t ArrowBitSize, BigArrowBitSize
cdef block* bitmask = NULL
cdef block DataMask, BitMask
cdef size_t ArrowsPerBlock, BigArrowsPerBlock
cdef size_t ModBAPB, DivBAPB, TimesBABS, TimesBlockSize

For example, ModBAPB is used to replace n%BigArrowsPerBlock by a bit-wise and, which is noticeably faster.
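
To illustrate the trick (with made-up values; only the principle matters): when BigArrowsPerBlock is a power of 2, the modulo turns into a bitwise "and" with a mask, and the division into a right shift:

# Illustration of the mask/shift trick behind ModBAPB and DivBAPB, assuming
# BigArrowsPerBlock is a power of 2 (here 8, i.e. 8-bit chunks in a 64-bit word).
BigArrowsPerBlock = 8
ModBAPB = BigArrowsPerBlock - 1                 # mask replacing "% BigArrowsPerBlock"
DivBAPB = BigArrowsPerBlock.bit_length() - 1    # shift replacing "// BigArrowsPerBlock"

for n in range(100):
    assert n & ModBAPB == n % BigArrowsPerBlock
    assert n >> DivBAPB == n // BigArrowsPerBlock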

If it is, did you decide whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?

Depending on the implementation...

If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)

Isn't duplication of code supposed to smell?

If your data structure is not an object, you could also add this width parameter as an input of the functions you use ? I guess all bounded integer sequences that you use in the same function would have the same bound.

Do I understand correctly: You suggest that we should not provide a class BoundedIntegerSequence that can be used interactively in Sage, but you suggest that we should just provide a couple of functions that operate on C structures (say, mpz_t or unsigned long*) and can only be used in Cython code?

Then it would be better to revert this ticket to its original purpose: There would be...

  1. ... cdef functions operating on mpz_t resp. on unsigned long*,
  2. ... a Cython class Path using mpz_t resp. unsigned long* as a cdef attribute, using the aforementioned functions to implement concatenation and iteration,
  3. ... a subsequent ticket, implementing an F5 style algorithm to compute standard bases, operating not with Path but with mpz_t* resp unsigned long** (which will be the case anyway).

But admittedly I love to split hairs. Tell me if you think that this is going too far :-P

No problem. I just want to know if people see the need to have a Cython implementation of sequences of bounded integers, for use in Python.

Ahahahah. That's up to you. I just hate Sage's parents/elements. I would live happy with a C pointer I can control, but not with a framework that I do not understand.

Making it something like a container rather than a fully fledged parent would be an option, from my perspective.

simon-king-jena commented 10 years ago
comment:27

Replying to @simon-king-jena:

Making it something like a container rather than a fully fledged parent would be an option, from my perspective.

... I mean, for BoundedIntegerSequence. Of course, Path must have a parent (the path semigroup of a quiver).

nthiery commented 10 years ago
comment:28

Just a random thought. Some issues here are of the same nature as what was discussed around ClonableElement and its subclasses. Maybe there is stuff to share with it, especially if one ends up inheriting from Element.

simon-king-jena commented 10 years ago
comment:29

I just had a look at ClonableElement, and one immediate thought is: Why are C-pointers allocated in __init__? That's clearly the purpose of __cinit__, in particular if said C-pointers are freed in __dealloc__.
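
To spell out the pattern I mean, here is a generic sketch (not code from ClonableElement or from this ticket's branch):

from libc.stdlib cimport calloc, free

cdef class DataHolder:
    cdef unsigned long* data
    cdef size_t length
    def __cinit__(self, size_t length):
        # __cinit__ is run exactly once, before __init__, so C-level
        # allocation belongs here (+1 so that length 0 is harmless)
        self.data = <unsigned long*>calloc(length + 1, sizeof(unsigned long))
        if self.data == NULL:
            raise MemoryError
        self.length = length
    def __dealloc__(self):
        # __dealloc__ may be run even if __init__ raised or never ran,
        # which is exactly why the allocation should not be in __init__
        if self.data != NULL:
            free(self.data)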

Do we want to have sequences of bounded integers for use in Python? Then one possible way would be:

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:30

Yooooooo !!!

No. We have

cdef Integer NumberArrows
cdef Integer IntegerMask
cdef size_t ArrowBitSize, BigArrowBitSize
cdef block* bitmask = NULL
cdef block DataMask, BitMask
cdef size_t ArrowsPerBlock, BigArrowsPerBlock
cdef size_t ModBAPB, DivBAPB, TimesBABS, TimesBlockSize

Hmmmm.. Can't they all be deduced from the type of implementation (compressed long, for instance) and the number of bits per entry then ?

For example, ModBAPB is used to replace n%BigArrowsPerBlock by a bit-wise and, which is noticeably faster.

Yepyep of course.

If it is, did you decided whether you want to allow it to be arbitrary, or if you want it to be a power of 2 ?

Depending on the implementation...

Okay....

If it is a power of 2, then perhaps templates are sufficient, i.e. "one data structure per value of this parameter". There are not so many powers of 2 between 1 and 64 that make sense to store something :-)

Isn't duplication of code supposed to smell?

Well, if it is a template the code is only implemented once. And it will generate several data structures at compile time indeed, but each of them will have a hardcoded constant, and you only implement one.

Do I understand correctly: You suggest that we should not provide a class BoundedIntegerSequence that can be used interactively in Sage, but you suggest that we should just provide a couple of functions that operate on C structures (say, mpz_t or unsigned long*) and can only be used in Cython code?

That's what I had in mind, but you make decisions here, pick what you like.

It's a bit like how bitsets are implemented. A C data structure, and an object wrapper on top of it. Or C graphs, with an object wrapper on top of it. It is how Permutations should be implemented, too :-P

Well, you have a C object that you use when you want performance, and when you don't care you use the higher-level implementations.
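
To make the analogy concrete, here is a toy version of that two-level pattern (uncompressed storage and invented names; just an illustration of the idea, not the bitset or graph code, and not what this ticket would ship):

from libc.stdlib cimport calloc, free
from libc.string cimport memcpy

# C level: a plain struct and plain functions, for use from Cython code
cdef struct toyseq_s:
    unsigned long* data
    size_t length

cdef bint toyseq_concat(toyseq_s* out, toyseq_s* a, toyseq_s* b) except -1:
    # allocate out and fill it with the entries of a followed by those of b
    out.length = a.length + b.length
    out.data = <unsigned long*>calloc(out.length + 1, sizeof(unsigned long))
    if out.data == NULL:
        raise MemoryError
    memcpy(out.data, a.data, a.length * sizeof(unsigned long))
    memcpy(out.data + a.length, b.data, b.length * sizeof(unsigned long))
    return True

# Python level: a thin wrapper class for interactive use
cdef class ToySequence:
    cdef toyseq_s seq
    def __cinit__(self, items=()):
        self.seq.length = len(items)
        self.seq.data = <unsigned long*>calloc(self.seq.length + 1,
                                               sizeof(unsigned long))
        if self.seq.data == NULL:
            raise MemoryError
        for i, x in enumerate(items):
            self.seq.data[i] = x
    def __dealloc__(self):
        if self.seq.data != NULL:
            free(self.seq.data)
    def __add__(ToySequence self, ToySequence other):
        cdef ToySequence out = ToySequence()
        free(out.seq.data)       # discard the dummy buffer from __cinit__
        out.seq.data = NULL
        toyseq_concat(&out.seq, &self.seq, &other.seq)
        return out
    def __repr__(self):
        return repr([self.seq.data[i] for i in range(self.seq.length)])

Then ToySequence([1,2]) + ToySequence([3]) works at the Python level, while Cython code that needs speed can call toyseq_concat directly on the structs.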

Then it would be better to revert this ticket to its original purpose: There would be...

  1. ... cdef functions operating on mpz_t resp. on unsigned long*,
  2. ... a Cython class Path using mpz_t resp. unsigned long* as a cdef attribute, using the aforementioned functions to implement concatenation and iteration,
  3. ... a subsequent ticket, implementing an F5-style algorithm to compute standard bases, operating not with Path but with mpz_t* resp. unsigned long** (which will be the case anyway).

Well, I like this plan. What do you think ? Does it displease you in any way ?

No problem. I just want to know if people see the need to have a Cython implementation of sequences of bounded integers, for use in Python.

I do not see how I could use it in my code at the moment, but I like the idea of having such a data structure somewhere, ready to be used. Most importantly, if you want speed, I think that having it available at two different levels is a safe bet. You write more code, but you will know where the time is spent in your time-critical functions.

Nathann

simon-king-jena commented 10 years ago
comment:31

Hi Nathann,

Replying to @nathanncohen:

Hmmmm.. Can't they all be deduced from the type of implementation (compressed long, for instance) and the number of bits per entry then ?

Sure. And you want to do this same computation 10^7 times?

Isn't duplication of code supposed to smell?

Well, if it is a template the code is only implemented once. And it will generate several data structures at compile time indeed, but each of them will have a hardcoded constant, and you only implement one.

Yes. Probably my problem is that I don't think "C++", but I think "Cython". And there it is a bit clumsy (I think templating is mimicked in Cython by including .pxi files, isn't it?).

That's what I had in mind, but you make decisions here, pick what you like.

It's a bit how bitsets are implemented.

Right, that's what I have looked at, and think makes sense here. Up to the constants.

Then it would be better to revert this ticket to its original purpose: There would be...

  1. ... cdef functions operating on mpz_t resp. on unsigned long*,
  2. ... a Cython class Path using mpz_t resp. unsigned long* as a cdef attribute, using the aforementioned functions to implement concatenation and iteration,
  3. ... a subsequent ticket, implementing an F5-style algorithm to compute standard bases, operating not with Path but with mpz_t* resp. unsigned long** (which will be the case anyway).

Well, I like this plan. What do you think ? Does it displease you in any way ?

No, it is one possibility.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:33

Sure. And you want to do this same computation 10^7 times?

I don't get it. If the width is a power of two, then there are 6 possible values, i.e. 2^1, 2^2, 2^3, 2^4, 2^5, 2^6. Otherwise you have at most 64, but really who cares in the end ?

Yes. Probably my problem is that I don't think "C++", but I think "Cython". And there it is a bit clumsy (I think templating is mimicked in Cython by including .pxi files, isn't it?).

I am thinking of this : http://stackoverflow.com/questions/4840819/c-template-specialization-with-constant-value

That's C++, and I don't see how to do it in Cython, though. The point is that you can parametrize a data structure with a constant value, and the data structure will be compiled for each value of the constant that appears in the code at compile time (not all possible values, of course). The code is then optimized knowing the value (which can enable optimizations), and you can also do things that would not be possible otherwise, like "cdef int a[k]" where k is a (templated) variable.

Nathann

simon-king-jena commented 10 years ago
comment:34

Replying to @nathanncohen:

Sure. And you want to do this same computation 10^7 times?

I don't get it. If the width is a power of two, then there are 6 possible values, i.e. 2^1, 2^2, 2^3, 2^4, 2^5, 2^6. Otherwise you have at most 64, but really who cares in the end ?

Apparently I am not used to templated code...

I was somehow thinking that you suggested keeping the bitlength of the upper bound for the integers as the only parameter, which would mean computing all the other constants either for each operation or (when storing them as attributes of the sequences) for each sequence.

I am thinking of this : http://stackoverflow.com/questions/4840819/c-template-specialization-with-constant-value

That's C++, and I don't see how to do it in Cython, though.

Part of the question is: Do we want to use gmp as backend? Or do we want to (re-)implement operations on densely packed long* by ourselves? Given the timings, gmp is an option.

Anyway, I'll have a look at the stackoverflow discussion. Thank you for the pointer!

The next question then is how to access the code from Cython. I've done this with C, but not C++.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:35

Apparently I am not used to templated code...

I was somehow thinking that you suggested keeping the bitlength of the upper bound for the integers as the only parameter, which would mean computing all the other constants either for each operation or (when storing them as attributes of the sequences) for each sequence.

Nonono. Somehow, you can think of templates as a way to "script" the generation of code. You write some code with a constant k which is never defined anywhere, and then you say "compile one version of this code for each k in 1, 2, 3, 4, 5, 6". You end up with several data structures implemented, each with a hardcoded value. And then you can create an object cdef LongCompressedPath<5> my_path in which the hardcoded bound is 2^5. You just have many additional types, each with its own constant.

Part of the question is: Do we want to use gmp as backend? Or do we want to (re-)implement operations on densely packed long* by ourselves? Given the timings, gmp is an option.

Yep, you are right. It depends on how you weight the operations that you will need: it looks okay for concatenation and bad for startswith, but I expect that the most time-consuming operation will be concatenation ? You are the one who needs it :-)

Anyway, I'll have a look at the stackoverflow discussion. Thank you for the pointer!

The next question then is how to access the code from Cython. I've done this with C, but not C++.

HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing whether this kind of template can be written in pure Cython; do you mind if I ask the question on cython-devel (did you want to send one yourself) ?

Nathann

simon-king-jena commented 10 years ago
comment:37

Hi Nathann,

sorry for the long silence!

Replying to @nathanncohen:

HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing whether this kind of template can be written in pure Cython; do you mind if I ask the question on cython-devel (did you want to send one yourself) ?

Did you find out how templated code can be written in Cython? If you have asked on cython-devel (so far I have only posted on cython-users) or have another pointer, then please tell me. If not, tell me also...

Cheers, Simon

simon-king-jena commented 10 years ago
comment:38

Hi Nathann,

wouldn't the following be close to templates?

In Cython:

cimport cython

ctypedef fused str_or_int:
    str
    int

cpdef str_or_int times_two(str_or_int var):
    return var+var

def show_me():
    cdef str a = 'a'
    cdef int b = 3
    print 'str', times_two(a)
    print 'int', times_two(b)

and this results in

sage: show_me()
str aa
int 6
simon-king-jena commented 10 years ago
comment:39

Or perhaps a better example, showing that the fused type is used to choose code at compile time: Put the following into a cell of the notebook

%cython
cimport cython

ctypedef fused str_or_int:
    str
    int
cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var*2
    else:
        return var+var
def test():
    cdef int a = 3
    cdef str b = 'b'
    a = times_two(a)
    b = times_two(b)

and look at the resulting code. The line a = times_two(a) becomes

 __pyx_v_a = __pyx_fuse_1__pyx_f_68_home_king__sage_sage_notebook_sagenb_home_admin_2_code_sage4_spyx_0_times_two(__pyx_v_a, 0);

whereas b = times_two(b) becomes

__pyx_t_1 = __pyx_fuse_0__pyx_f_68_home_king__sage_sage_notebook_sagenb_home_admin_2_code_sage4_spyx_0_times_two(__pyx_v_b, 0);

and one can check that Cython created two different C-functions __pyx_fuse_0... and __pyx_fuse_1... for the two possible (type-dependent!) versions of times_two.

simon-king-jena commented 10 years ago
comment:40

And here an example showing that it pays off, speed-wise:

In a .pyx file:

cimport cython

ctypedef fused str_or_int:
    str
    int

cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var+var
    else:
        return var+var

cpdef untyped_times_two(var):
    return var+var

cpdef int int_times_two(int var):
    return var+var

cpdef str str_times_two(str var):
    return var+var

def test1():
    cdef int a = 3
    cdef str b = 'b'
    a = times_two(a)
    b = times_two(b)

def test2():
    cdef int a = 3
    cdef str b = 'b'
    a = untyped_times_two(a)
    b = untyped_times_two(b)

def test3():
    cdef int a = 3
    cdef str b = 'b'
    a = int_times_two(a)
    b = str_times_two(b)

Attach the .pyx file to Sage, and get this:

sage: %attach /home/king/Sage/work/templates/test.pyx
Compiling /home/king/Sage/work/templates/test.pyx...
sage: %timeit test1()
10000000 loops, best of 3: 159 ns per loop
sage: %timeit test2()
1000000 loops, best of 3: 199 ns per loop
sage: %timeit test3()
10000000 loops, best of 3: 168 ns per loop

EDIT: I have replaced the original example by something fairer, where it says var+var everywhere, but in the templated version Cython can choose the c-implementation of var+var at compile time, whereas it can't (and hence loses time when running the code) in the untemplated version.

EDIT 2: The new example also shows that choosing a specialised implementation manually is not faster than Cython using templates.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:41

Hellooooooooooooooooo !!!

EDIT: I have replaced the original example by something fairer, where it says var+var everywhere, but in the templated version Cython can choose the c-implementation of var+var at compile time, whereas it can't (and hence loses time when running the code) in the untemplated version.

HMmmmmm... Does it make any difference ? gcc should resolve those "if"s at compile time, so if Cython creates two versions of the function with an "if" inside depending on the type, gcc should not include it in the final binary, as the constants involved are known at compile time.

Anyway, that's a good news ! :-D

Nathann

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:42

Yooooooooooo !!

HMmmmmmm.... There is a C++ interface in numerical/backends/coin_backend.pyx but it does not play with templates. I am interested in knowing whether this kind of template can be written in pure Cython; do you mind if I ask the question on cython-devel (did you want to send one yourself) ?

Did you find out how templated code can be written in Cython? If you have asked on cython-devel (so far I have only posted on cython-users) or have another pointer, then please tell me. If not, tell me also...

Ahahah.. Well, no ... When I ask a question I do not do anything before I get an answer, soooooooooo I never did it :-P

I ask a question, then forget everything about it. When I get the answer I do what should be done :-P

Nathann

simon-king-jena commented 10 years ago
comment:43

Hi Nathann,

Replying to @nathanncohen:

HMmmmmm... Does it make any difference ? gcc should resolve those "if" at compile-time, so if Cython creates two version of the function with an "if" inside depending on the type, gcc should not include it in the final binary as the constants involved can be defined at compile-time.

Yes, it is indeed resolved at compile time. Two c-functions are created, and only the part of the code relevant for the respective code branch appears in the c-function.

In other words, I now plan to create cdef functions like this:

ctypedef fused seq_t:
    first_implementation
    second_implementation

cdef seq_t concat(seq_t x, seq_t y):
    if seq_t is first_implementation:
        <do stuff for 1st implementation>
    else:
        <do stuff for 2nd implementation>

and then, in other modules (perhaps also in src/sage/combinat/words/word_datatypes.pyx?), one could uniformly call concat(a,b), provided that the types of a and b are known at compile time (or write concat[first_implementation](a,b) explicitly).

I think that such Cython code will be sufficiently nice to read, and the resulting C-code will be sufficiently fast.

Anyway, that's a good news ! :-D

Indeed!

Simon

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:44

I think that such Cython code will be sufficiently nice to read, and the resulting C-code will be sufficiently fast.

+1

Nathann

simon-king-jena commented 10 years ago
comment:45

Hmmm. Meanwhile I am a bit less positive about fused types. Let seq_t be a fused type for different implementations.

The problem is that a cdef attribute of a cdef class cannot be declared with a fused type, so the specialisation cannot be hidden behind a single wrapper class. Hence, it seems that one needs to create two separate iterators, and it also seems that one needs code duplication. I.e., when impl1 and impl2 are two specialisations of the fused type seq_t, then one has to have

cdef class BoundedIntSeq_impl1:
    cdef impl1 seq
    def __mul__(self, other):
        ...
        cdef impl1 out = concat_seq(self.seq, other.seq)
cdef class BoundedIntSeq_impl2:
    cdef impl2 seq
    def __mul__(self, other):
        ...
        cdef impl2 out = concat_seq(self.seq, other.seq)

This mainly amounts to cut-and-paste, but is ugly!

I think I'll ask on cython-users if there are known ideas to solve this more elegantly.

simon-king-jena commented 10 years ago
comment:46

Some variation on the theme: One can create fused types from two cdef classes, and then one can work with specialisations of functions as follows:


ctypedef fused str_or_int:
    str
    int

# make the function do different things on different types,
# to better see the difference
cpdef str_or_int times_two(str_or_int var):
    if str_or_int is int:
        return var+var
    else:
        return var+'+'+var

# forward declaration of cdef classes and their fusion
cdef class Bar_int
cdef class Bar_str
cdef fused Bar:
    Bar_int
    Bar_str

# a function that can be specialised to either of the two extension classes
def twice(Bar self):
    # and inside, we specialised a function depending
    # on the type of an attribute --- at compile time!!
    return type(self)(times_two(self.data))

cdef class Bar_str:
    cdef str data
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return self.data
    # First possibility: Use the "templated" function as a method
    twice_ = twice[Bar_str]
    # second possibility: Wrap the "templated" function
    cpdef Bar_str twice__(self):
        return twice(self)

# The same, with the other Specialisation of <Bar>
cdef class Bar_int:
    cdef int data
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return "int(%d)"%self.data
    twice_ = twice[Bar_int]
    cpdef Bar_int twice__(self):
        return twice(self)

Time-wise, we see that wrapping the "templated" function is slower than using it as an attribute:

sage: b1 = Bar_str('abc')
sage: b1.twice_()
abc+abc
sage: b1.twice__()
abc+abc
sage: %timeit b1.twice_()
1000000 loops, best of 3: 759 ns per loop
sage: %timeit b1.twice__()
100000 loops, best of 3: 13.7 µs per loop
sage: b2 = Bar_int(5)
sage: b2.twice_()
int(10)
sage: b2.twice__()
int(10)
sage: %timeit b2.twice_()
1000000 loops, best of 3: 539 ns per loop
sage: %timeit b2.twice__()
100000 loops, best of 3: 9.2 µs per loop

Summary and further comments

simon-king-jena commented 10 years ago
comment:47

I did a lot of benchmark tests for the different implementations (using long* or char* or mpz_t to store sequences of bounded integers, using cdef functions that operate on fused types) and eventually found that using GMP (i.e., mpz_t) is fastest in all tests, by a wide margin.

Why did I not find this result before? Well, part of the reason is that GMP sometimes provides different functions to do the same job, and in the past few days I considerably improved my usage of the various possibilities. For example, mpz_t can be initialised in different ways. If one uses mpz_init, then the assignment that follows results in a re-allocation. Since the length of the concatenation of two sequences is known in advance, prescribing the number of bits to be allocated with mpz_init2 resulted in a speed-up of concatenation; in fact, it became twice as fast.
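
For illustration, this is roughly the kind of code I mean (a stripped-down sketch, not the code in the branch; it assumes the entries are packed into the mpz_t with the first entry in the least significant bits, and the helper name is made up):

from sage.libs.gmp.types cimport mpz_t
from sage.libs.gmp.mpz cimport mpz_init2, mpz_mul_2exp, mpz_ior

cdef void concat_data(mpz_t out, mpz_t x, size_t x_bits, mpz_t y, size_t y_bits):
    # out is assumed to be uninitialised; x_bits and y_bits are the numbers
    # of data bits used by x and y (i.e. length times bits per entry)
    mpz_init2(out, x_bits + y_bits)  # allocate the final size once, so the
                                     # operations below cause no re-allocation
    mpz_mul_2exp(out, y, x_bits)     # shift y past the bits occupied by x
    mpz_ior(out, out, x)             # fill the low bits with x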

And this result is not surprising. After all, GMP has to implement the same bitshift and comparison operations for bit arrays that I was implementing in the "long*" approach, too. But GMP probably uses implementation techniques that I can not even dream of.

Now, my plan is to

In a subsequent ticket, I plan to use it for quiver paths, and if the combinat crowd cares, they can use it to make parts of sage.combinat.words a lot faster.

Comments? Requests? Suggestion of alternatives?

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 10 years ago
comment:48

None from me. I keep being impressed at how much testing/researching you put into these quiver features... I really wonder what you will do with it in the end ;-)

Nathann

simon-king-jena commented 10 years ago

Branch: u/SimonKing/ticket/15820

simon-king-jena commented 10 years ago
comment:50

I have pushed a branch with an initial implementation of bounded integer sequences. I took care of error handling: I think you will find that those GMP commands that may result in a memory error are wrapped in sig_on()/sig_off(), and the cdef boilerplate functions do propagate errors. In fact, in some of my tests, I was hitting Ctrl-C, and it worked.

The rest of this post is about benchmarks. I compare against Python tuples---which is of course ambitious, and you will find that tuples are not slower than bounded integer sequences in every situation. However, one should see it positively: The bounded integer sequences implemented here have features very similar to tuples, and there are operations for which they are a lot faster than tuples. It all depends on the length and the bound of the sequence. That's why I think that it is a reasonable contribution. And to make it faster (without Python classes), one can still use the cdef boilerplate functions. Generally, it seems that long bounded integer sequences are faster than long tuples, but short tuples are faster than short sequences. For hash, bounded integer sequences are pretty good, but they suck at accessing single items and at slicing with a step different from 1.

In contrast to tuples, bounded integer sequences also provide fairly quick tests for whether a sequence starts with a given sequence, or whether a sequence contains a certain sub-sequence. This is a feature I took from strings.

Here are the tests, with different sizes and bounds. Variable names starting with "T" hold tuples, those starting with "S" hold bounded integer sequences.

sage: from sage.structure.bounded_integer_sequences import BoundedIntegerSequence
sage: L0 = [randint(0,7) for i in range(5)]
sage: L1 = [randint(0,15) for i in range(15)]
sage: L2 = [randint(0,31) for i in range(50)]
sage: L3 = [randint(0,31) for i in range(5000)]
sage: T0 = tuple(L0); T1 = tuple(L1); T2 = tuple(L2); T3 = tuple(L3)
sage: S0 = BoundedIntegerSequence(8, L0)
sage: S1 = BoundedIntegerSequence(16, L1)
sage: S2 = BoundedIntegerSequence(32, L2)
sage: S3 = BoundedIntegerSequence(32, L3)

Iteration

sage: timeit("x=[y for y in T0]", number=1000000)
1000000 loops, best of 3: 783 ns per loop
sage: timeit("x=[y for y in T1]", number=1000000)
1000000 loops, best of 3: 1.48 µs per loop
sage: timeit("x=[y for y in T2]", number=100000)
100000 loops, best of 3: 4.3 µs per loop
sage: timeit("x=[y for y in T3]", number=1000)
1000 loops, best of 3: 245 µs per loop
sage: timeit("x=[y for y in S0]", number=1000000)
1000000 loops, best of 3: 2.38 µs per loop
sage: timeit("x=[y for y in S1]", number=100000)
100000 loops, best of 3: 4.1 µs per loop
sage: timeit("x=[y for y in S2]", number=100000)
100000 loops, best of 3: 10.1 µs per loop
sage: timeit("x=[y for y in S3]", number=1000)
1000 loops, best of 3: 1.79 ms per loop

Slicing

Bounded integer sequences are immutable and hence copied by identity. But let us do slicing, dropping the last item:

sage: timeit("x=T3[:-1]", number=100000)
100000 loops, best of 3: 19.5 µs per loop
sage: timeit("x=S3[:-1]", number=100000)
100000 loops, best of 3: 3.48 µs per loop

Slicing with step!=1 is much slower, though:

sage: timeit("x=T3[:-1:2]", number=100000)
100000 loops, best of 3: 11.7 µs per loop
sage: timeit("x=S3[:-1:2]", number=1000)
1000 loops, best of 3: 2.23 ms per loop

Perhaps I should mark it "TODO"...

Accessing single items

Short sequences

sage: timeit("x=T0[1]", number=1000000)
1000000 loops, best of 3: 361 ns per loop
sage: timeit("x=T0[4]", number=1000000)
1000000 loops, best of 3: 373 ns per loop
sage: timeit("x=S0[1]", number=1000000)
1000000 loops, best of 3: 959 ns per loop
sage: timeit("x=S0[4]", number=1000000)
1000000 loops, best of 3: 960 ns per loop

Large sequences (time also depends on the bound for the integer sequence!)

sage: timeit("x=T3[1]", number=1000000)
1000000 loops, best of 3: 359 ns per loop
sage: timeit("x=T3[4500]", number=1000000)
1000000 loops, best of 3: 382 ns per loop
sage: timeit("x=S3[1]", number=1000000)
1000000 loops, best of 3: 1.97 µs per loop
sage: timeit("x=S3[4500]", number=1000000)
1000000 loops, best of 3: 1.49 µs per loop

Comparison

Note that comparison of bounded integer sequences works differently from comparison of tuples or lists, as detailed in the documentation.

We compare sequences that are equal but non-identical, that differ in early items, or that differ in late items.

sage: L0x = [randint(0,7) for i in range(5)]
sage: L1x = [randint(0,15) for i in range(15)]
sage: L2x = [randint(0,31) for i in range(50)]
sage: L3x = [randint(0,31) for i in range(5000)]  # verified that they differ from L0,L1,L2,L3
sage: T0x = tuple(L0x); T1x = tuple(L1x); T2x = tuple(L2x); T3x = tuple(L3x)
sage: S0x = BoundedIntegerSequence(8, L0x)
sage: S1x = BoundedIntegerSequence(16, L1x)
sage: S2x = BoundedIntegerSequence(32, L2x)
sage: S3x = BoundedIntegerSequence(32, L3x)
sage: T0y = tuple(L0); T1y = tuple(L1); T2y = tuple(L2); T3y = tuple(L3)
sage: S1y = BoundedIntegerSequence(8, L1)
sage: S2y = BoundedIntegerSequence(32, L2)
sage: S3y = BoundedIntegerSequence(32, L3)
sage: S1y==S1!=S1x, S1y is not S1
(True, True)

Early differences:

sage: timeit("T0==T0x", number=1000000)
1000000 loops, best of 3: 143 ns per loop
sage: timeit("T1==T1x", number=1000000)
1000000 loops, best of 3: 145 ns per loop
sage: timeit("T2==T2x", number=1000000)
1000000 loops, best of 3: 143 ns per loop
sage: timeit("T3==T3x", number=1000000)
1000000 loops, best of 3: 161 ns per loop
sage: timeit("S0==S0x", number=1000000)
1000000 loops, best of 3: 538 ns per loop
sage: timeit("S1==S1x", number=1000000)
1000000 loops, best of 3: 550 ns per loop
sage: timeit("S2==S2x", number=1000000)
1000000 loops, best of 3: 490 ns per loop
sage: timeit("S3==S3x", number=1000000)
1000000 loops, best of 3: 559 ns per loop

Equal sequences:

sage: timeit("T0==T0y", number=1000000)
1000000 loops, best of 3: 169 ns per loop
sage: timeit("T1==T1y", number=1000000)
1000000 loops, best of 3: 255 ns per loop
sage: timeit("T2==T2y", number=1000000)
1000000 loops, best of 3: 597 ns per loop
sage: timeit("T3==T3y", number=100000)
100000 loops, best of 3: 47.8 µs per loop
sage: timeit("S0==S0y", number=1000000)
1000000 loops, best of 3: 511 ns per loop
sage: timeit("S1==S1y", number=1000000)
1000000 loops, best of 3: 493 ns per loop
sage: timeit("S2==S2y", number=1000000)
1000000 loops, best of 3: 583 ns per loop
sage: timeit("S3==S3y", number=1000000)
1000000 loops, best of 3: 1.41 µs per loop

Late differences:

sage: T0z1 = T0+T0
sage: T0z2 = T0+T0x
sage: T1z1 = T1+T1
sage: T1z2 = T1+T1x
sage: T2z1 = T2+T2
sage: T2z2 = T2+T2x
sage: T3z1 = T3+T3
sage: T3z2 = T3+T3x
sage: timeit("T0z1==T0z2", number=100000)
100000 loops, best of 3: 206 ns per loop
sage: timeit("T1z1==T1z2", number=100000)
100000 loops, best of 3: 308 ns per loop
sage: timeit("T2z1==T2z2", number=100000)
100000 loops, best of 3: 640 ns per loop
sage: timeit("T3z1==T3z2", number=100000)
100000 loops, best of 3: 47.8 µs per loop
sage: S0z1 = S0+S0
sage: S0z2 = S0+S0x
sage: S1z1 = S1+S1
sage: S1z2 = S1+S1x
sage: S2z1 = S2+S2
sage: S2z2 = S2+S2x
sage: S3z1 = S3+S3
sage: S3z2 = S3+S3x
sage: timeit("S0z1==S0z2", number=100000)
100000 loops, best of 3: 585 ns per loop
sage: timeit("S1z1==S1z2", number=100000)
100000 loops, best of 3: 494 ns per loop
sage: timeit("S2z1==S2z2", number=100000)
100000 loops, best of 3: 555 ns per loop
sage: timeit("S3z1==S3z2", number=100000)
100000 loops, best of 3: 578 ns per loop

Hash

This is the only operation for which bounded integer sequences seem to be consistently faster than tuples:

sage: timeit("hash(T0)", number=100000)
100000 loops, best of 3: 179 ns per loop
sage: timeit("hash(T0)", number=1000000)
1000000 loops, best of 3: 177 ns per loop
sage: timeit("hash(T1)", number=1000000)
1000000 loops, best of 3: 267 ns per loop
sage: timeit("hash(T2)", number=1000000)
1000000 loops, best of 3: 584 ns per loop
sage: timeit("hash(T3)", number=100000)
100000 loops, best of 3: 45.3 µs per loop
sage: timeit("hash(S0)", number=1000000)
1000000 loops, best of 3: 145 ns per loop
sage: timeit("hash(S1)", number=1000000)
1000000 loops, best of 3: 153 ns per loop
sage: timeit("hash(S2)", number=1000000)
1000000 loops, best of 3: 194 ns per loop
sage: timeit("hash(S3)", number=1000000)
1000000 loops, best of 3: 8.97 µs per loop

Concatenation

sage: timeit("T0+T0", number=1000000)
1000000 loops, best of 3: 200 ns per loop
sage: timeit("T1+T1", number=1000000)
1000000 loops, best of 3: 311 ns per loop
sage: timeit("T2+T2", number=1000000)
1000000 loops, best of 3: 742 ns per loop
sage: timeit("T3+T3", number=10000)
10000 loops, best of 3: 40.3 µs per loop
sage: timeit("S0+S0", number=1000000)
1000000 loops, best of 3: 576 ns per loop
sage: timeit("S1+S1", number=1000000)
1000000 loops, best of 3: 590 ns per loop
sage: timeit("S2+S2", number=1000000)
1000000 loops, best of 3: 618 ns per loop
sage: timeit("S3+S3", number=1000000)
1000000 loops, best of 3: 2.17 µs per loop

Subsequences

Recall the definition of S0z1 etc. We find:

sage: S0z2.startswith(S0)
True
sage: S0z2.startswith(S0x)
False
sage: S0x in S0z2
True
sage: timeit("S0z2.startswith(S0)", number=1000000)
1000000 loops, best of 3: 239 ns per loop
sage: timeit("S0z2.startswith(S0x)", number=1000000)
1000000 loops, best of 3: 241 ns per loop
sage: timeit("S0x in S0z2", number=1000000)
1000000 loops, best of 3: 694 ns per loop
sage: timeit("S1z2.startswith(S1)", number=1000000)
1000000 loops, best of 3: 227 ns per loop
sage: timeit("S1z2.startswith(S1x)", number=1000000)
1000000 loops, best of 3: 223 ns per loop
sage: timeit("S1x in S1z2", number=1000000)
1000000 loops, best of 3: 1.08 µs per loop
sage: timeit("S2z2.startswith(S2)", number=1000000)
1000000 loops, best of 3: 247 ns per loop
sage: timeit("S2z2.startswith(S2x)", number=1000000)
1000000 loops, best of 3: 230 ns per loop
sage: timeit("S2x in S2z2", number=1000000)
1000000 loops, best of 3: 3.21 µs per loop
sage: timeit("S3z2.startswith(S3)", number=1000000)
1000000 loops, best of 3: 989 ns per loop
sage: timeit("S3z2.startswith(S3x)", number=1000000)
1000000 loops, best of 3: 218 ns per loop
sage: timeit("S3x in S3z2", number=1000)
1000 loops, best of 3: 3.57 ms per loop

I wonder if the latter could be improved. Another "TODO"...


New commits:

5dc78c5  Implement sequences of bounded integers