Add product/permutations/combinations from Python itertools

tcbrindle / flux

A C++20 library for sequence-orientated programming

https://tristanbrindle.com/flux/

Boost Software License 1.0

529 stars 31 forks source link

Add product/permutations/combinations from Python itertools #138

Open tcbrindle opened 11 months ago

tcbrindle commented 11 months ago

We currently have cartestian_product(seq1, seq2, ...) which takes a pack of N sequences and yields N-tuples of all the possible combinations of elements from those sequences. This is equivalent to a set of N nested for loops, and to one form of the Python itertools product operation.

It's pretty common however to want to use the same sequence repeatedly with cartesian_product. Itertools provides a second form of product for this, using a repeat argument. It would be good if we provided an equivalent too.

We'd need it to have a different name, and we'd probably want to take N as a template parameter to avoid having to allocate. So we could have something like products<3>(seq), which would be equivalent to cartestian_product(seq, seq, seq) except that we wouldn't have to store multiple references to/copies of the sequence. The size of the resulting sequence would be size(seq)^N.

Itertools also provides the permutations operation, which would also be nice to have. This yields (in lexicographical order) all the length-N permutations of the elements of the input sequence. This is a strict subset of the elements of products<N>, skipping those tuples which contain repeated elements from the underlying sequence. If S is the size of the original sequence, the size of permutations<N> is S!/(S - N)!.

Finally, Itertools has two slightly different combinations operations. The first gives a strict subset of permutations, skipping tuples in which the elements are not in "sorted" order (according to their position in the initial sequence), and has size S!/N!(S-N)!

The second, the slightly oddly named (to me) combinations.with_replacement. This does the same as combinations except that it additionally allows individual elements to be repeated -- meaning the resulting elements are a subset of product<N>, but not permutations<N>. The number of elements is (S+N-1)!/N!(S-1)!.

To make this a bit more concrete, imagine an input sequenceseq = [a, b, c, d]. Then the various operations give:

products<3>(seq): (4^3 = 64 elements)

aaa, aab, aac, aad, aba, abb, abc, abd, aca, acb, acc, acd, ada, adb, adc, add
baa, bab, bac, bad, bba, bbb, bbc, bbd, bca, bcb, bcc, bcd, bda, bdb, bdc, bdd
caa, cab, cac, cad, cba, cbb, cbc, cbd, cca, ccb, ccc, ccd, cda, cdb, cdc, cdd,
daa, dab, dac, dad, dba, dbb, dbc, dbd, dca, dcb, dcc, dcd, dda, ddb, dcd, ddd

permutations<3>(seq): (4!/(4-3!) = 24 elements)

abc, abd, acb, acd, adb, adc,
bac, bad, bca, bcd, bda, bdc,
cab, cad, cba, cbd, cda, cdb,
dab, dac, dba, dbc, dca, dcb

combinations<3>(seq): (4!/3!*(4-3)! = 24/6 = 4 elements)

abc, abd, acd, bcd

combinations_with_replacement<3>(seq): ((4+3-1)!/3!(4-1)! = 6!/3!3! = 720/36 = 20 elements)

aaa, aab, aac, aad, abb, abc, abd, acc, acd, add,
bbb, bbc, bbd, bcc, bcd, bdd,
ccc, ccd, cdd,
ddd

isaacy2012 commented 10 months ago

I'm currently working on the products feature and I have a rough implementation that produces the correct output for the example seq = [a, b, c, d]. However, the majority of the code in products.hpp is the same as that in cartesian_product.hpp.

For example, the code in for_each_while_impl is the same for both except that sizeof...(Bases) is replaced with RepeatCount.

How might be a good way to reduce code duplication here?

I was thinking of extracting common functionality to a shared base class (similar to how cartesian_product_traits_base is used for both cartesian_product and cartesian_product_with).

tcbrindle commented 10 months ago

Hi @isaacy2012,

I'm currently working on the products feature and I have a rough implementation that produces the correct output for the example seq = [a, b, c, d]. However, the majority of the code in products.hpp is the same as that in cartesian_product.hpp.

Thanks a lot for working on this.

For example, the code in for_each_while_impl is the same for both except that sizeof...(Bases) is replaced with RepeatCount.

How might be a good way to reduce code duplication here?

I was thinking of extracting common functionality to a shared base class (similar to how cartesian_product_traits_base is used for both cartesian_product and cartesian_product_with).

Yes, I think moving as much as possible up to a shared base class sounds like a good approach :)

cartesian_product_adaptor is one of the most complex adaptors we have in the library so you've chosen a tricky place to start! But if you have a working implementation already then it sounds like you're on the right track. If I can help, or if you have any questions about how Flux works then please let me know. Good luck!

Tristan

isaacy2012 commented 10 months ago

Hi Tristan,

Thanks for the feedback, I've moved the majority of the code shared between cartesian_product, cartesian_product_with, and the new cartesian_power (I feel it differentiates better with cartesian_product than simply products while keeping a standard prefix, though it does stray from the python.itertools naming – would be interested to hear your thoughts on the naming).

The code sharing between cartesian_product and cartesian_power is mostly enabled by a new get<I>() function in cartesian_traits_base that becomes std::get<I>(self.bases_) for cartesian_product and self.base_ for cartesian_power, done through CRTP. This is the main change that facilitates effectively "repeating" the sequences.

However, because cartesian_product_with is unique in that it does not expose move_at, read_at_unchecked, etc..., those functions can't be publicly visible in the cartesian_traits_base struct. I've tried two approaches:

Make them protected in cartesian_traits_base and use using statements in cartesian_product_traits_base and cartesian_power_traits_base to expose them publicly

See https://github.com/tcbrindle/flux/compare/main...isaacy2012:flux:feat/products

Use another trait base cartesian_default_read_traits_base to expose those functions by having cartesian_product_traits_base and cartesian_power_traits_base inherit from it.

See https://github.com/tcbrindle/flux/compare/main...isaacy2012:flux:feat/products-default-read-traits

I don't like 1. because of the repeated using statements, but 2. is a bit confusing because of the inheritance chain:

For cartesian_product sequence traits
---

sequence_traits<cartesian_product_adaptor<...>> : 
cartesian_product_without_traits_base<...> :
cartesian_default_read_traits_base<cartesian_product_traits_base<...>> : // inherits from the type parameter
cartesian_product_traits_base<...> :
cartesian_traits_base<...>

For cartesian_product_with sequence traits
---

cartesian_product_with_adaptor::flux_sequence_traits : 
cartesian_product_with_traits_base<...> :
cartesian_product_traits_base<...>
cartesian_traits_base<...>

For cartesian_power sequence traits
---

sequence_traits<cartesian_power_adaptor<...>> :
cartesian_power_traits_base<...> :
cartesian_default_read_traits_base<cartesian_traits_base<...>> :  // inherits from the type parameter
cartesian_traits_base<...>

What are your thoughts on those approaches, or what might be a better way of expressing the orthogonal functionality of product vs power and _with vs [without] (I imagine that power_with is a natural next step/feature).

Many thanks, Isaac

tcbrindle commented 10 months ago

Hi Tristan,

Thanks for the feedback, I've moved the majority of the code shared between cartesian_product, cartesian_product_with, and the new cartesian_power (I feel it differentiates better with cartesian_product than simply products while keeping a standard prefix, though it does stray from the python.itertools naming – would be interested to hear your thoughts on the naming).

We already have flux::product which multiplies together all the elements of a sequence, so I think having a function called products that does something completely different is probably not a great idea! :). I quite like cartesian_power as an alternative. I'll do some research into what other languages/libraries call it to see if there's a commonly used name, but we can use cartestian_power as a placeholder for now.

The code sharing between cartesian_product and cartesian_power is mostly enabled by a new get<I>() function in cartesian_traits_base that becomes std::get<I>(self.bases_) for cartesian_product and self.base_ for cartesian_power, done through CRTP. This is the main change that facilitates effectively "repeating" the sequences.

This seems like a nice way of doing it 👍

However, because cartesian_product_with is unique in that it does not expose move_at, read_at_unchecked, etc..., those functions can't be publicly visible in the cartesian_traits_base struct. I've tried two approaches:

Make them protected in cartesian_traits_base and use using statements in cartesian_product_traits_base and cartesian_power_traits_base to expose them publicly

See main...isaacy2012:flux:feat/products

Use another trait base cartesian_default_read_traits_base to expose those functions by having cartesian_product_traits_base and cartesian_power_traits_base inherit from it.

See main...isaacy2012:flux:feat/products-default-read-traits

I don't like 1. because of the repeated using statements, but 2. is a bit confusing because of the inheritance chain:
For cartesian_product sequence traits
---

sequence_traits<cartesian_product_adaptor<...>> : 
cartesian_product_without_traits_base<...> :
cartesian_default_read_traits_base<cartesian_product_traits_base<...>> : // inherits from the type parameter
cartesian_product_traits_base<...> :
cartesian_traits_base<...>
For cartesian_product_with sequence traits
---

cartesian_product_with_adaptor::flux_sequence_traits : 
cartesian_product_with_traits_base<...> :
cartesian_product_traits_base<...>
cartesian_traits_base<...>
For cartesian_power sequence traits
---

sequence_traits<cartesian_power_adaptor<...>> :
cartesian_power_traits_base<...> :
cartesian_default_read_traits_base<cartesian_traits_base<...>> :  // inherits from the type parameter
cartesian_traits_base<...>
What are your thoughts on those approaches, or what might be a better way of expressing the orthogonal functionality of product vs power and _with vs [without] (I imagine that power_with is a natural next step/feature).

(FYI, I've been intending to rename cartesian_product_with to cartesian_product_map (to better describe what it does) and it's even listed that way in the documentation, but I hadn't got round to actually changing it in the code yet!)

But yeah, I can see the difficulty: we have four adaptor classes that do fairly similar things and we'd like to make sure we reuse code sensibly between the implementations.

Of the two approaches you show, I think I'd lean towards the second one -- like you, I'm not too keen on the using directives. But I wonder if we could simplify the inheritance hierarchy a little by getting rid of the cartesian_product_traits_base and cartesian_power_traits_base classes? So we'd have:

sequence_traits<cartesian_product_adaptor<...>>
  : cartesian_read_traits<...> // "mixin" to add read_at(), read_at_unchecked() returning a tuple
     :  cartesian_traits_base<...> // provides inc(), is_last() etc etc

sequence_traits<cartesian_product_map_adaptor<...>>
    : cartesian_map_traits<...> // "mixin" to add read_at() which invokes a function
      cartesian_traits_base<...>  // provides inc(), is_last() etc

sequence_traits<cartesian_power_adaptor<...>>
   : cartesian_read_traits<...>
      : cartesian_traits_base<...>

sequence_traits<cartesian_power_map_adaptor<...>>
   : cartesian_map_traits<...>
       : cartesian_traits_base<...>

This might involve a small amount of code duplication (e.g the size() impl), but I think this is justifiable if it simplifies things?

But I have to stress that I haven't actually tried it so I might be completely wrong...!

Tristan

tcbrindle commented 10 months ago

Something else that has just occurred to me: for the "power" versions, the cursor type could be std::array<cursor_t<Base>, N> rather than tuple<cursor_t<Base>, cursor_t<Base>, ...>... I don't know if this would make things easier or more difficult though!

tcbrindle commented 10 months ago

~Hi @isaacy2012, after saying I hadn't tried it I thought I'd better have a go! This is a very rough prototype of what I had in mind: https://flux.godbolt.org/z/54Wrs3Wha~

~As you can see, there's a bit of duplication in the top-level flux_sequence_traits classes, but apart from that I think this is quite nice.~

~What do you think?~

EDIT: Never mind, I came up with a better way of doing it -- see below

tcbrindle commented 10 months ago

Hi again @isaacy2012, sorry for bombarding you with messages!

I've been thinking about this some more and I think I have a cleaner approach.

Rather than using different base classes to encode the different policies -- product/power and map/tuple -- we can instead use just a single traits class that contains everything, and uses a pair of enums to select which code paths to use. The various adaptor classes (four of them) then specify which policies they want for their sequence_traits. I've coded up a proof of concept here: https://flux.godbolt.org/z/8jexzcreb

isaacy2012 commented 10 months ago

Hi @tcbrindle

Something else that has just occurred to me: for the "power" versions, the cursor type could be std::array<cursor_t<Base>, N> rather than tuple<cursor_t<Base>, cursor_t<Base>, ...>... I don't know if this would make things easier or more difficult though!

Yep, I think that's much better than the repeated tuple, though I've kept the repeated tuple type for the value_type so that the type is the same as it is for product – should that be changed to std::array too?

Regarding the switching between product/power and map/tuple, I think I actually preferred your mixin idea rather than the if constexpr in the base trait, since it delegates specifics to the derived trait/struct rather than the base trait having an awareness of all the various code paths. Also, since product_map doesn't expose the move_at, move_at_unchecked etc. functions, would we use requires/enable_if to conditionally enable those functions based on the read_kind?

Thanks!

tcbrindle commented 10 months ago

Hi @tcbrindle

Something else that has just occurred to me: for the "power" versions, the cursor type could be std::array<cursor_t<Base>, N> rather than tuple<cursor_t<Base>, cursor_t<Base>, ...>... I don't know if this would make things easier or more difficult though!

Yep, I think that's much better than the repeated tuple, though I've kept the repeated tuple type for the value_type so that the type is the same as it is for product – should that be changed to std::array too?

No, the element type (the thing returned by read_at) is a tuple, so the value type has to be a tuple as well, because there needs to be a conversion relationship between them.

Regarding the switching between product/power and map/tuple, I think I actually preferred your mixin idea rather than the if constexpr in the base trait, since it delegates specifics to the derived trait/struct rather than the base trait having an awareness of all the various code paths.

Yeah, from a pure design perspective bundling everything into one big class that contains every code path is certainly not the recommended way! But practically speaking I think it would probably end up being easier to understand and maintain in this case than using several different base classes with CRTP. I could be wrong though :)

Also, since product_map doesn't expose the move_at, move_at_unchecked etc. functions, would we use requires/enable_if to conditionally enable those functions based on the read_kind?

Yeah, the idea was to use a requires clause, e.g.

template <typename Self>
static constexpr auto move_at(Self& self, cursor_t<Self> const& cur) -> decltype(auto)
    requires (ReadKind = read_kind::tuple)
{
    return do_read(flux::move_at, self, cur);
}

We could actually do this for all the functions -- e.g. provide two versions of first() with different CartKind constraints. I only used if constexpr in the prototype example because it's slightly less typing, and generally compiles faster than constrained overloads.

isaacy2012 commented 10 months ago

Hi @tcbrindle

Yeah, from a pure design perspective bundling everything into one big class that contains every code path is certainly not the recommended way! But practically speaking I think it would probably end up being easier to understand and maintain in this case than using several different base classes with CRTP. I could be wrong though :)

I see, thanks for the insight, yeah I think it is definitely easier to understand.

Yeah, the idea was to use a requires clause, e.g.
template <typename Self>
static constexpr auto move_at(Self& self, cursor_t<Self> const& cur) -> decltype(auto)
    requires (ReadKind = read_kind::tuple)
{
    return do_read(flux::move_at, self, cur);
}
We could actually do this for all the functions -- e.g. provide two versions of first() with different CartKind constraints. I only used if constexpr in the prototype example because it's slightly less typing, and generally compiles faster than constrained overloads.

Great, I've done this, the only snag was that neither if constexpr nor the requires worked for defining the using cursor_type = and value_type, so I've used a struct that is partially specialised on the cartesian_kind to provide those types. I've also added cartesian_power_with (I assume the renaming to map will be done separately?)

https://github.com/tcbrindle/flux/compare/main...isaacy2012:flux:feat/cartesian-product-and-with

I was also wondering for cartesian_power_with, is there a better way to define regular_invocable_repeated? I needed to define that concept to restrict the mapping function to the form fn(element_t<Seq>, /* repeated PowN times */)

I probably won't get around to implementing permutations and the others yet, should I make a pull request?

tcbrindle commented 10 months ago

Hi @tcbrindle

Yeah, from a pure design perspective bundling everything into one big class that contains every code path is certainly not the recommended way! But practically speaking I think it would probably end up being easier to understand and maintain in this case than using several different base classes with CRTP. I could be wrong though :)

I see, thanks for the insight, yeah I think it is definitely easier to understand.
Yeah, the idea was to use a requires clause, e.g.
template <typename Self>
static constexpr auto move_at(Self& self, cursor_t<Self> const& cur) -> decltype(auto)
    requires (ReadKind = read_kind::tuple)
{
    return do_read(flux::move_at, self, cur);
}
We could actually do this for all the functions -- e.g. provide two versions of first() with different CartKind constraints. I only used if constexpr in the prototype example because it's slightly less typing, and generally compiles faster than constrained overloads.
Great, I've done this, the only snag was that neither if constexpr nor the requires worked for defining the using cursor_type = and value_type, so I've used a struct that is partially specialised on the cartesian_kind to provide those types.

So I think that actually we can get away with not defining a cursor_type alias, and just use cursor_t<Self> (rather than cursor_type<Self>, sorry for my bad naming!) in the function definitions.

value_type will be a bit trickier, because we'll need to have different definitions for the tuple and map versions... or in fact we'd rather the value_type wasn't defined at all for the map versions so we fall back to the default definition.

I've also added cartesian_power_with (I assume the renaming to map will be done separately?)

I'm quite happy for you to do the rename of cartesian_product_with while we've "got the hood up"

main...isaacy2012:flux:feat/cartesian-product-and-with

I was also wondering for cartesian_power_with, is there a better way to define regular_invocable_repeated? I needed to define that concept to restrict the mapping function to the form fn(element_t<Seq>, /* repeated PowN times */)

We actually already use this concept in adjacent_map_adaptor:

https://github.com/tcbrindle/flux/blob/f31a1c8ce62fdd9134bcc71f190d252b09883b5f/include/flux/op/adjacent.hpp#L179-L190

You could rename this to repeated_invocable and re-use it in both places.

I probably won't get around to implementing permutations and the others yet, should I make a pull request?

Yeah, let's not worry about permutations or combinations yet, this is a really impressive bit of work as it is! I'm still not entirely sure about the name, but if you'd like to submit a PR then that would be great :)

Thanks!

isaacy2012 commented 10 months ago

So I think that actually we can get away with not defining a cursor_type alias, and just use cursor_t<Self> (rather than cursor_type<Self>, sorry for my bad naming!) in the function definitions.

Yep, I've got this to work :)

value_type will be a bit trickier, because we'll need to have different definitions for the tuple and map versions... or in fact we'd rather the value_type wasn't defined at all for the map versions so we fall back to the default definition.

I've kept value_type but now restricted it so it's only explicitly defined when the ReadKind is read_kind::tuple. Unfortunately, I couldn't find a good way to conditionally expose the value_type, so I've settled for this:

template <std::size_t Arity, cartesian_kind CartesianKind, read_kind ReadKind, typename... Bases>
struct cartesian_traits_base : cartesian_traits_base_impl<Arity, CartesianKind, ReadKind, Bases...> {
    using impl = cartesian_traits_base_impl<Arity, CartesianKind, ReadKind, Bases...>;
};

template <std::size_t Arity, cartesian_kind CartesianKind, typename... Bases>
struct cartesian_traits_base<Arity, CartesianKind, read_kind::tuple, Bases...> : cartesian_traits_base_impl<Arity, CartesianKind, read_kind::tuple, Bases...> {
    using impl = cartesian_traits_base_impl<Arity, CartesianKind, read_kind::tuple, Bases...>;

    using value_type = typename impl::types::value_type;
};

Which means that the friend declaration in the adaptor looks like this now:

friend flux_sequence_traits::impl;

It's a small change but I don't like having the impl leak out into its use in the adaptor. Can you think of a better way to conditionally expose the value_type?

isaacy2012 commented 10 months ago

You could rename this to repeated_invocable and re-use it in both places.

I moved it to flux/core/concepts.hpp, which I've just realised is probably the wrong place, but I'm not quite sure where to put it since it's now used by both adjacent and cartesian_power_with. Possibly in a new file flux/op/detail/concepts.hpp?

tcbrindle commented 10 months ago

You could rename this to repeated_invocable and re-use it in both places.

I moved it to flux/core/concepts.hpp, which I've just realised is probably the wrong place, but I'm not quite sure where to put it since it's now used by both adjacent and cartesian_power_with. Possibly in a new file flux/op/detail/concepts.hpp?

There's flux/op/requirements.hpp for concepts that are useful to re-use between adaptors but aren't "core", if you see what I mean

nwad123 commented 1 week ago

I'm interested in taking a stab at implementing a permutation operation, but before I seriously get working on it I did a bit of research and realized there are a few different ways of approaching the problem. Pythons itertools.permutations operation seems to do the permutation in the same order every time regardless of whether or not the input range is sorted. C++s std::next_permutation always does the next lexicographical permutation of an input range, which means the sequence of permutations produced by std::next_permutation is dependent on the ordering of the input range.

C++ example on compiler explorer

Python example on compiler explorer

+------------+------------+
|      Perm[1,2,3]:       |
+------------+------------+
|    C++     |   Python   |
+------------+------------+
|  (1,2,3)   |  (1,2,3)   |
|  (1,3,2)   |  (1,3,2)   |
|  (2,1,3)   |  (2,1,3)   |
|  (2,3,1)   |  (2,3,1)   |
|  (3,1,2)   |  (3,1,2)   |
|  (3,2,1)   |  (3,2,1)   |
+------------+------------+

+------------+------------+
|      Perm[3,2,1]:       |
+------------+------------+
|    C++     |   Python   |
+------------+------------+
|  (3,2,1)   |  (3,2,1)   |
|  (1,2,3)   |  (3,1,2)   |
|  (1,3,2)   |  (2,3,1)   |
|  (2,1,3)   |  (2,1,3)   |
|  (2,3,1)   |  (1,3,2)   |
|  (3,1,2)   |  (1,2,3)   |
+------------+------------+

The links above have a simple example of producing all the permutations of two lists, [1, 2, 3], and [3, 2, 1] and the table is shown in the code block. The C++ and Python solutions output the same order of permutations for [1, 2, 3], but they output a different ordering of permutations for [3, 2, 1]. Both options are valid, but to me the C++ method makes more sense. I would imagine that in flux if I use the permutation operation I would get the same output that calling std::next_permutation S! / (S - N)! times would produce. What do other people think would be best?

nwad123 commented 1 week ago

The Rust itertools.permutations() seems to produce the same output as Pythons itertools.permutation, so maybe the Python/Rust method is actually a better way to go as opposed to the C++ method.

Rust example on compiler explorer

tcbrindle commented 1 week ago

Hi @nwad123, great to hear you're interested in taking this on!

Regarding the two difference approaches, the STL's next_permutation requires that the input sequence is bidirectional, and that its elements are comparable. The Python version (IIUC) doesn't require either of these constraints, so I think the Python approach is the better way to go for Flux.

We could in principle add the STL ordering as something like ordered_permutations later if it turns out to be desirable.