tcbrindle / flux

A C++20 library for sequence-orientated programming
https://tristanbrindle.com/flux/
Boost Software License 1.0
472 stars 29 forks source link

Reference stability requirements? #101

Open brevzin opened 1 year ago

brevzin commented 1 year ago

If I have a sequence whose element type is T&, are there any rules around what the lifetime of that reference needs to be? Specifically, is this necessary valid:

auto cursor = flux::first(seq);
T& r = flux::read_at(seq, cursor);
flux::inc(seq, cursor);
use(r); // cursor has advanced, can I still use r?

One motivation for this question is how to do algorithms. Like min is currently https://github.com/tcbrindle/flux/blob/87b50c52b74123442e981ff1394119af9db7cf84/include/flux/op/minmax.hpp#L23-L37

where fold_first is https://github.com/tcbrindle/flux/blob/87b50c52b74123442e981ff1394119af9db7cf84/include/flux/op/fold.hpp#L36-L59

What this means is that if I have a sequence where value=string and element=string const&, and I'm finding the minimum string, the worst-case scenario (the sequence is perfectly sorted, backwards) is that I'm doing N string copies. But in this case, if references were guaranteed/required stable, you could do a much more efficient approach via this:

template <sequence Seq, strict_weak_order_for<Seq> Cmp = std::ranges::less> 
[[nodiscard]] 
constexpr auto min_ref(Seq&& seq, Cmp cmp = Cmp{}) const 
    -> flux::optional<element_t<Seq>> // <== NB
{
    auto cursor = flux::first(seq);
    if (flux::is_last(seq, cursor)) {
        return nullopt;
    }

    value_t<Seq> const* best = &flux::read_at(seq, cursor);
    flux::inc(seq, cursor);
    while (not flux::is_last(seq, cursor)) {
        value_t<Seq> const* elem = &flux::read_at(seq, cursor);
        if (std::invoke(cmp, *elem, *best)) {
            best = elem;
        }
        flux::inc(seq, cursor);
    }
    return *best;
}

This does zero string copies. Basically, this is min_element instead of min. This version returns an optional reference, but an equivalent that returns an optional value would still be better since it does exactly 1 string copy.

So I guess the question is:

If a sequence's element type is an lvalue reference, does advancing the cursor invalidate that reference?

If NO, then we probably want a version of fold_first that returns optional<element> instead of optional<value>. Possibly up to even having fold_first itself just only ever return optional<element> (which is what it does in Rust [^1]) or possibly have the callable required to either return value or element and return accordingly.

If YES, then... I guess no further work necessary, but probably should be noted somewhere regardless.

[^1]: Rust originally called it fold_first, which I thought was a good name, so that's what I named our version of the algorithm for C++23, then they rename it to reduce. Jerks.

tcbrindle commented 1 year ago

I haven't documented this anywhere, but I definitely should, so here we go...

For single-pass sequences, there is no reference stability: that is, if you're holding on to a reference to an element returned by read_at(), it may be invalidated by the next call to inc() (via any cursor, but of course you should only have one when dealing with single-pass sequences).

For multipass_sequences, element references should not be invalidated, and your min_ref above should be valid.

I've previously thought there should be alternative versions of min/max returning optional<element_t> as you suggest: I don't know whether it would be better to just make these overloads with the same names, or call them something different as with ranges::min and ranges::min_element.

A related question that I haven't quite answered yet is about cursor invalidation, specifically, if I have a valid cursor cur for a copyable multipass sequence seq, then does cur need to be valid for a copy of seq? This seems like a desirable property and is true for integer indices, but I'd worry that it might limit implementations?

(I think we do need to require that given auto seq2 = std::move(seq), then cur is valid for seq2, but I think that's an easier thing to make work.)

brevzin commented 1 year ago

(I think we do need to require that given auto seq2 = std::move(seq), then cur is valid for seq2, but I think that's an easier thing to make work.)

I don't think you can do that in general? Especially if you're providing an adapter from a C++ range to a sequence, since there's definitely no guarantee that an iterator is still valid through a move.

tcbrindle commented 1 year ago

The particular example I had in mind was a rotate adaptor:

template <multipass_sequence Seq>
auto rotate(Seq seq, cursor_t<Seq> mid) -> multipass_sequence auto;

This is a relatively easy adaptor to write: we iterate from mid until the end of seq, and then from the start of seq up to mid. But since we need to store (a move-initialised version of) mid in the adaptor object, we have to make sure that it's still valid for the move-initialised version of seq that the adaptor also stores.

tcbrindle commented 1 year ago

Having pondered it some more, I think that having fold_first() (and min() and friends) returning optional<element_t> is too close to the kind of unsafe interface I'd rather avoid -- it would make it too easy to pass an rvalue sequence in and get a dangling optional<T&> out:

auto elem = flux::min(get_vector()); // assume this returns optional<element_t>

if (elem.has_value()) { // engaged if source vec was non-empty
    do_stuff(*elem); // uh-oh
}

Instead I think it would be better to have a more direct equivalent of std::min_element, that is, an algorithm which returns a cursor to the minimum element rather than the element itself. This is not only safer, but also potentially more useful -- I can definitely imagine situations where I'd want to know the position of the minimum/maximum element. The implementation could be something like:

template <multipass_sequence Seq, typename Cmp = std::ranges::less>
auto find_min(Seq&& seq, Cmp cmp = {}) -> cursor_t<Seq>
{
    auto min = first(seq);
    if (is_last(seq, min)) { 
        return min;
    }

    for (auto cur = next(seq, min); !is_last(seq, cur); inc(seq, cur)) {
        if (invoke(cmp, read_at(seq, cur), read_at(seq, min)) {
           min = cur;
        } 
    }

    return min;
}

We could also potentially optimise min() (and maybe fold_first()?) to avoid copies in the case where we're dealing with a multipass sequence whose element type is an lvalue reference, but I think the external interface (returning optional<value_t>) is probably best left as-is.

brevzin commented 1 year ago

Downside of min_element, outside of just rarely being the interface you want, is that you also lose efficiency in the case where read_at actually has to do work (e.g. you're looking up the minimum of a .map(f), you're going to have to call f lots of times, and then once more to read out the value).

Certainly true that returning a cursor is safer.