nim-lang / RFCs

A repository for your Nim proposals.
136 stars 23 forks source link

Generic `count` for stdlib use outside `len` #217

Closed metagn closed 4 years ago

metagn commented 4 years ago

Originally discussed in https://github.com/nim-lang/Nim/pull/14162. Writing so many RFCs is taxing but this needs its own place to be discussed.

Problem: cstring is a victim of improper use by templates/generic procs that check for a len overload on a type. cstrings len is O(n) with n pointer dereferences which is needlessly complex for anything named len. The length of some data structures (lists, cstrings) are non-constant. This can cause some algorithms that use the length but don't need it, like generating newSeq(x.len) for efficiency instead of an empty seq and adding to it, to be needlessly slow.

Proposal: len should be conventionally defined for procs that have constant/fast access to their length, and there should be a generic unary count (maybe go in sequtils? can debate name) that iterates over the type. len for cstring should be deprecated outside of JS.

# template so as to work for iterators
template count(x: untyped): int =
  var res = 0
  for _ in x:
    inc res
  res

This works on top of sequtils.count since that one takes 2 arguments, same with countIt. It's also in line with how other languages implement and apply meaning to count. An alternative implementation could be:

from sequtils import countIt
template count(x: untyped): int = countIt(x, true)

After defining this original count, we can extend it to better support types that have len defined, like so:

type Lengthable = concept x
  x.len is int

template count(x: Lengthable) = x.len
template count(x: typed) = # ...

This implementation would break the use of count for iterators, but the point of count is to write better algorithms for data structures, you can just use toSeq at that point. countIt works otherwise.

Flaws & backwards compatibility: The name count should be backwards compatible, other options include getLen, countLen. Deprecating cstring.len is the crux of this issue.

I have an idea as to how you can deprecate it for JS too. JS cstring actually counts unicode characters as 1 character (https://github.com/nim-lang/Nim/issues/10911), so we might use a new overload instead of len here, but I don't know what the name would be.

If not, and JS cstring len should be kept, then that's also doable, but it would be hard to document. Declaration branching between different backends is never fun and when it's for a deprecated symbol that's even worse JS cstring is a topic for another day, not deprecating len for cstrings is fine, c_strlen should stay in system anyway

The main backwards compatibility issue is porting standard library procs to count, this would only happen when we start using concepts though and that's going to be backwards incompatible in of itself

disruptek commented 4 years ago

I like sequtils.count. Another example is linkedlists, but I would rather that collection simply offered a type that hid all the machinery and stored the size of the list. Still, count should be a thing.

bung87 commented 4 years ago

for js unicode there's https://nim-lang.org/docs/unicode.html#runeLen%2Cstring

Araq commented 4 years ago

Why do we need count? If it's allowed to be expensive in O(N) terms generic code should avoid it anyway.

metagn commented 4 years ago

The point here is to make sure generic code doesn't use it. If we have a concept like Indexable that uses [] and len then it should warn or error when using it for a type like cstring. Since we can't just delete len for cstring (also we need it for JS), we have to deprecate it on C backend, so a proc that uses Indexable will create a warning for now. count is merely a last resort in case you really need the length of something, you should be inclined to do let L = x.count first.

Also, this doesn't close https://github.com/nim-lang/Nim/pull/14162, I miscommunicated, count is separate from isEmpty. @disruptek you can reopen if you want with Araq's suggestion

Araq commented 4 years ago

Well so len for cstring needs a deprecation period and instead count but why do we need sequtils.count for everything that defines items?

More importantly, the premise is wrong: In practice length information in a generic context can be valuable and speed up computations even if len is O(N):


        var result = newSeq[type(iter)](iter2.len) # ok, need to iterate over the C string once
        var i = 0
        for x in iter2: # ok, need to iterate over it once again, but now it's in the cache
          result[i] = x
          inc i
        result
metagn commented 4 years ago

why do we need sequtils.count for everything that defines items

The point here is we separate everything that has len vs everything that can have len calculated. We could do:

template count(x: Indexable): int = x.len

when Indexable is implemented, or for now just compiles(x.len). If a proc, say, in strutils, needs the length of a parameter (rfind), they would use count, but everything else should check for len for the most efficiency (if you believe cstring is efficient with len then we can simply not deprecate it) then fall back to a generic solution that uses a cap or something.

I'll add a when compiles(x.len) branch in my PR if you want but it's going to be messy like toSeq is since it has to account for iterators.

Araq commented 4 years ago

As I said, the issue is unconvincing for today's computers. Preallocated sizes are often more important to have than avoiding O(N) traversals and if findIt is bad for type T even though T has a len, findIt could be specialized for T. Currently with our untyped design that's unfeasible to do, so maybe we should put our focus on this "untyped it templates" issue instead.

disruptek commented 4 years ago

The issue has less to do with the hardware and more to do with the programmers being able to communicate with one another as to the nature of the operation. You admit that O(n) won't always be avoided, so you can't also argue that count shouldn't be exposed to users that need it. With count, they know what they are getting when they need to ask for it.

mratsim commented 4 years ago

For ensembles it can be card (for cardinality) otherwise a slightly longer countItems might convey more the O(n) nature?

In any case, trees, graphs and other linked data-structure might not store counts of a subgraph/subtree and a count convention might be useful there.

metagn commented 4 years ago

Weird RFC, closing since there's no real point here