timotheecour / Nim

Nim is a compiled, garbage-collected systems programming language with a design that focuses on efficiency, expressiveness, and elegance (in that order of priority).
http://nim-lang.org/
Other
2 stars 0 forks source link

add inplace `strutils.strip` #422

Closed timotheecour closed 3 years ago

timotheecour commented 3 years ago

refs https://github.com/nim-lang/Nim/pull/16207#discussion_r536478385

more efficient than strip(leading = false) or even strip

Note that inplace also makes sense without leading = false: it amounts to a memmv in the case where there's a prefix to strip.

ringabout commented 3 years ago

strip inplace

ref https://github.com/nim-lang/Nim/pull/13610

Related: https://techoverflow.net/2017/01/24/in-place-trimmingstripping-in-c/ https://techoverflow.net/2017/01/23/zero-copy-in-place-string-splitting-in-c/

import strutils

proc strip*(s: var string, leading = true, trailing = true,
            chars: set[char] = Whitespace) =

  var
    first = 0
    last = len(s)-1
  if leading:
    while first <= last and s[first] in chars: inc(first)
  if trailing:
    while last >= 0 and s[last] in chars: dec(last)

  if first > 0:
    for index in first .. last:
      s[index - first] = s[index]

  s.setLen(last - first + 1)

var a = "  vhellov   "
strip(a)
doAssert a == "vhellov"

a = "  vhellov   "
a.strip(leading = false)
doAssert a == "  vhellov"

a = "  vhellov   "
a.strip(trailing = false)
doAssert a == "vhellov   "

a.strip()
a.strip(chars = {'v'})
doAssert a == "hello"

a = "  vhellov   "
a.strip()
a.strip(leading = false, chars = {'v'})
doAssert a == "vhello"

var c = "blaXbla"
c.strip(chars = {'b', 'a'})
doAssert c == "laXbl"
c = "blaXbla"
c.strip(chars = {'b', 'a', 'l'})
doAssert c == "X"

I doubt moveMem works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem for c backend; maybe something like this:

or

 when defined(js) or ...:
    for index in first .. last:
      s[index - first] = s[index]
 else:
    moveMem(addr result[0], addr result[first], last - first + 1)
timotheecour commented 3 years ago

great!

 if first > 0:
    for index in first .. last:
      s[index - first] = s[index]

=> should use system.moveMem

more generally, a lot of procs should become inline, the question is whether it'd be in some new strutils2 (https://github.com/timotheecour/Nim/issues/300) or in existing strutils see also:

ringabout commented 3 years ago

But moveMem is unsafe? Is it worth introducing unsafe blocks?

proc substr*(s: string, first, last: int): string =
  let first = max(first, 0)
  let L = max(min(last, high(s)) - first + 1, 0)
  result = newString(L)
  for i in 0 .. L-1:
    result[i] = s[i+first]
timotheecour commented 3 years ago

But moveMem is unsafe?

that's only a problem if misused, the semantics are clear.

Is it worth introducing unsafe blocks?

for performance, yes. stdlib should do the dirty work for performance hidden behind a safe API (which strip would be) so users don't ever have to re-implement those

note

I doubt moveMem works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem for c backend; maybe something like this:

proc moveMem(dest, source: pointer; size: Natural) = ...

# new:
proc moveMem[T](dest: var openArray[T], source: openArray[T]) = ...
  ## dest, source are allowed to overlap
proc copyMem[T](dest: var openArray[T], source: openArray[T]) = ...
  ## dest, source are not allowed to overlap
ringabout commented 3 years ago

strutils2.nim

I will change the implementaion of this to get better performance

proc toLowerAscii*(a: var string) =
  for c in mitems(a):
    c = chr(c.ord + (if c.isUpperAscii: (ord('a') - ord('A')) else: 0))

to

proc toLowerAscii*(a: var string) =
  for c in mitems(a):
    c = char(if c.isUpperAscii: uint8(c) xor 0b0010_0000'u8 else: uint8(c))

The trick is really confusing, the two codes get the similar result.

timotheecour commented 3 years ago

nice!

if c in {'A'..'Z'}:

=> c.isUpperAscii:

should be same right?

ringabout commented 3 years ago

Yeah, the same except that I inline the operation. Edited: I noticed isUpperAscii is inlined at strutils2.nim(not in strutils.nim) just now ...

timotheecour commented 3 years ago

char(uint8(c) xor 0b0010_0000'u8)

timotheecour commented 3 years ago

indeed...

best is to reuse code instead of copy pasting implementation, for those places:

(+ elsewhere maybe)

=> these can reuse toUpperAscii / toLowerAscii (maybe after making it {.inline.})

juancarlospaco commented 3 years ago

Hello, I didnt know this existed seen linked on a PR, move/delete if needed. Heres some ideas:

import bitops
doAssert char(bitor(ord('A'), 32)) == 'a'
doAssert char(bitand(ord('a'), 95)) == 'A'

1.25 ~ 2+ X speed VS stdlib way, rare worse case is same speed, I guess because shift with intrinsics of bitop. Try ord :arrow_right: uint8 too.

ringabout commented 3 years ago

PR is welcome, please provide a better benchmark program.

juancarlospaco commented 3 years ago

Theres a lot of hardcoded "random guess" values for newStringOfCap, almost always wrong. This is usually kinda right:

template wordsToCap*(wordCount: static[Positive]): static[int] =
  ## Average length of words, World `9`, English `6` (Wikipedia, Wolfram, Oxford, etc).
  ## Typing race competences use lower average length of words of `6` and higher.
  ## Yes theres shorter and longer words but this is the most frequently used words.
  ## http://ravi.io/language-word-lengths
  ## (9 * wordCount) + (len(whitespace or newline) * wordCount)
  const averageWordLength {.strdefine.}: Positive = 9
  # max((averageWordLength * wordCount) + wordCount , 1_000) # Alternative with "threshold".
  (averageWordLength * wordCount) + wordCount # Still better than a random guess

https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/filters.nim#L55 https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/filters.nim#L69 https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/ccgmerge.nim#L153 https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/ndi.nim#L39 https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/lib/system/fatal.nim#L34 https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/syntaxes.nim#L42 https://github.com/nim-lang/Nim/blob/555cfd1d59e773e92a7158b7c3b5908017f1ddac/testament/testament.nim#L124 https://github.com/nim-lang/Nim/blob/bb1c962286922487c5243ca7304fd95fdd27ea53/lib/system/excpt.nim#L258 https://github.com/nim-lang/Nim/blob/bb1c962286922487c5243ca7304fd95fdd27ea53/lib/system/excpt.nim#L359 https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/llstream.nim#L111 https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/rodimpl.nim#L362 https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/rodimpl.nim#L25 https://github.com/nim-lang/Nim/blob/23d23ecb081be6702d74024be8f96d92d9f88a59/lib/system/io.nim#L486 https://github.com/nim-lang/Nim/blob/122f22d1632255cf7d0539c620a30155dc84fd69/lib/pure/asynchttpserver.nim#L299 ... theres tons of it, just search newStringOfCap(

Hardcoding 80 is kinda frequent, but totally random and made up, and in my experience in practice always wrong by far.

It is compile time and simple logic, so mostly bug free. Works for newString, newSeqOfCap, etc.

Maybe a better algo is found on the future, and the proc can be improved without breaking api.

Maybe you can add extra "padding" if theres lots of RAM installed ?.

Worse case scenario is always better than a hardcoded arbitrary number, like a cache you hit the happy path frequently.

ringabout commented 3 years ago

Looks good to me. Please convince Araq given a great and comprehensive benchmark or other supporting materials :)