add inplace `strutils.strip`

timotheecour commented 3 years ago

refs https://github.com/nim-lang/Nim/pull/16207#discussion_r536478385

more efficient than strip(leading = false) or even strip

Note that inplace also makes sense without leading = false: it amounts to a memmv in the case where there's a prefix to strip.

ringabout commented 3 years ago

strip inplace

ref https://github.com/nim-lang/Nim/pull/13610

import strutils

proc strip*(s: var string, leading = true, trailing = true,
            chars: set[char] = Whitespace) =

  var
    first = 0
    last = len(s)-1
  if leading:
    while first <= last and s[first] in chars: inc(first)
  if trailing:
    while last >= 0 and s[last] in chars: dec(last)

  if first > 0:
    for index in first .. last:
      s[index - first] = s[index]

  s.setLen(last - first + 1)

var a = "  vhellov   "
strip(a)
doAssert a == "vhellov"

a = "  vhellov   "
a.strip(leading = false)
doAssert a == "  vhellov"

a = "  vhellov   "
a.strip(trailing = false)
doAssert a == "vhellov   "

a.strip()
a.strip(chars = {'v'})
doAssert a == "hello"

a = "  vhellov   "
a.strip()
a.strip(leading = false, chars = {'v'})
doAssert a == "vhello"

var c = "blaXbla"
c.strip(chars = {'b', 'a'})
doAssert c == "laXbl"
c = "blaXbla"
c.strip(chars = {'b', 'a', 'l'})
doAssert c == "X"

I doubt moveMem works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem for c backend; maybe something like this:

[x] add moveMem in JS backend and VM and change index to moveMem or use when defined

or

 when defined(js) or ...:
    for index in first .. last:
      s[index - first] = s[index]
 else:
    moveMem(addr result[0], addr result[first], last - first + 1)

timotheecour commented 3 years ago

great!

 if first > 0:
    for index in first .. last:
      s[index - first] = s[index]

=> should use system.moveMem

more generally, a lot of procs should become inline, the question is whether it'd be in some new strutils2 (https://github.com/timotheecour/Nim/issues/300) or in existing strutils see also:

https://github.com/timotheecour/Nim/pull/54
compiler/strutils2

ringabout commented 3 years ago

But moveMem is unsafe? Is it worth introducing unsafe blocks?

proc substr*(s: string, first, last: int): string =
  let first = max(first, 0)
  let L = max(min(last, high(s)) - first + 1, 0)
  result = newString(L)
  for i in 0 .. L-1:
    result[i] = s[i+first]

timotheecour commented 3 years ago

But moveMem is unsafe?

that's only a problem if misused, the semantics are clear.

Is it worth introducing unsafe blocks?

for performance, yes. stdlib should do the dirty work for performance hidden behind a safe API (which strip would be) so users don't ever have to re-implement those

note

I doubt moveMem works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem for c backend; maybe something like this:

proc moveMem(dest, source: pointer; size: Natural) = ...

# new:
proc moveMem[T](dest: var openArray[T], source: openArray[T]) = ...
  ## dest, source are allowed to overlap
proc copyMem[T](dest: var openArray[T], source: openArray[T]) = ...
  ## dest, source are not allowed to overlap

ringabout commented 3 years ago

strutils2.nim

I will change the implementaion of this to get better performance

proc toLowerAscii*(a: var string) =
  for c in mitems(a):
    c = chr(c.ord + (if c.isUpperAscii: (ord('a') - ord('A')) else: 0))

to

proc toLowerAscii*(a: var string) =
  for c in mitems(a):
    c = char(if c.isUpperAscii: uint8(c) xor 0b0010_0000'u8 else: uint8(c))

The trick is really confusing, the two codes get the similar result.

timotheecour commented 3 years ago

nice!

if c in {'A'..'Z'}:

=> c.isUpperAscii:

should be same right?

ringabout commented 3 years ago

Yeah, the same except that I inline the operation. Edited: I noticed isUpperAscii is inlined at strutils2.nim(not in strutils.nim) just now ...

timotheecour commented 3 years ago

I know, i meant performance wise should be same, in which case c.isUpperAscii is preferable (better to reuse code); (ok it's a very minor point...)

char(uint8(c) xor 0b0010_0000'u8)

I wonder if similar tricks could be used elsewhere; refs https://github.com/timotheecour/Nim/issues/414

timotheecour commented 3 years ago

indeed...

best is to reuse code instead of copy pasting implementation, for those places:

[ ] compiler/strutils2.toLowerAscii

[ ] strutils.normalize (uses:

if s[i] in {'A'..'Z'}:
  result[j] = chr(ord(s[i]) + (ord('a') - ord('A')))

[ ] nimIdentNormalize

if s[i] in {'A'..'Z'}:
  result[j] = chr(ord(s[i]) + (ord('a') - ord('A')))
  inc j

(+ elsewhere maybe)

=> these can reuse toUpperAscii / toLowerAscii (maybe after making it {.inline.})

juancarlospaco commented 3 years ago

Hello, I didnt know this existed seen linked on a PR, move/delete if needed. Heres some ideas:

import bitops
doAssert char(bitor(ord('A'), 32)) == 'a'
doAssert char(bitand(ord('a'), 95)) == 'A'

1.25 ~ 2+ X speed VS stdlib way, rare worse case is same speed, I guess because shift with intrinsics of bitop. Try ord :arrow_right: uint8 too.

ringabout commented 3 years ago

PR is welcome, please provide a better benchmark program.

juancarlospaco commented 3 years ago

Theres a lot of hardcoded "random guess" values for newStringOfCap, almost always wrong. This is usually kinda right:

template wordsToCap*(wordCount: static[Positive]): static[int] =
  ## Average length of words, World `9`, English `6` (Wikipedia, Wolfram, Oxford, etc).
  ## Typing race competences use lower average length of words of `6` and higher.
  ## Yes theres shorter and longer words but this is the most frequently used words.
  ## http://ravi.io/language-word-lengths
  ## (9 * wordCount) + (len(whitespace or newline) * wordCount)
  const averageWordLength {.strdefine.}: Positive = 9
  # max((averageWordLength * wordCount) + wordCount , 1_000) # Alternative with "threshold".
  (averageWordLength * wordCount) + wordCount # Still better than a random guess

Usage https://github.com/juancarlospaco/nim-bytes2human/blob/master/src/bytes2human.nim#L63

Hardcoding 80 is kinda frequent, but totally random and made up, and in my experience in practice always wrong by far.

It is compile time and simple logic, so mostly bug free. Works for newString, newSeqOfCap, etc.

Maybe a better algo is found on the future, and the proc can be improved without breaking api.

Maybe you can add extra "padding" if theres lots of RAM installed ?.

Worse case scenario is always better than a hardcoded arbitrary number, like a cache you hit the happy path frequently.

ringabout commented 3 years ago

Looks good to me. Please convince Araq given a great and comprehensive benchmark or other supporting materials :)

timotheecour / Nim

add inplace `strutils.strip` #422

note