Closed timotheecour closed 3 years ago
strip inplace
ref https://github.com/nim-lang/Nim/pull/13610
Related: https://techoverflow.net/2017/01/24/in-place-trimmingstripping-in-c/ https://techoverflow.net/2017/01/23/zero-copy-in-place-string-splitting-in-c/
import strutils
proc strip*(s: var string, leading = true, trailing = true,
chars: set[char] = Whitespace) =
var
first = 0
last = len(s)-1
if leading:
while first <= last and s[first] in chars: inc(first)
if trailing:
while last >= 0 and s[last] in chars: dec(last)
if first > 0:
for index in first .. last:
s[index - first] = s[index]
s.setLen(last - first + 1)
var a = " vhellov "
strip(a)
doAssert a == "vhellov"
a = " vhellov "
a.strip(leading = false)
doAssert a == " vhellov"
a = " vhellov "
a.strip(trailing = false)
doAssert a == "vhellov "
a.strip()
a.strip(chars = {'v'})
doAssert a == "hello"
a = " vhellov "
a.strip()
a.strip(leading = false, chars = {'v'})
doAssert a == "vhello"
var c = "blaXbla"
c.strip(chars = {'b', 'a'})
doAssert c == "laXbl"
c = "blaXbla"
c.strip(chars = {'b', 'a', 'l'})
doAssert c == "X"
I doubt moveMem works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem for c backend; maybe something like this:
or
when defined(js) or ...:
for index in first .. last:
s[index - first] = s[index]
else:
moveMem(addr result[0], addr result[first], last - first + 1)
great!
if first > 0:
for index in first .. last:
s[index - first] = s[index]
=> should use system.moveMem
more generally, a lot of procs should become inline, the question is whether it'd be in some new strutils2 (https://github.com/timotheecour/Nim/issues/300) or in existing strutils see also:
But moveMem
is unsafe? Is it worth introducing unsafe blocks?
proc substr*(s: string, first, last: int): string =
let first = max(first, 0)
let L = max(min(last, high(s)) - first + 1, 0)
result = newString(L)
for i in 0 .. L-1:
result[i] = s[i+first]
But moveMem is unsafe?
that's only a problem if misused, the semantics are clear.
Is it worth introducing unsafe blocks?
for performance, yes. stdlib should do the dirty work for performance hidden behind a safe API (which strip would be) so users don't ever have to re-implement those
I doubt moveMem
works in VM or js (need to try) but if not, we should introduce an API that can be used in all backends and uses moveMem
for c backend; maybe something like this:
proc moveMem(dest, source: pointer; size: Natural) = ...
# new:
proc moveMem[T](dest: var openArray[T], source: openArray[T]) = ...
## dest, source are allowed to overlap
proc copyMem[T](dest: var openArray[T], source: openArray[T]) = ...
## dest, source are not allowed to overlap
strutils2.nim
I will change the implementaion of this to get better performance
proc toLowerAscii*(a: var string) =
for c in mitems(a):
c = chr(c.ord + (if c.isUpperAscii: (ord('a') - ord('A')) else: 0))
to
proc toLowerAscii*(a: var string) =
for c in mitems(a):
c = char(if c.isUpperAscii: uint8(c) xor 0b0010_0000'u8 else: uint8(c))
The trick is really confusing, the two codes get the similar result.
nice!
if c in {'A'..'Z'}:
=> c.isUpperAscii:
should be same right?
Yeah, the same except that I inline the operation.
Edited:
I noticed isUpperAscii
is inlined at strutils2.nim
(not in strutils.nim
) just now ...
c.isUpperAscii
is preferable (better to reuse code); (ok it's a very minor point...)char(uint8(c) xor 0b0010_0000'u8)
indeed...
best is to reuse code instead of copy pasting implementation, for those places:
if s[i] in {'A'..'Z'}:
result[j] = chr(ord(s[i]) + (ord('a') - ord('A')))
nimIdentNormalize
if s[i] in {'A'..'Z'}:
result[j] = chr(ord(s[i]) + (ord('a') - ord('A')))
inc j
(+ elsewhere maybe)
=> these can reuse toUpperAscii / toLowerAscii
(maybe after making it {.inline.}
)
Hello, I didnt know this existed seen linked on a PR, move/delete if needed. Heres some ideas:
import bitops
doAssert char(bitor(ord('A'), 32)) == 'a'
doAssert char(bitand(ord('a'), 95)) == 'A'
1.25 ~ 2+ X speed VS stdlib way, rare worse case is same speed, I guess because shift with intrinsics of bitop.
Try ord
:arrow_right: uint8
too.
PR is welcome, please provide a better benchmark program.
Theres a lot of hardcoded "random guess" values for newStringOfCap
, almost always wrong. This is usually kinda right:
template wordsToCap*(wordCount: static[Positive]): static[int] =
## Average length of words, World `9`, English `6` (Wikipedia, Wolfram, Oxford, etc).
## Typing race competences use lower average length of words of `6` and higher.
## Yes theres shorter and longer words but this is the most frequently used words.
## http://ravi.io/language-word-lengths
## (9 * wordCount) + (len(whitespace or newline) * wordCount)
const averageWordLength {.strdefine.}: Positive = 9
# max((averageWordLength * wordCount) + wordCount , 1_000) # Alternative with "threshold".
(averageWordLength * wordCount) + wordCount # Still better than a random guess
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/filters.nim#L55
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/filters.nim#L69
https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/ccgmerge.nim#L153
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/ndi.nim#L39
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/lib/system/fatal.nim#L34
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/syntaxes.nim#L42
https://github.com/nim-lang/Nim/blob/555cfd1d59e773e92a7158b7c3b5908017f1ddac/testament/testament.nim#L124
https://github.com/nim-lang/Nim/blob/bb1c962286922487c5243ca7304fd95fdd27ea53/lib/system/excpt.nim#L258
https://github.com/nim-lang/Nim/blob/bb1c962286922487c5243ca7304fd95fdd27ea53/lib/system/excpt.nim#L359
https://github.com/nim-lang/Nim/blob/662c5080755eb5a42df2c3acb84c044876571a46/compiler/llstream.nim#L111
https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/rodimpl.nim#L362
https://github.com/nim-lang/Nim/blob/226595515c25785eaf078834dfb6f0ac337a5278/compiler/rodimpl.nim#L25
https://github.com/nim-lang/Nim/blob/23d23ecb081be6702d74024be8f96d92d9f88a59/lib/system/io.nim#L486
https://github.com/nim-lang/Nim/blob/122f22d1632255cf7d0539c620a30155dc84fd69/lib/pure/asynchttpserver.nim#L299
... theres tons of it, just search newStringOfCap(
Hardcoding 80
is kinda frequent, but totally random and made up, and in my experience in practice always wrong by far.
It is compile time and simple logic, so mostly bug free. Works for newString
, newSeqOfCap
, etc.
Maybe a better algo is found on the future, and the proc can be improved without breaking api.
Maybe you can add extra "padding" if theres lots of RAM installed ?.
Worse case scenario is always better than a hardcoded arbitrary number, like a cache you hit the happy path frequently.
Looks good to me. Please convince Araq given a great and comprehensive benchmark or other supporting materials :)
refs https://github.com/nim-lang/Nim/pull/16207#discussion_r536478385
more efficient than
strip(leading = false)
or evenstrip
Note that inplace also makes sense without
leading = false
: it amounts to amemmv
in the case where there's a prefix to strip.