Find-based `replace` implementation

hiiamboris commented 1 year ago

Rationale

Current replace implementation was stiched together by different people with different small needs and no central design:

it switches from find to parse mode when /deep is used and in other hard to predict cases, totally changing semantics and disabling the possibility of /same comparison and fast hash lookups
usage of active values (functions), which works only in find mode, adds loop properties to it and makes a little step into map-each/mapparse territory
usage of change makes it of quadratic complexity, seriously limiting applicability
its internal logic is extremely hard to follow, so only tests can ensure some level of validity
some of the find features (/only, /same, /part) are not supported by it
most used mode /all is not the default one

See also https://github.com/red/red/issues/4945 previously ignored

Proposal

New implementation:

removes Parse from replace (Parse is left to mapparse)
removes any 'active' values from it - if value is a function, it's just inserted
adds support for /only, /same and /part by the means of apply https://github.com/red/REP/issues/83
makes /all the default mode, leaving single replacement to /once https://github.com/red/REP/issues/18
no /tail, as it's integrated into /once, but it's trivial to add it to the default (/all) mode if the need is shown

Arguments in favor of mapparse and against replace/parse refinement:

docstrings become very hard to write, ambiguous (main reason)
zero shared code
not only pattern, but also value changes meaning in /parse mode
replace expects a constant value (it's optimized for that), while mapparse evaluates a code block (how to tell /parse mode if block should be treated as code or as a value)?
in /parse mode: /same makes no sense, /only does not apply to pattern anymore
mapparse is consistent with forparse, and if we integrate it into replace, what to do with forparse?
forparse and mapparse are loops, so they support break & continue in their code blocks, so their logic is different
as a loop, mapparse has different argument order - same as the other loops (foreach, repeat)
extra refinement increases call length

Performance comparison

I have made two variants of the new implementation: replace2 is straightforward and compiler-friendly, while replace1 uses blocks of code to make the loop tighter (but they are interpreted). replace/all is the old implementation.

in place, big buffer
209 ms  [replace1 block 0 1]
106 ms  [replace2 block 1 2]
2486 ms [replace/all block 2 0]

length change, big buffer
216 ms  [replace1 block 0 [1 2]]
107 ms  [replace2 block 0 [1 2]]
7040 ms [replace/all block 0 [1 2]]

length change, mid buffer
421 ms  [loop 100 [replace1 block 0 [1 2] replace1 block [1 2] 0]]
214 ms  [loop 100 [replace2 block 0 [1 2] replace2 block [1 2] 0]]
318 ms  [loop 100 [replace/all block 0 [1 2] replace/all block [1 2] 0]]

length change, small buffer
672 ms  [loop 10000 [replace1 block 0 [1 2] replace1 block [1 2] 0]]
328 ms  [loop 10000 [replace2 block 0 [1 2] replace2 block [1 2] 0]]
254 ms  [loop 10000 [replace/all block 0 [1 2] replace/all block [1 2] 0]]

length change, mid buffer (hash)
2729 ms [loop 100 [replace1 hash 0 [1 2] replace1 hash [1 2] 0]]
2225 ms [loop 100 [replace2 hash 0 [1 2] replace2 hash [1 2] 0]]
27395 ms        [loop 100 [replace/all hash 0 [1 2] replace/all hash [1 2] 0]]

length change, small buffer (hash)
797 ms  [loop 10000 [replace1 hash 0 [1 2] replace1 hash [1 2] 0]]
504 ms  [loop 10000 [replace2 hash 0 [1 2] replace2 hash [1 2] 0]]
669 ms  [loop 10000 [replace/all hash 0 [1 2] replace/all hash [1 2] 0]]

init time
1064 ms [repeat i 100000 [replace1 block i - 1 i]]
455 ms  [repeat i 100000 [replace2 block i - 1 i]]
459 ms  [repeat i 100000 [replace/all block i - 1 i]]

Due to compiler benefits replace2 usually wins over replace1. And considering that get-words in paths cannot be compiled yet, and it includes workarounds for https://github.com/red/red/issues/5319 and https://github.com/red/red/issues/5320 , it should perform better once those kludges are removed. Compared to the old implementation, performance is much more predictable and linear, with worst case slowdown currently down to 23% due to allocation of a new buffer.

Tests

replace test suite adapted for the new version is here. Some tests were refitted with mapparse. Some currently fail due to https://github.com/red/red/issues/5321

greggirwin commented 1 year ago

Great stuff @hiiamboris.

if all [deep once] [cause-error 'script 'bad-refines []] ;-- incompatible

Isn't it possible that you'd want to replace a single value, anywhere in a deep structure? I agree that the normal case is /all with /deep, and this could be relaxed later if asked for. First use case I think of is that some data value shows up in an error scenario, and you want to test with the data to see if it happens in every case, or what related data there may be.

greggirwin commented 1 year ago

unless any-block? series [deep: off]

This is an interesting one. If we remove paren support from paths, they shouldn't be nestable, right?

greggirwin commented 1 year ago

Also on paths, are we sure we want this behavior?

>> replace1/deep copy [a b c [a/b/c]] 'b 'x
== [a x c [a/x/c]]

In some cases that will be exactly what we want, but should we always treat block/list the same?

greggirwin commented 1 year ago

Does size need to be dynamic?

;; pattern size will be found out after first match:
size: [size: offset? match find/:case/:same/:only/match/tail match :pattern]

It's niftier than size: either only [1][length? pattern], but even with the comment I had to scan and parse things a few times to figure it out as future references and do use are tricky. It is a nice use of fixed+dynamic refinements though.

greggirwin commented 1 year ago

Setup thought, just more straight-line:

start: series                   ;-- starting offset may be adjusted if part is negative
limit: system/words/case [
    none? limit [tail series]
    integer? limit [skip start limit]   ;-- convert limit to series, or will have to update it all the time
    ; no action if already a series.
]
if back?: negative? offset? start limit [   ;-- flip start/end if limit was negative, always go forward
    start: limit
    limit: series
]

greggirwin commented 1 year ago

I'm reminded that I don't love =?, preferring same?. I don't think it's a great symbol, given the meaning. === seems better if we're escalating comparison strictness, from = == ===, but even then I prefer same? for explicitness.

hiiamboris commented 1 year ago

Isn't it possible that you'd want to replace a single value, anywhere in a deep structure? I agree that the normal case is /all with /deep, and this could be relaxed later if asked for. First use case I think of is that some data value shows up in an error scenario, and you want to test with the data to see if it happens in every case, or what related data there may be.

This case sounds made up to me ;) But anyway, if replace/once returns position after the replacement, what should it return in deep mode? Another series? How do you continue your replacement from another series? This is the main issue.

unless any-block? series [deep: off]

This is an interesting one. If we remove paren support from paths, they shouldn't be nestable, right?

Paths constructed at runtime will still be.

Also on paths, are we sure we want this behavior?
>> replace1/deep copy [a b c [a/b/c]] 'b 'x
== [a x c [a/x/c]]
In some cases that will be exactly what we want, but should we always treat block/list the same?

Not sure, but paths are troublesome without it. Question is, is it worth another refinement, e.g. /paths?

There's also another direction here: to run deep replacement on all inner series - strings, binaries, vectors. Should we have control over it or we just forward everyone into mapparse with this?

Does size need to be dynamic?
;; pattern size will be found out after first match:
size: [size: offset? match find/:case/:same/:only/match/tail match :pattern]
It's niftier than size: either only [1][length? pattern], but even with the comment I had to scan and parse things a few times to figure it out as future references and do use are tricky. It is a nice use of fixed+dynamic refinements though.

Let it be your homework then to find cases where size: either only [1][length? pattern] doesn't work :) I'm using find because my head would explode from an attempt to figure out a reliable future-proof and concise way of manually keeping it in sync with find.

Setup thought, just more straight-line:

That may be less nested, but also heavier for the no-limit case.

I haven't by the way considered the past-tail index case here. I'd prefer a language-wide solution for it, rather than patching holes arbitrarily.

greggirwin commented 1 year ago

Good thoughts. More questions. :^\

greggirwin commented 1 year ago

That may be less nested, but also heavier for the no-limit case.

I'm all for profiling it in the grand scheme of replace. :^)

hiiamboris commented 1 year ago

I'm all for profiling it in the grand scheme of replace. :^)

OK (init time, sample size=10):

greggirwin commented 1 year ago

Cool, I can't tell what those numbers mean though, unless it makes the whole thing 10% slower, which is surprising.

hiiamboris commented 1 year ago

It's no surprise. offset? is a mezz.

greggirwin commented 1 year ago

OK then, just flip the blocks so the check is none? limit.

hiiamboris commented 1 year ago

Done.

greggirwin commented 1 year ago

Cool. How is =? better than none? here, for my edification, as it's non-idiomatic?

hiiamboris commented 1 year ago

Cool. How is =? better than none? here, for my edification, as it's non-idiomatic?
> clock/times [1 =? none] 1e7
0.12 μs [1 =? none]
> clock/times [none? 1] 1e7
0.25 μs [none? 1]

dockimbel commented 1 year ago

unless pos =? match

I stumbled for a couple seconds there, as I wondered what was =?... Seriously, everybody is using the much more readable same?, we should remove =? entirely, as it's really non-intuitive. You could simply write unless same? pos match...

Also:

>> clock/times [none =? limit] 1e7
0.06 μs [none =? limit]
== false
>> clock/times [not limit] 1e7
0.06 μs [not limit]

greggirwin commented 1 year ago

@hiiamboris, which was why I said " in the grand scheme of replace" initially. profile tells the story nicely.

+1 for removing =?, but then I feel the same about unless. Do you remember why Carl added it? :^)

red / REP