Araq opened this issue 4 years ago.
I think slices should be of type `lent` and should copy on write. For mutable views maybe a mutable equivalent of `lent` called `view` should be introduced.
We could also do this: in all assignments `let a = b` where `a` doesn't escape the stack frame, make `a` `lent`. If we were to later on pass `a` to a proc that takes it as a `sink` or normal parameter, we could detect that and convert `let a = b` to a `=` or `=sink` assignment. (Maybe only when it can be converted to a `=sink` assignment? And in all other cases COW instead?) Then we would benefit from parameters being `lent` if possible, so maybe some kind of inference, as we do for `sink` now, would be possible.
I agree with @Clyybber: before going all the way to COW we need to use the sink/move/lent optimizations.

Second, there are 2 performance issues with strings:

I'd argue that COW is solving the wrong problem (problem 1). The copies are often the result of a slice that stays in the same stack frame and can benefit from `lent` and/or openarray as values (#88).

And the most important performance issue is intermediate allocations. `dup` is a step in the right direction. We need a way to compose string transformations that minimizes allocations, hence re-uses buffers. It can be zero-functional inspired, it can be state-machine based, it can be C++ or D ranges inspired.

COW doesn't solve the problem of allocating intermediate results.

Sources (i.e. a benchmark should be inspired by the kind of code people are trying to write here):
Let me say it differently: `--gc:arc` already has COW, but limited. It seems to me there is no additional runtime overhead to making it more general. The cost is a refcount per string, one machine word.
It's a very basic type; it should be as simple as possible, given that it will be used in many contexts including embedded, async and multithreading. Does the refcount become atomic with `--threads:on`?
As simple as possible is covered by `openArray[char]`, which is C++'s `string_view` in all but name.
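To make the comparison concrete, here is a minimal C++17 sketch (the `slice` helper is just for illustration) of what `string_view`-style slicing means: a slice is only a pointer/length pair into an existing buffer, so it never allocates.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Slicing a string_view just adjusts a pointer/length pair: O(1), no copy,
// no ownership -- the view must not outlive the buffer it points into.
std::string_view slice(std::string_view s, std::size_t first, std::size_t len) {
    return s.substr(first, len);  // string_view::substr does not allocate
}
```

An `openArray[char]` parameter gives Nim the same zero-copy behaviour, with the same caveat: the view must not outlive the string that owns the buffer.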
> Does the refcount become atomic with `--threads:on`?

Unlikely, we'll probably keep doing `=deepCopy` for threading.
Wouldn't copy-on-write be addressed better in the type system instead of through run-time flags?
A string literal can have a different type that is implicitly convertible to an immutable string, but also triggers a copy when being passed as a `var string` parameter.
With that said, I like the idea of allowing the allocated buffer of strings and sequences to be shared and reference-counted without introducing an indirection. Something along the lines of C++'s `allocate_shared` comes to mind, where the ref count appears in the memory just before the normal string value.
The holy grail in the strings vs sequences debate for me is figuring out a way to make `string` the same type as `seq[byte]` (or merely distinct types). The fact that they are different types can lead to a lot of unnecessary copying of buffers in networking and serialization libraries. For the uninitiated, the obvious problem here is that the null terminator of strings is required for the C interop. I dream of a solution where the null terminator will be inserted on demand when you trigger the `cstring` conversion, but then the question is how do you ensure that there is a place for it?

One possible answer is that the `cstring` conversion is a place where you introduce a copy of the string if its buffer cannot be patched in place with a zero past the last character, but the other, more crazy, solution is that you leave something at the end of the string allocation cell (e.g. the ref count or the flag we are discussing here). If you have left just one byte there, you can temporarily turn it into a null terminator for the duration of the `cstring` call.
Combining these two ideas, the ref counted strings can store their ref count at the end of the allocated cell for the buffer. The price you'll pay is that the string capacity is slightly decreased. You can check the capacity value to recognize the type of string you are working with, so you can support transitions between the ref-counted mode and the single-owner mode.
> Combining these two ideas, the ref counted strings can store their ref count at the end of the allocated cell for the buffer.

That seems to be very bad for locality / CPU caches.
> For the uninitiated, the obvious problem here is that the null terminator of strings is required for the C interop.
In practice, for many programs the IO primitives offered by C often have APIs that take `char* data, size_t len` pairs and so don't need the null terminator. The most common exception is filenames, and both Rust and C++ use dedicated, non-string types for `Path`. Then `Path` can be zero-terminated under the hood, whereas `cstring(nimstr)` can be more expensive and do the alloc+copy for the zero terminator.

Given the other obvious benefits (type safety!), maybe it's time to replace the `os` and `io` subsystems with something better.
Well, I'll be happy if `seq[byte]` and `string` get the same run-time representation, one way or another.

Otherwise, to explain the idea above more precisely, the cache-locality argument shouldn't be that bad, because:

A) We can consider the ref-counting case as rare and not worth optimising. To achieve this we only need a fast check for the question "Is this string ref-counted?". We can do this by checking a single bit in the `capacity` value (all normal allocations will have even `capacity` values because the memory allocator works with such cells, while the ref-counted ones will use odd `capacity` values). Please notice how in this scheme it's always possible to switch from the ref-counted mode to normal ownership, and the entire capacity is recovered when you do this. A read of the `capacity` field is not a problem for cache locality, so this problem is gone.

B) As SSO proclaims, most strings are small (smaller than the cache line of a modern processor, which is likely to continue increasing). Thus, quite often, the ref count will also be in the CPU cache anyway.
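As a rough illustration of the odd-capacity tagging scheme described in A) (names are hypothetical, not from the Nim runtime), both the check and the mode switch are single bit operations on the `capacity` field:

```cpp
#include <cstddef>

// Hypothetical string header: the allocator always hands out even
// capacities, so bit 0 of `cap` is free to mean "buffer is ref-counted".
struct StrHeader {
    std::size_t cap;  // real capacity, with bit 0 used as a tag
};

// Fast check: one bit test on a field we read anyway.
bool isRefCounted(const StrHeader& h) { return (h.cap & 1) != 0; }

// The usable capacity, with the tag bit masked off.
std::size_t realCapacity(const StrHeader& h) { return h.cap & ~std::size_t(1); }

// Switching back to single-owner mode just clears the tag bit;
// the entire capacity is recovered.
void makeSingleOwner(StrHeader& h) { h.cap &= ~std::size_t(1); }
```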
I had a read; it feels that `sink T`, `lent T`, `var T` and `openarray[T]`, combined with https://github.com/nim-lang/RFCs/issues/178, are a more generic solution to the optimization/performance problems.
It does feel odd that `string` becomes `cow` while the rest of the types don't. Isn't it better to have a generic `cow` solution that can turn any type into a `cow` one if the user wants it?
> I dream of a solution where the null terminator will be inserted on-demand when you trigger the cstring conversion, but then the question is how do you ensure that there is a place for it?
C++ dreamed of that too, with `c_str` being a `const` method that nonetheless mutated the string in some libraries. These strategies quickly ran into issues, however: strings could no longer be consts (i.e. stored in the read-only section of the executable), and they violated the signature of the function, causing trouble downstream for optimizations and for analysis tools that now couldn't make other optimizations. One step forward, two steps back.
Instead of making `string` and `seq[byte]` the same, it seems like one could exploit their differences somehow instead; after all, why go through all the trouble of having two types if you end up making them the same?
Any COW solution doesn't just introduce one word of refcount overhead: it also forces every operation to maintain that refcount and introduces complex / costly operations in unexpected places, i.e. `mystring[4] = 'c'` is suddenly a potential O(N) operation, which is pretty nuts.
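To see where the O(N) comes from, here is a toy COW string in C++ (entirely illustrative, not Nim's layout): the copy constructor is O(1), but any mutation must first check whether the buffer is shared and, if so, deep-copy it.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Toy COW string: the buffer is shared between copies; any write must
// "unshare" (deep-copy) the buffer first.
struct CowStr {
    std::shared_ptr<std::string> buf;

    explicit CowStr(std::string s)
        : buf(std::make_shared<std::string>(std::move(s))) {}

    // O(1) copy: just bumps the shared refcount.
    CowStr(const CowStr&) = default;

    // Mutation: if the buffer is shared, deep-copy it first -> O(N).
    void setChar(std::size_t i, char c) {
        if (buf.use_count() > 1)                        // the hidden check
            buf = std::make_shared<std::string>(*buf);  // full O(N) copy
        (*buf)[i] = c;
    }
};
```

Copying is cheap, but the first `setChar` on a shared copy pays for a full buffer copy; that is the hidden cost behind an innocent-looking `mystring[4] = 'c'`.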
SSO is efficient for several reasons:
Borrowing is indeed the systemic solution to this problem. If it was introduced as an optimization, it could analyze away copies, so the cost to the programmer in terms of having to deal with syntax would be kept low; the way to approach the problem then would be to design the string to be more amenable to this kind of analysis by constraining what you can do with it.
`cstring(nimstr)` reallocating here opens up a problematic memory-safety issue as well: who owns the memory of `proc x(s: string): cstring = cstring(s)`?
> strings can no longer be consts (ie stored in the readonly section of the executable)

It's trivial to make the const-section strings zero-terminated.
> violated the signature of the function

With `const_cast` existing in the language, I cannot see how any tool can rely on the signature to enable or disable certain optimisations.
> it seems like one could exploit their differences

There aren't enough differences to justify the code bloat and the buffer copies needed in conversions that stem from having two different types.
> reallocating here opens up a problematic memory safety issue as well - who owns the memory of `proc x(s: string): cstring = cstring(s)`?

This memory can certainly be freed if you consider the code to be equivalent to:

```nim
callSomeCFunction(TempCString(data: createCStringCopy(nimString)))
```

where `TempCString` is a type with a destructor that will deallocate the memory. If we follow C++ rules, the destructor will be called at the semicolon of the enclosing expression (i.e. just after `callSomeCFunction` completes).
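A C++ sketch of that idea (all names hypothetical): a temporary that owns a zero-terminated copy of the string and frees it in its destructor, which under C++ rules runs at the end of the enclosing full expression.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical temporary wrapper: owns a zero-terminated copy of a string
// and releases it in its destructor. When used as a temporary, the copy
// lives exactly until the end of the enclosing full expression.
class TempCString {
    char* data_;
public:
    explicit TempCString(const std::string& s) : data_(new char[s.size() + 1]) {
        std::memcpy(data_, s.data(), s.size());
        data_[s.size()] = '\0';  // the terminator C APIs expect
    }
    ~TempCString() { delete[] data_; }
    TempCString(const TempCString&) = delete;
    TempCString& operator=(const TempCString&) = delete;
    const char* get() const { return data_; }
};

// Stand-in for any C API taking a null-terminated string.
std::size_t callSomeCFunction(const char* p) { return std::strlen(p); }
```

Usage: `callSomeCFunction(TempCString(str).get())` — the copy is guaranteed to outlive the call and is freed right after it returns.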
So here is a working implementation https://github.com/planetis-m/dumpster/blob/master/cow/cowstrings.nim in case anyone wants to benchmark it.
And here is a benchmark of std.strings/cowstrings/ssostrings. https://github.com/planetis-m/ssostrings/blob/master/tests/bench1.nim
Also relevant: http://www.gotw.ca/publications/optimizations.htm
Motivation: Make naive string handling code faster
In the ARC/ORC implementation string literals are already not copied and not destroyed. When these strings are potentially mutated, `nimPrepareStrMutationV2` is called to copy the constant string into a buffer that allows for mutations. Currently the condition `(s.p.cap and strlitFlag) == strlitFlag` is used to determine whether `s.p` points to a constant string. In Delphi/FPC a refcount of -1 is used instead to allow for a more general COW mechanism. This RFC proposes to copy this design, as much everyday Nim code is written without the current expensive copies in mind, e.g.:
This is particularly troublesome for `async` procs, where the snippet `env.field = param` is introduced so that parameters can be captured inside closures. However, a `sink` parameter probably has the same benefit. As a side effect, Nim's ARC strings can support the old `GC_ref`/`GC_unref` calls.

Expected compat problems
It means that `cast[seq[char]](mystr)` stops working. We can detect this case in the compiler, and fusion / compat can introduce `castToSeq`/`castToString` cast operations that work for all runtimes.

Alternatives/additions considered but rejected (for now)
Make Nim's strings use C++'s SSO
SSO looks like a rather fragile, questionable optimization: copying short strings around doesn't involve touching the heap, yet long strings are as slow as before to copy? And almost every operation internally needs to distinguish between "is this a long or a short string?". Also, in Nim filenames/paths are currently not a dedicated type and are directly stored as `string` too; most paths are longer than what is covered by the "short string" representation.
Make Nim's string slices O(1)
People coming from Go (and for some reason also people coming from Python) expect `substr` and `s[1..3]` to be faster than they really are. In the old runtime this was even worse, as string allocations caused the occasional GC run. In the current new runtime the allocations remain. If the string implementation is changed to a representation that carries a reference count, the slice operation can be done in O(1) time. However, every slice creation would cause an `inc(s.p.refcount)` operation and also involve a corresponding destruction step. This probably means that it's not fast enough in practice, and so our effort is better spent on using more `openArray[char]` with borrowing rules instead. This also has the benefit that short string slices cannot keep a multi-megabyte string buffer alive for too long; a problem big enough in the JVM world that they moved from O(1) slicing to O(N) slicing in a patch release.

Implementation effort
`strs_v2.nim`