Closed stevengj closed 2 years ago
Or maybe have two types:
abstract type AbstractTupleString{N} <: AbstractString
struct TupleString{N} <: AbstractTupleString{N}
size::Int # ≤ N
data::NTuple{N,UInt8}
end
struct TupleCString{N} <: AbstractTupleString{N}
data::NTuple{N,UInt8} # NUL-terminated codeunits
end
since for storing in a C-mirrored struct you wouldn't want the size
field.
I too would propose a more descriptive name for the package. My proposal would be StaticStrings.jl: since it is in some sense the string analog of StaticArrays.jl.
I also agree with @stevengj that you don't want to the string length in the type domain.
Finally, since part of the motivation is for HDF5 support, it might also be worth considering a space-padded type: https://portal.hdfgroup.org/display/HDF5/H5T_SET_STRPAD
Finally, I would also note that if you don't store the number of code units, but enforce nul-padding (instead of just a single nul-terminating bit), then you can use binary search to find the number of code units in O(log(N)) time, instead of O(N) time.
Also, it's not clear you actually need the number of code units that often?
I would propose just 2 types:
# stores at most N-1 code units, with remaining entries being nul
struct StaticString{N}
data::NTuple{N,UInt8}
end
# stores at most N code units, with remaining entries being spaces
struct StaticFortranString{N}
data::NTuple{N,UInt8}
end
Julia strings can have NULs in them.
julia> str = "This is a Julia string with multiple NUL bytes \0\0\0\0"
"This is a Julia string with multiple NUL bytes \0\0\0\0"
julia> Tuple(codeunits.(str))
(0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x4a, 0x75, 0x6c, 0x69, 0x61, 0x20, 0x73, 0x74, 0x72, 0x69, 0x6e, 0x67, 0x20, 0x77, 0x69, 0x74, 0x68, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70, 0x6c, 0x65, 0x20, 0x4e, 0x55, 0x4c, 0x20, 0x62, 0x79, 0x74, 0x65, 0x73, 0x20, 0x00, 0x00, 0x00, 0x00)
julia> NString(Tuple(codeunits.(str))) == str
true
Thus I think the main type, currently NString
should have the following properties:
N
codeunits: ncodeunits(nstring::NString{N}) where N = N
NUL
bytesThe Base.length
implementation is copy and pasted from Base
in the tradition of InlineStrings:
https://github.com/mkitti/NStrings.jl/blob/26ab95efa87b7acbc3a1d9a20b494f9dec1a1660/src/strings.jl#L138-L174
struct TupleString{N} <: AbstractTupleString{N} size::Int # ≤ N data::NTuple{N,UInt8} end
What is size
here? Is length(s::TupleString) = s.size
? Or is ncodeunits(s::TupleString) = s.size
?
Julia strings can have NULs in them.
Yes, a NUL-terminated string is not completely general. However, for the typical applications of static-storage strings, I'm not sure this matters?
But that's why I suggested two types — one with the length, and one that is NUL-terminated.
struct StaticFortranString{N}
How often does this actually come up in a context where people actually care about efficiency (i.e. they can't just do a conversion to String
or whatever)? Fortran isn't exactly famous for its string-handling libraries.
Finally, since part of the motivation is for HDF5 support, it might also be worth considering a space-padded type:
You could make the padding value a type parameter, I guess — this is something that should be known statically.
I'm fine with StaticStrings, by the way.
You could have, for example:
struct StaticString{N,pad} <: AbstractString
data::NTuple{N,UInt8} # stores up to N-1 codeunits followed by pad (repeated)
end
const StaticCString = StaticString{N,0x00} # nul-terminated
const StaticFString = StaticString{N,0x20} # space-terminated
I admit I was a bit vague above. There are 3 "lengths" we care about:
Base.length
: the number of code points (Char
s). This will always be implicit (I don't think there is any reason to store this)
Base.ncodeunits
: the number of code units
TupleString
proposes making it a field in the struct. This is type stable, but means that we can't use the type as a field in a C-compatible struct, or as a HDF5 fixed-string type.
TupleCString
, my StaticString
: make it implicit via a nul terminator, or nul padding
n^2-1
, up to a maximum of 255.Just to be clear about the provenance of the length
code, it is extracted from Julia base and mainly written by @StefanKarpinski. InlineStrings.jl also copies code from Base.
It would probably be a good idea to generalize those methods in Base to avoid this copying.
@stevengj
TupleString
proposes making it a field in the struct. This is type stable, but means that we can't use the type as a field in a C-compatible struct, or as a HDF5 fixed-string type.
We might be able to use the size if it was the second field.
Having so many tuple-based types seems unnecessarily complicated?
I would just tend to recommend a single type:
representing a size-
len
(the number of UTF-8 code units) string stored in a tuple of lengthN
, with the additional bytes set toNUL
. This seems like it should cover the use-cases.You don't want to store the length in the type (as in your
NMString
), because you want containers ofTupleString
to be concretely typed. For example, imagine an array ofTupleString
, or aTupleString
field in astruct
mirroring a C struct (whereN
is fixed but the length varies).(In which case you could call the package
TupleStrings
)