mkitti / StaticStrings.jl

Fixed-length strings in Julia represented by NTuples
MIT License
13 stars 2 forks source link

too many types? #1

Closed stevengj closed 2 years ago

stevengj commented 2 years ago

Having so many tuple-based types seems unnecessarily complicated?

I would just tend to recommend a single type:

struct TupleString{N} <: AbstractString
    size::Int # ≤ N
    data::NTuple{N,UInt8}
end

representing a size-len (the number of UTF-8 code units) string stored in a tuple of length N, with the additional bytes set to NUL. This seems like it should cover the use-cases.

You don't want to store the length in the type (as in your NMString), because you want containers of TupleString to be concretely typed. For example, imagine an array of TupleString, or a TupleString field in a struct mirroring a C struct (where N is fixed but the length varies).

(In which case you could call the package TupleStrings)

stevengj commented 2 years ago

Or maybe have two types:

abstract type AbstractTupleString{N} <: AbstractString

struct TupleString{N} <: AbstractTupleString{N}
    size::Int # ≤ N
    data::NTuple{N,UInt8}
end

struct TupleCString{N} <: AbstractTupleString{N}
    data::NTuple{N,UInt8} # NUL-terminated codeunits
end

since for storing in a C-mirrored struct you wouldn't want the size field.

simonbyrne commented 2 years ago

I too would propose a more descriptive name for the package. My proposal would be StaticStrings.jl: since it is in some sense the string analog of StaticArrays.jl.

I also agree with @stevengj that you don't want to the string length in the type domain.

Finally, since part of the motivation is for HDF5 support, it might also be worth considering a space-padded type: https://portal.hdfgroup.org/display/HDF5/H5T_SET_STRPAD

simonbyrne commented 2 years ago

Finally, I would also note that if you don't store the number of code units, but enforce nul-padding (instead of just a single nul-terminating bit), then you can use binary search to find the number of code units in O(log(N)) time, instead of O(N) time.

simonbyrne commented 2 years ago

Also, it's not clear you actually need the number of code units that often?

I would propose just 2 types:

# stores at most N-1 code units, with remaining entries being nul
struct StaticString{N}
   data::NTuple{N,UInt8}
end

# stores at most N code units, with remaining entries being spaces
struct StaticFortranString{N}
   data::NTuple{N,UInt8}
end
mkitti commented 2 years ago

Julia strings can have NULs in them.

julia> str = "This is a Julia string with multiple NUL bytes \0\0\0\0"
"This is a Julia string with multiple NUL bytes \0\0\0\0"

julia> Tuple(codeunits.(str))
(0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x4a, 0x75, 0x6c, 0x69, 0x61, 0x20, 0x73, 0x74, 0x72, 0x69, 0x6e, 0x67, 0x20, 0x77, 0x69, 0x74, 0x68, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70, 0x6c, 0x65, 0x20, 0x4e, 0x55, 0x4c, 0x20, 0x62, 0x79, 0x74, 0x65, 0x73, 0x20, 0x00, 0x00, 0x00, 0x00)

julia> NString(Tuple(codeunits.(str))) == str
true

Thus I think the main type, currently NString should have the following properties:

  1. Has exactly N codeunits: ncodeunits(nstring::NString{N}) where N = N
  2. Can represent any Julia string
  3. Can contain NUL bytes
  4. Has the length calculated via binary search.

The Base.length implementation is copy and pasted from Base in the tradition of InlineStrings: https://github.com/mkitti/NStrings.jl/blob/26ab95efa87b7acbc3a1d9a20b494f9dec1a1660/src/strings.jl#L138-L174

mkitti commented 2 years ago
struct TupleString{N} <: AbstractTupleString{N}
    size::Int # ≤ N
    data::NTuple{N,UInt8}
end

What is size here? Is length(s::TupleString) = s.size? Or is ncodeunits(s::TupleString) = s.size?

stevengj commented 2 years ago

Julia strings can have NULs in them.

Yes, a NUL-terminated string is not completely general. However, for the typical applications of static-storage strings, I'm not sure this matters?

But that's why I suggested two types — one with the length, and one that is NUL-terminated.

struct StaticFortranString{N}

How often does this actually come up in a context where people actually care about efficiency (i.e. they can't just do a conversion to String or whatever)? Fortran isn't exactly famous for its string-handling libraries.

stevengj commented 2 years ago

Finally, since part of the motivation is for HDF5 support, it might also be worth considering a space-padded type:

You could make the padding value a type parameter, I guess — this is something that should be known statically.

stevengj commented 2 years ago

I'm fine with StaticStrings, by the way.

You could have, for example:

struct StaticString{N,pad} <: AbstractString
    data::NTuple{N,UInt8} # stores up to N-1 codeunits followed by pad (repeated)
end

const StaticCString = StaticString{N,0x00} # nul-terminated
const StaticFString = StaticString{N,0x20} # space-terminated
simonbyrne commented 2 years ago

I admit I was a bit vague above. There are 3 "lengths" we care about:

mkitti commented 2 years ago

Just to be clear about the provenance of the length code, it is extracted from Julia base and mainly written by @StefanKarpinski. InlineStrings.jl also copies code from Base.

https://github.com/JuliaLang/julia/blob/f1c4d54696c2d296a10a8f8d659f6a3e85da9a29/base/strings/string.jl#L287-L323

It would probably be a good idea to generalize those methods in Base to avoid this copying.

mkitti commented 2 years ago

@stevengj TupleString proposes making it a field in the struct. This is type stable, but means that we can't use the type as a field in a C-compatible struct, or as a HDF5 fixed-string type.

We might be able to use the size if it was the second field.