tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl
78 stars 4 forks source link

Validate uses of AbstractString and string indices #24

Closed jakobnissen closed 1 year ago

jakobnissen commented 1 year ago

I noticed that at least a few of your functions that take AbstractString can't actually deal with AbstractString, and/or does not handle non-ASCII strings, for example:

julia> parse(Identifier, Test.GenericString("a"))
ERROR: ArgumentError: regex matching is only available for the String type; use String(s) to convert

julia> DataToolkitBase.stringdist("ÆaÆb", "ÆaÆb")
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'Æ', [3]=>'a'

In general, I've found that

tecosaur commented 1 year ago

Hi Jakob, thanks for taking a look and making this issue. This project is beginning to move onto the testing/documenting/tweaking phase, so your comments are particularly appreciated. I'll see what I can do to make those functions better behaved.

If you might have any other comments, please do share them. I'd really like this project to be as well-constructed as I can make it, and so far I'm the only one who's really looked into the code, so I'm very keen on feedback.

jakobnissen commented 1 year ago

I really just noticed it when I copy-pasted your stringdist function to test in the context of a comment of yours. I don't use your package, just wanted to let you know

tecosaur commented 1 year ago

Yea, I'm pretty sure at this stage I'm basically the only user of this package :stuck_out_tongue:, I really want to avoid locking in bad design decisions though, which is why I'm desperate keen to get more eyes on the code.

Just mentioning this in case anything might come of it, but I understand this is a drive-by issue :slightly_smiling_face:

tecosaur commented 1 year ago

The utils.jl string functions should be multi-codepoint-safe now.