oberblastmeister opened 2 years ago
Could you please elaborate with type signatures and semantics of the suggested functions?
@Bodigrim
-- | O(1). Returns the char whose encoding starts at the given code unit index.
-- Panics if the index is not on a valid char boundary.
index :: HasCallStack => Int -> Text -> Char
indexMaybe :: Int -> Text -> Maybe Char

-- | O(1). Returns the slice of the text between the given start and end code units.
-- Panics if the indices are not on valid char boundaries.
slice :: HasCallStack => Int -> Int -> Text -> Text
sliceMaybe :: Int -> Int -> Text -> Maybe Text
Detecting whether a byte is a char boundary can be done with b < 128 || b >= 192.
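For concreteness, here is a minimal sketch of the intended semantics of indexMaybe, using the byte test above. It round-trips through encodeUtf8/decodeUtf8', so it is O(n) rather than the O(1) implementation being proposed, and isCharBoundary is a hypothetical helper, not part of the text API; it only pins down the behaviour.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import           Data.Word (Word8)

-- In UTF-8, a byte starts a code point iff it is not a continuation byte,
-- i.e. it is below 128 (0x80) or at least 192 (0xC0).
isCharBoundary :: Word8 -> Bool
isCharBoundary b = b < 128 || b >= 192

-- Semantic model only: a real implementation would read the Char straight
-- out of the underlying UTF-8 buffer instead of re-encoding the whole Text.
indexMaybe :: Int -> Text -> Maybe Char
indexMaybe i t
  | i < 0 || i >= B.length bs           = Nothing  -- out of range
  | not (isCharBoundary (B.index bs i)) = Nothing  -- inside a code point
  | otherwise =
      -- i is a valid boundary, so the suffix decodes and is non-empty
      either (const Nothing) (fmap fst . T.uncons)
             (TE.decodeUtf8' (B.drop i bs))
  where
    bs = TE.encodeUtf8 t
```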
I’m not a fan of partial APIs, to be honest.
Isn’t Data.Text.Foreign enough for your purposes?
Well, we already have a lot of partial functions. Libraries like bytestring and vector have similar functions. Functions that return Maybe can be used too. Also, I think O(1) indexing is important. Lots of algorithms on text are implemented naturally with indices, and will return indices, such as text search. Currently, this can't be done because there is no O(1) indexing. Also, I don't think that silently ignoring incorrect indices is better than a partial function.
Maybe a compromise is to add this to Data.Text.Unsafe, and in the interest of promoting total functions, only the Maybe version(s). Although technically total, it still breaks the level of abstraction of "text as a sequence of Char". This isn't a rule currently but I'm suggesting that it could become one for future additions.
Lots of algorithms on text are implemented naturally with indices, and will return indices, such as text search.
I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?
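As a rough illustration of the kind of forward scan this enables (a sketch, assuming the signature spanM :: Monad m => (Char -> m Bool) -> Text -> m (Text, Text)), here is a stateful scan that splits a Text at the first unbalanced closing parenthesis, without ever looking at UTF-8 offsets:

```haskell
import           Control.Monad.Trans.State (State, evalState, get, modify)
import           Data.Text (Text)
import qualified Data.Text as T

-- Take characters until the parentheses seen so far become unbalanced,
-- returning the well-formed prefix and the rest of the text.
breakAtUnbalanced :: Text -> (Text, Text)
breakAtUnbalanced t = evalState (T.spanM step t) (0 :: Int)
  where
    step :: Char -> State Int Bool
    step c = do
      depth <- get
      case c of
        '(' -> modify (+ 1) >> pure True
        ')' | depth == 0 -> pure False                 -- stop at the unbalanced ')'
            | otherwise  -> modify (subtract 1) >> pure True
        _   -> pure True
```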
it still breaks the level of abstraction of "text as a sequence of Char".
How does this break the level of abstraction and why does that matter? We are indexing by code units which make up a Char. If they are in the middle of a Char, then it is invalid. Furthermore, Text is fundamentally not a sequence of Char, as it is a sequence of unicode scalar values. I don't see the point in simplifying Text to a sequence of Char.
Furthermore, Text is fundamentally not a sequence of Char, as it is a sequence of unicode scalar values. I don't see the point in simplifying Text to a sequence of Char.
OK there's a subtlety there but that's beyond my point. Replace what I said with "index breaks the level of abstraction of 'text as a sequence of unicode scalar values'" and I will stand by it.
How does this break the level of abstraction and why does that matter?
If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance). Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.

It is an arbitrary and fairly strong restriction, but it at least provides a well-defined threshold to control the growth of the Data.Text API. It also puts the onus on users who would want to import or reimplement index to convince themselves or their managers that what they want to do cannot be done using the safe and well-abstracted API.
@oberblastmeister why are https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 not sufficient for your purposes?
I think those functions would be good. I understand now that Text should be representation agnostic. Thank you for explaining it to me.
I thought about this for a while, and I am reopening because I think this is useful and I have also thought of a good compromise.
First, this function is useful, especially for compilers, parsers, and text processing algorithms. In a compiler, storing only the start and end utf8 indices can be much more efficient than storing lines and columns or something else. It is also flexible, because utf8 indices can be converted to other positions easily and efficiently. This is also useful for parsers, because tracking only byte-level indices and converting them into other positions afterwards is much faster. For example, I made a parser combinator library that is 8-10x faster than attoparsec and megaparsec, and it also uses byte-level indices.
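To illustrate the "convert byte offsets to other positions" point, here is a rough sketch (not part of any existing or proposed API) that turns a UTF-8 byte offset into a zero-based (line, column) pair. It goes through encodeUtf8, so it is O(offset) rather than something a compiler would use verbatim, but it shows why byte offsets are enough to recover other positions on demand:

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

byteOffsetToLineCol :: Int -> Text -> Maybe (Int, Int)
byteOffsetToLineCol off t
  | off < 0 || off > B.length bs = Nothing
  | otherwise = do
      -- decodeUtf8' fails exactly when off falls inside a code point
      prefix <- either (const Nothing) Just (TE.decodeUtf8' (B.take off bs))
      let ls = T.split (== '\n') prefix
      pure (length ls - 1, T.length (last ls))
  where
    bs = TE.encodeUtf8 t
```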
I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?
spanM is good, but doesn't cover use cases that are only a little more complex. Having indices allows one to move in any direction they like, while spanM can only go forwards. If you want to write your own text search algorithm, you cannot use spanM.
@oberblastmeister why are https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 not sufficient for your purposes?
~Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception.~ By contrast, utf8 indexing returns a Maybe, which means that there is no unsafety. If we were to add this function, it should not be placed in an unsafe module, because nothing about it is unsafe.
If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance).
Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.
To solve this, we should name the function sliceUtf8 and put it in a specialized module such as Data.Text.Encoding. No abstraction is broken, because the indexing function is not in the main API, and it is explicitly marked as utf8. If text decides to change its internal encoding, the function should still work, albeit running slower. There is no reason why sliceUtf8 should break the abstraction of text any more than decodeUtf8 does. In fact, you can already implement sliceUtf8 with decodeUtf8, encodeUtf8, and bytestring indexing.
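For illustration, here is a hedged sketch of that fallback implementation, written exactly that way with encodeUtf8, decodeUtf8', and ByteString slicing. It has the Maybe semantics proposed above but is O(n); the point of adding sliceUtf8 to text would be to do the same thing in O(1) on the internal buffer:

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE

sliceUtf8 :: Int -> Int -> Text -> Maybe Text
sliceUtf8 start end t
  | 0 <= start && start <= end && end <= B.length bs =
      -- decodeUtf8' fails exactly when start or end falls inside a code point
      either (const Nothing) Just
        (TE.decodeUtf8' (B.take (end - start) (B.drop start bs)))
  | otherwise = Nothing
  where
    bs = TE.encodeUtf8 t
```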
Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception.
What's unsafe w.r.t. takeWord8 and dropWord8?
I should have been more careful, because they are actually not unsafe. However, I think the interface is not ergonomic and the semantics are weird. Slicing is a more common operation than taking and dropping; I can't think of a use case where I would want taking and dropping over slicing. Silently accepting invalid utf8 indices seems dubious, and it is also probably slower than just checking whether the given index is valid. Also, the Foreign module seems to be semi-internal, which doesn't seem to be good to rely on. It's not clear what should happen if text changes its internal representation.
I am interested in making a PR for this. We would have to check that the index is on a valid char boundary. Partial and non-partial variants would be provided.