haskell / text

Haskell library for space- and time-efficient operations over Unicode text.
http://hackage.haskell.org/package/text
BSD 2-Clause "Simplified" License

O(1) indexing on bytes #443

Open oberblastmeister opened 2 years ago

oberblastmeister commented 2 years ago

I am interested in making a PR for this. We would have to check that the index is on a correct char boundary. Both partial and non-partial variants would be provided.

Bodigrim commented 2 years ago

Could you please elaborate with type signatures and semantics of suggested functions?

oberblastmeister commented 2 years ago

@Bodigrim

-- | O(1). Returns the char whose first code unit is at the given index.
-- Panics if the index is on an invalid char boundary.
index :: HasCallStack => Int -> Text -> Char
indexMaybe :: Int -> Text -> Maybe Char

-- | O(1). Returns a slice of the text between the given start and end code units.
-- Panics if the indices are on invalid char boundaries.
slice :: HasCallStack => Int -> Int -> Text -> Text
sliceMaybe :: Int -> Int -> Text -> Maybe Text

Detecting whether a byte is on a char boundary can be done with b < 128 || b >= 192.
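For instance, a tiny sketch of that check (isCharBoundary is just a hypothetical helper, not something in the text API):

import Data.Word (Word8)

-- A byte begins a code point exactly when it is not a UTF-8 continuation
-- byte (continuation bytes are 128-191).
isCharBoundary :: Word8 -> Bool
isCharBoundary b = b < 128 || b >= 192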

Bodigrim commented 2 years ago

I’m not a fan of partial APIs, to be honest.

Isn’t Data.Text.Foreign enough for your purposes?

oberblastmeister commented 2 years ago

Well, we already have a lot of partial functions, and libraries like bytestring and vector provide similar functions. Functions that return Maybe can be used too. I also think O(1) indexing is important. Lots of algorithms on text are implemented naturally with indices, and will return indices, such as text search. Currently this can't be done because there is no O(1) indexing. And I don't think that silently ignoring incorrect indices is better than a partial function.

Lysxia commented 2 years ago

Maybe a compromise is to add this to Data.Text.Unsafe, and in the interest of promoting total functions, only the Maybe version(s). Although technically total, it still breaks the level of abstraction of "text as a sequence of Char". This isn't a rule currently but I'm suggesting that it could become one for future additions.

Lots of algorithms on text are implemented naturally with indices, and will return indices, such as text search.

I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?
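For illustration, a rough sketch of the kind of stateful scan this enables without exposing byte offsets (the helper and its name are made up; I'm assuming spanM :: Monad m => (Char -> m Bool) -> Text -> m (Text, Text)):

import qualified Data.Text as T
import Control.Monad.Trans.State (evalState, get, put)

-- Span up to (but not including) the second ':', a stateful stopping
-- condition that the pure span cannot express, returning well-formed
-- Text slices on both sides.
splitAtSecondColon :: T.Text -> (T.Text, T.Text)
splitAtSecondColon t = evalState (T.spanM keepGoing t) (0 :: Int)
  where
    keepGoing c = do
      seen <- get
      let seen' = if c == ':' then seen + 1 else seen
      put seen'
      pure (seen' < 2)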

oberblastmeister commented 2 years ago

it still breaks the level of abstraction of "text as a sequence of Char".

How does this break the level of abstraction, and why does that matter? We are indexing by code units, which make up a Char; an index that falls in the middle of a Char is invalid. Furthermore, Text is fundamentally not a sequence of Char, as it is a sequence of unicode scalar values. I don't see the point in simplifying Text to a sequence of Char.

Lysxia commented 2 years ago

Furthermore, Text is fundamentally not a sequence of Char, as it is a sequence of unicode scalar values. I don't see the point in simplifying Text to a sequence of Char.

OK, there's a subtlety there, but that's beyond my point. Replace what I said with "index breaks the level of abstraction of 'text as a sequence of unicode scalar values'" and I will stand by it.

How does this break the level of abstraction and why does that matter?

If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance).

Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.

It is an arbitrary and fairly strong restriction, but it at least provides a well-defined threshold to control the growth of the Data.Text API. It also puts the onus on users who would want to import or reimplement index to convince themselves or their managers that what they want to do cannot be done using the safe and well-abstracted API.

Bodigrim commented 2 years ago

@oberblastmeister why are https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 not sufficient for your purposes?

oberblastmeister commented 2 years ago

I think those functions would be good. I understand now that Text should be representation-agnostic. Thank you for explaining it to me.

oberblastmeister commented 2 years ago

I thought about this for a while and am reopening, because I think this is useful and I have also thought of a good compromise.

First, this function is useful, especially for compilers, parsers, and text-processing algorithms. In a compiler, storing only start and end UTF-8 indices can be much more efficient than storing lines and columns or some other representation. It is also flexible, because UTF-8 indices can be converted to other kinds of positions easily and efficiently. This is also useful for parsers, because tracking only byte-level indices and converting them into other positions afterwards is much faster. For example, I made a parser combinator library that is 8-10x faster than attoparsec and megaparsec, and it also uses byte-level indices.
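As a rough sketch of what I mean (byteOffsetToLineCol is a made-up helper; it assumes text-2.0's Data.Text.Foreign.takeWord8 and that the offset lands on a code point boundary), a compiler can store only byte offsets and recover line/column positions on demand:

import qualified Data.Text as T
import qualified Data.Text.Foreign as TF

byteOffsetToLineCol :: Int -> T.Text -> (Int, Int)
byteOffsetToLineCol off src = (line, col)
  where
    prefix = TF.takeWord8 off src                        -- bytes before the position
    line   = T.count (T.singleton '\n') prefix           -- 0-based line number
    col    = T.length (T.takeWhileEnd (/= '\n') prefix)  -- 0-based column, in Chars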

I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?

spanM is good, but it doesn't cover use cases that are even a little more complex. Having indices allows one to move in any direction, while spanM can only go forwards. If you want to write your own text search algorithm, you cannot use spanM.

@oberblastmeister why are https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 not sufficient for your purposes?

~Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception.~ By contrast, UTF-8 indexing returns a Maybe, so there is no unsafety. If we were to add this function, it should not be placed in an unsafe module, because nothing about it is unsafe.

If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance).

Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.

To solve this, we should name the function sliceUtf8 and put it in a specialized module such as Data.Text.Encoding. No abstraction is broken, because the indexing function is not in the main API and is explicitly marked as UTF-8. If text decides to change its internal encoding, the function would still work, albeit more slowly. There is no reason why sliceUtf8 should break the abstraction of text any more than decodeUtf8 does. In fact, you can already implement sliceUtf8 with decodeUtf8, encodeUtf8, and bytestring indexing.
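For example, a non-O(1) reference implementation along those lines might look like this (just a sketch to show the semantics are already expressible with the public API; the proposed library version would operate on the underlying buffer in O(1)):

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Slice by UTF-8 byte offsets, returning Nothing if an offset is out of
-- range or falls in the middle of a code point.
sliceUtf8 :: Int -> Int -> T.Text -> Maybe T.Text
sliceUtf8 start end t
  | start < 0 || end < start || end > BS.length bytes = Nothing
  | otherwise =
      -- re-decoding fails exactly when an offset is mid-code-point,
      -- because the original Text is valid UTF-8
      either (const Nothing) Just
        (decodeUtf8' (BS.take (end - start) (BS.drop start bytes)))
  where
    bytes = encodeUtf8 t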

Bodigrim commented 2 years ago

Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception.

What's unsafe w.r.t. takeWord8 and dropWord8?

oberblastmeister commented 2 years ago

I should have been more careful, because they are actually not unsafe. However, I think the interface is not ergonomic and the semantics are weird. Slicing is a more common operation than taking and dropping; I can't think of a use case where I would want taking and dropping over slicing. Silently accepting invalid UTF-8 indices seems dubious, and it is also probably slower than just checking whether the given index is valid. Also, the Foreign module seems to be semi-internal, which doesn't seem like a good thing to rely on. It's not clear what should happen if text changes its internal representation.