unicode support improvements: wider glyph, combined glyph

kandu commented 5 years ago

new module Zed_char, support wider/combined glyphs
new module Zed_string, new string type that consists of Zed_char.t
refactored Zed_rope and Zed_re, depends on the new modules above
Zed_curosr, new changes type, distinguish postion-change between width-change, so does the functions in this module.

ghost commented 5 years ago

I'm slightly worried about memory usage when I see the new types. Zed is meant to scale well to big files, for instance that why it uses ropes. With the new representation it seems like we are going to multiply the memory used to represent a buffer in memory by a non-negligible factor. Roughly, I would say by 48 on a 64 bits machine.

kandu commented 5 years ago

I agree. If we shrink Zed_char.t, to eliminate width and size is fine, then it would be Zed_char.t = { core: UChar.t; combined: UChar.t list }, which I think is best balanced between cpu/memory/logic clean or we go further, Zed_char.t= UChar.t list it's less clean but still acceptable. To support wider/combined glyph upon utf8 would be messy and slow, IMHO.

ghost commented 5 years ago

The only way to make this scale well is to make Zed_char.t and Zed_string.t abstract and make their representation as compact as possible. It shouldn't make things more complicated for users of these modules. They'll just need to use functions rather than access fields directly.

It is acceptable if random accesses in Zed_string.t values are slow. Zed_rope mitigate this problem since it puts a upper bound on the size of Zed_string.t values it embeds.

It makes things a bit more complicated, but at the end of the day scaling is hard...

kandu commented 5 years ago

With the latest commit, It's more but not maximumly compact, as a Zed_string.t is now still a bunch of utf8 strings. Which I think is a good balance point. If it's not compact enough. I'll add some functions like extract_zChar_from_utf8, get_zChar_of_utf8 and refactor the Zed_string module as well.

ghost commented 5 years ago

Well, the most important is that Zed_string.t is compact. Indeed, the expectation is that the user might create huge ropes, and the data are stored as Zed_string.t values inside ropes.

For Zed_char.t, it is important as well but not as much, given that is rare to have large structures storing a lot of Zed_char.t values. In practice, the record representation you had with only the first two fields seems acceptable and preferable to me: it is only one more word in general compared to the UTF-8 representation and operations on it are faster.

kandu commented 5 years ago

Zed_string.t is now encoded in utf8. And all rec functions in Zed_char & Zed_string are tail-recursive to deal with big files better.

ghost commented 5 years ago

Thanks. That seems good to me. Regarding the review, I'll be honest and I probably won't have the time to review this patch before at least three weeks. If you could find someone else to review it, that would be great. I'm personally happy with the change as long as a second pair of eye has gone over the code.

If no one else is available, ping me in three weeks.

ghost commented 5 years ago

I've been thinking about this more. I have less and less time to spend on these projects, and realistically I will not find the time to do a thorough review of the patches, test them, release them and provide follow-up support for them. Still, it seems to me that the feature your are proposing is valuable for users and that you already put a non-negligible amount of effort towards it.

Given that this is a now a community maintained project, I propose to let you do all the above tasks yourself. More precisely, I propose to give your write access to zed, lambda-term and utop so that you can merge these patches, release them in opam and provide follow-up support. Just to be clear, by follow-up support, I would expect that you do the following:

if a new bug is reported, that you have a look, and if it appears to be caused by your changes, you should spend some time to debug the issue and fix it
if users have trouble upgrading to the new version of zed, lambda-term or utop following the new releases because of these changes, that you help them by answering questions they might have

I ask that you do this for a period of at least one year. After one year, it's be up to you whether you want to continue providing support or not. I will also do my best to answer questions you might have about the existing code of these projects.

Does that work for you?

kandu commented 5 years ago

Thanks for your trust.

Next, I'll stabilize the API, test and fix potential bugs, optimize the performance. I should mention that neither ocaml nor english is my native language. Sometimes I'll ask for some help on polishing code, doc, writing change logs, release notes. But I think that's all right. let's get started as you said.

ghost commented 5 years ago

Sounds good, and no problem for the language, I'm not a native english speaker either :)

@pmetzger I don't seem to have access to the settings for zed and lambda-term, could you add @kandu as a collaborator? Thanks

pmetzger commented 5 years ago

@diml I'd feel more comfortable adding someone after an initial successful contribution rather than before, but if you're okay adding them now, we could do that.

ghost commented 5 years ago

I hear you. In this case, the work to do was quite big and non-trivial and the person did it and promptly adressed the points I was concerned about. So I'm thinking that @kandu must be serious about contributing and that we can make an exception in this case. So I'm happy to add @kandu now.

pmetzger commented 5 years ago

More precisely, I propose to give your write access to zed, lambda-term and utop so that you can merge these patches, release them in opam and provide follow-up support

Okay, @diml, @kandu is now added with write access to all three projects.

ghost commented 5 years ago

Thanks

kandu commented 5 years ago

Thanks

pmetzger commented 5 years ago

Should we do a new release of Zed?

kandu commented 5 years ago

After the latest tweak of Zed_string.sub & Zed_string.sub_ofs, I think the API is stable enough to make a new release.

pmetzger commented 5 years ago

Please arrange for new releases of these tools then.

ocaml-community / zed

unicode support improvements: wider glyph, combined glyph #12