ocaml-community / zed

Abstract engine for text edition in OCaml
Other
122 stars 16 forks source link

unicode support improvements: wider glyph, combined glyph #12

Closed kandu closed 5 years ago

kandu commented 5 years ago
ghost commented 5 years ago

I'm slightly worried about memory usage when I see the new types. Zed is meant to scale well to big files, for instance that why it uses ropes. With the new representation it seems like we are going to multiply the memory used to represent a buffer in memory by a non-negligible factor. Roughly, I would say by 48 on a 64 bits machine.

kandu commented 5 years ago

I agree. If we shrink Zed_char.t, to eliminate width and size is fine, then it would be Zed_char.t = { core: UChar.t; combined: UChar.t list }, which I think is best balanced between cpu/memory/logic clean or we go further, Zed_char.t= UChar.t list it's less clean but still acceptable. To support wider/combined glyph upon utf8 would be messy and slow, IMHO.

ghost commented 5 years ago

The only way to make this scale well is to make Zed_char.t and Zed_string.t abstract and make their representation as compact as possible. It shouldn't make things more complicated for users of these modules. They'll just need to use functions rather than access fields directly.

It is acceptable if random accesses in Zed_string.t values are slow. Zed_rope mitigate this problem since it puts a upper bound on the size of Zed_string.t values it embeds.

It makes things a bit more complicated, but at the end of the day scaling is hard...

kandu commented 5 years ago

With the latest commit, It's more but not maximumly compact, as a Zed_string.t is now still a bunch of utf8 strings. Which I think is a good balance point. If it's not compact enough. I'll add some functions like extract_zChar_from_utf8, get_zChar_of_utf8 and refactor the Zed_string module as well.

ghost commented 5 years ago

Well, the most important is that Zed_string.t is compact. Indeed, the expectation is that the user might create huge ropes, and the data are stored as Zed_string.t values inside ropes.

For Zed_char.t, it is important as well but not as much, given that is rare to have large structures storing a lot of Zed_char.t values. In practice, the record representation you had with only the first two fields seems acceptable and preferable to me: it is only one more word in general compared to the UTF-8 representation and operations on it are faster.

kandu commented 5 years ago

Zed_string.t is now encoded in utf8. And all rec functions in Zed_char & Zed_string are tail-recursive to deal with big files better.

ghost commented 5 years ago

Thanks. That seems good to me. Regarding the review, I'll be honest and I probably won't have the time to review this patch before at least three weeks. If you could find someone else to review it, that would be great. I'm personally happy with the change as long as a second pair of eye has gone over the code.

If no one else is available, ping me in three weeks.

ghost commented 5 years ago

I've been thinking about this more. I have less and less time to spend on these projects, and realistically I will not find the time to do a thorough review of the patches, test them, release them and provide follow-up support for them. Still, it seems to me that the feature your are proposing is valuable for users and that you already put a non-negligible amount of effort towards it.

Given that this is a now a community maintained project, I propose to let you do all the above tasks yourself. More precisely, I propose to give your write access to zed, lambda-term and utop so that you can merge these patches, release them in opam and provide follow-up support. Just to be clear, by follow-up support, I would expect that you do the following:

I ask that you do this for a period of at least one year. After one year, it's be up to you whether you want to continue providing support or not. I will also do my best to answer questions you might have about the existing code of these projects.

Does that work for you?

kandu commented 5 years ago

Thanks for your trust.

Next, I'll stabilize the API, test and fix potential bugs, optimize the performance. I should mention that neither ocaml nor english is my native language. Sometimes I'll ask for some help on polishing code, doc, writing change logs, release notes. But I think that's all right. let's get started as you said.

ghost commented 5 years ago

Sounds good, and no problem for the language, I'm not a native english speaker either :)

@pmetzger I don't seem to have access to the settings for zed and lambda-term, could you add @kandu as a collaborator? Thanks

pmetzger commented 5 years ago

@diml I'd feel more comfortable adding someone after an initial successful contribution rather than before, but if you're okay adding them now, we could do that.

ghost commented 5 years ago

I hear you. In this case, the work to do was quite big and non-trivial and the person did it and promptly adressed the points I was concerned about. So I'm thinking that @kandu must be serious about contributing and that we can make an exception in this case. So I'm happy to add @kandu now.

pmetzger commented 5 years ago

More precisely, I propose to give your write access to zed, lambda-term and utop so that you can merge these patches, release them in opam and provide follow-up support

Okay, @diml, @kandu is now added with write access to all three projects.

ghost commented 5 years ago

Thanks

kandu commented 5 years ago

Thanks

pmetzger commented 5 years ago

Should we do a new release of Zed?

kandu commented 5 years ago

After the latest tweak of Zed_string.sub & Zed_string.sub_ofs, I think the API is stable enough to make a new release.

pmetzger commented 5 years ago

Please arrange for new releases of these tools then.