Retrieve unicode range from glyph range

jean-airoldie commented 1 year ago

Hi,

I am trying to tell apart new line characters \n from other characters without a glyph ID. Because of potential ligatures, I can't really map the glyph index in the buffer (or the glyph cluster) to a unicode character. Is there a way to do this that I am missing?

My use case is that I do font shaping once and then lay out the text. The new line characters are used to force a line break.

jean-airoldie commented 1 year ago

More generally, is there a way to retrieve a unicode character range from a glyph range? Such as when multiple glyphs are selected on screen using the mouse.

RazrFalcon commented 1 year ago

rustybuzz is a very low-level library and it doesn't provide anything like that. That is the job of a text layout library.

Also, I don't think it's possible to convert glyph ID back to Unicode. Shaping is a one way process. You can match the original string characters using clusters, but that's about it. I think we had a similar question: #51

I am trying to tell apart new line characters \n from other characters without a glyph ID.

I'm not sure, but I think you should split the input string into lines before passing it to the shaper. rustybuzz operates on a single line of text.
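
A minimal sketch of that splitting step, assuming lines are separated by a plain \n and that each line is then shaped on its own. The helper name lines_with_offsets is hypothetical and not part of rustybuzz. It keeps the byte offset of every line so that positions found after shaping can still be mapped back to the full string:

```rust
/// Splits `text` on '\n' and returns each line together with the byte
/// offset at which it starts in `text`.
/// Hypothetical helper for illustration; not a rustybuzz API.
fn lines_with_offsets(text: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    for line in text.split('\n') {
        out.push((start, line));
        // Skip past this line and the '\n' that terminated it.
        start += line.len() + 1;
    }
    out
}
```

Each returned line could then be passed to the shaper separately, and the stored offset added to any index reported for that line.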

I personally use rustybuzz only for static text layout, therefore I cannot comment on the interactive use case.

jean-airoldie commented 1 year ago

Ok, that's what I thought.

Also, I don't think it's possible to convert glyph ID back to Unicode. Shaping is a one way process.

It would be possible if rustybuzz kept track of the original unicode range associated with each glyph and returned it in the GlyphInfo struct. In that case I would be able to refer back to the original string and detect that an unknown glyph is actually a \n, for instance. However, I understand that's probably out of scope for this project since you are aiming to follow HarfBuzz's design.

I'm not sure, but I think you should split the input string into lines before passing it to the shaper. rustybuzz operates on a single line of text.

That would solve my new line character issue, but it would still be a pain to deal with (and slow). And I still wouldn't be able to retrieve the unicode range anyway.

behdad commented 1 year ago

It would be possible if rustybuzz kept track of the original unicode range associated with each glyph and returned it in the GlyphInfo struct. In that case I would be able to refer back to the original string and detect that an unknown glyph is actually a \n, for instance. However, I understand that's probably out of scope for this project since you are aiming to follow HarfBuzz's design.

HarfBuzz does this, in its cluster member. I believe rustybuzz does the same.

RazrFalcon commented 1 year ago

This is honestly out of my area of expertise. rustybuzz is harfbuzz in Rust. If you want to do something unusual with it - try doing it with harfbuzz first. If it's not possible in harfbuzz then it will not be possible in rustybuzz either. There are no plans on having any additional features beyond what harfbuzz already provides.

jean-airoldie commented 1 year ago

There are no plans on having any additional features beyond what harfbuzz already provides.

Ok.

HarfBuzz does this, in its cluster member. I believe rustybuzz does the same.

Indeed rustybuzz does have a cluster member, but I wasn't aware that it referred to the unicode grapheme cluster, although that makes sense thinking back. I'll try it out to see if it fixes my issues.

behdad commented 1 year ago

Indeed rustybuzz does have a cluster member, but I wasn't aware that it referred to the unicode grapheme cluster, although that makes sense thinking back.

It points back to the index in the original text string corresponding to the start of the current cluster. In your case, it should point to the location of the \n.
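
As a rough sketch of that check, assuming the cluster value is a byte offset into the original UTF-8 string (which I believe is the case when the buffer is filled from a &str, but treat that as an assumption). The function name is_newline_cluster is hypothetical:

```rust
/// Returns true if the byte at `cluster` in `text` is a '\n'.
/// Assumes `cluster` is a byte offset into `text`; an out-of-range
/// value simply yields false.
/// Hypothetical helper for illustration; not a rustybuzz API.
fn is_newline_cluster(text: &str, cluster: u32) -> bool {
    text.as_bytes().get(cluster as usize) == Some(&b'\n')
}
```

The cluster value would come from the cluster field of each shaped glyph's GlyphInfo.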

jean-airoldie commented 1 year ago

It points back to the index in the original text string corresponding to the start of the current cluster. In your case, it should point to the location of the \n.

Yes, that should work then. In the case of ligatures and complex clusters, I can deduce the unicode range by looking at the cluster index of the next glyph, or the end of the string if there is no next glyph.
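
Something like the following is what I have in mind. It is only a sketch under a few assumptions: the text is left-to-right so cluster values never decrease along the glyph run, the cluster values are byte offsets into the original string, and the function name glyph_text_ranges is made up for illustration. Glyphs that share a cluster value get the same range:

```rust
use std::ops::Range;

/// For each glyph, computes the byte range of the original text that its
/// cluster covers.
/// `clusters` holds the cluster value of every glyph in output order and is
/// assumed to be non-decreasing (left-to-right text).
/// `text_len` is the byte length of the original string.
/// Hypothetical helper for illustration; not a rustybuzz API.
fn glyph_text_ranges(clusters: &[u32], text_len: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::with_capacity(clusters.len());
    for (i, &c) in clusters.iter().enumerate() {
        let start = c as usize;
        // The range ends where the next *different* cluster begins,
        // or at the end of the string if no later cluster differs.
        let end = clusters[i + 1..]
            .iter()
            .map(|&n| n as usize)
            .find(|&n| n != start)
            .unwrap_or(text_len);
        ranges.push(start..end);
    }
    ranges
}
```

For a glyph selection, the text range would then run from the start of the first selected glyph's range to the end of the last one. Right-to-left runs would need the comparison reversed, which this sketch does not handle.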

behdad commented 1 year ago

Correct.

jean-airoldie commented 1 year ago

Thanks a lot!

I'll submit a PR later to make this clearer in the doc.

behdad commented 1 year ago

https://harfbuzz.github.io/clusters.html

jean-airoldie commented 1 year ago

Yeah, I meant the rustybuzz doc.

harfbuzz / rustybuzz

Retrieve unicode range from glyph range #67