Script search - Githubissues

whereswaldon commented 12 months ago

This PR extends the work done in #85 to index known fonts by script coverage and to use that information to resolve faces when our other heuristics fail. This allows us to successfully display almost[1] any rune if the system has a font with coverage.

[1] I say "almost" because the way that I index script support can theoretically miss scripts. I only check the script of the first rune of each page of supported runes. If a script has a very narrow band of runes belonging to it, it's possible that this mechanism won't see it. However, it would need to be less than 32 runes and would need to all happen to align within a single page of our rune coverage set, which is pretty unlikely. @benoitkugler I'd love suggestions on a better way to generate this if you have any.

This PR works for me, but I've marked it as a draft because it depends upon the work in #85.

Fixes #86

benoitkugler commented 12 months ago

I like the overall logic here, nicely done 👍.

I'll open a PR with an alternative way to build the script set as soon as possible.

This point aside, I think it would be better to only compute the script set once (and to store it on the footprint), to avoid paying the cost for each new fontmap. Unless you are sure the cost may be neglected ?

whereswaldon commented 12 months ago

I definitely would like to reduce the cost. I think we may want to build this map once and serialize it as part of the index so that we don't need to build it ever again, but I figured that was overkill for a first draft. I definitely think cost is a factor. Please feel free to do this more efficiently. :D

benoitkugler commented 12 months ago

I definitely would like to reduce the cost. I think we may want to build this map once and serialize it as part of the index so that we don't need to build it ever again, but I figured that was overkill for a first draft. I definitely think cost is a factor. Please feel free to do this more efficiently. :D

Sure, it makes sense ! I'll try to do it as well.

whereswaldon commented 12 months ago

@benoitkugler A supplemental problem here is that we can end up re-doing the work to resolve the same rune over and over again. Whenever Gio re-shapes text, it redoes bidi splitting and splitting by rune coverage. If none of the faces match, we can quite easily end up doing this expensive map traversal many times. I'm thinking of adding a rune resolver cache. We can limit its size and invalidate the whole thing whenever the set of available fonts changes.

whereswaldon commented 12 months ago

The two commits I just pushed implement a simple LRU for resolving runes, allowing us to skip traversing footprints if we've done it recently. I think the cache size needs to be tunable and maybe there are better ways to go about it, but I've pushed it in the interest of showing the idea. I'm out of time for today.

benoitkugler commented 12 months ago

The last commit I've pushed will be useful to store the scriptSet on the footprint. I'll do it when we figure out the best way to create it.

benoitkugler commented 12 months ago

@whereswaldon Could you confirm that the lru cache supersedes the use of the lastFootprintIndex in ResolveFace ? I think it is now redundant and we should remove the lastFootprintIndex field.

whereswaldon commented 12 months ago

@whereswaldon Could you confirm that the lru cache supersedes the use of the lastFootprintIndex in ResolveFace ? I think it is now redundant and we should remove the lastFootprintIndex field.

It certainly does; thanks for pointing that out!

On another note, I think we're making a backwards incompatible change to the index serialization format. Should we increment the version number and retain the ability to parse the prior version? I'm not aware of anyone actually deploying applications using our font index yet, so I suppose we can still make breaking changes, but pretty soon we can't do that anymore.

benoitkugler commented 12 months ago

On another note, I think we're making a backwards incompatible change to the index serialization format. Should we increment the version number and retain the ability to parse the prior version? I'm not aware of anyone actually deploying applications using our font index yet, so I suppose we can still make breaking changes, but pretty soon we can't do that anymore.

Yes, you're right ! I had similar thoughs. We should indeed bump the version.

whereswaldon commented 11 months ago

@benoitkugler I've done my best to make the serialization logic backwards compatible and to allow applications on the same system to use different versions of the cache file. Please check my work and let me know what you think of the way I addressed allowing multiple versions of the cache file to coexist. https://github.com/go-text/typesetting/pull/87/commits/7be92018b00bd445b962106a17fe3255d3b96eca

benoitkugler commented 11 months ago

I've to admit I did not embrace the complexity of supporting several versions.. I think the version suffix is a nice solution, well done !

I've a question though: wouldn't it be acceptable to just discard the old version and build a new index from scratch (saving it with the proper suffix) when we don't find the most recent version. That would be slower then upgrading from older version, but since it only occurs once per app, it may still be alright.

whereswaldon commented 11 months ago

I've to admit I did not embrace the complexity of supporting several versions.. I think the version suffix is a nice solution, well done !

I've a question though: wouldn't it be acceptable to just discard the old version and build a new index from scratch (saving it with the proper suffix) when we don't find the most recent version. That would be slower then upgrading from older version, but since it only occurs once per app, it may still be alright.

Consider the case in which you have two applications installed that use different versions of go-text (different enough that the index file format doesn't match):

app A launches and creates index v1
app B launches and upgrades the index to v2
next time app A launches, it fails to read the index and overwrites it with a v1
next time app B launches, it upgrades the index to v2

In this scenario, at least one application is rebuilding the index on every launch from scratch.

What I implemented is the only strategy that I can see which allows multiple applications that understand different index file versions to coexist.

benoitkugler commented 11 months ago

next time app A launches, it fails to read the index and overwrites it with a v1

next time app B launches, it upgrades the index to v2

Hum.... since app B will write a "2" suffix, app A should not be impacted, should it ?

That is, my proposition is to use your suffix approach, but not bother writing code to upgrade v1 to v2 (and later v1 to v3, v2 to v3, etc...)

whereswaldon commented 11 months ago

next time app A launches, it fails to read the index and overwrites it with a v1

next time app B launches, it upgrades the index to v2

Hum.... since app B will write a "2" suffix, app A should not be impacted, should it ?

That is, my proposition is to use your suffix approach, but not bother writing code to upgrade v1 to v2 (and later v1 to v3, v2 to v3, etc...)

I think I may have misunderstood you:

wouldn't it be acceptable to just discard the old version

I thought you meant to delete the on-disk index, which would force the older app to keep rebuilding it. If you just mean to not bother loading older index versions and instead only load the current latest, that's certainly viable. It means more rebuilds total, but hopefully changes to this file format are so rare that we're talking about the difference between 0 and 5 times in the course of years. That's like a second of CPU time, so maybe not worth worrying about.

benoitkugler commented 11 months ago

I thought you meant to delete the on-disk index, which would force the older app to keep rebuilding it. If you just mean to not bother loading older index versions and instead only load the current latest, that's certainly viable. It means more rebuilds total, but hopefully changes to this file format are so rare that we're talking about the difference between 0 and 5 times in the course of years. That's like a second of CPU time, so maybe not worth worrying about.

Yes, that was point indeed ! Sorry for not being precise enough :)

In this case, I would rather choose the simpler code. We can always add the more refined approach you have written later if needed

benoitkugler commented 11 months ago

By the way, I think there is one change we may safely forecast : adding the set of languages supported by a font. It may be deduced from the runeSet, but, similarly to the scriptSet, is quite expensive to compute, so I guess we would rather store it on the footprint.

I should be able to post a PR soon if you think we should add it before the index format is widely used.

whereswaldon commented 11 months ago

Okay, in that case we don't actually need to maintain a backwards compatible deserializer at all. A lot of the changes I just made aren't actually needed. :laughing: We can simply check whether the latest index version exists and recreate it if not. No need to worry about the version number during deserialization.

whereswaldon commented 11 months ago

By the way, I think there is one change we may safely forecast : adding the set of languages supported by a font. It may be deduced from the runeSet, but, similarly to the scriptSet, is quite expensive to compute, so I guess we would rather store it on the footprint.

I should be able to post a PR soon if you think we should add it before the index format is widely used.

I don't think it's critical to get languages in there immediately, though it would definitely be good to have that.

whereswaldon commented 11 months ago

I will force-push to drop my unneeded changes and implement loading/creating only the current index file version.

benoitkugler commented 11 months ago

A lot of the changes I just made aren't actually needed. 😆

Huh.. sorry about that, not a nice feeling at all :(

whereswaldon commented 11 months ago

A lot of the changes I just made aren't actually needed. laughing

Huh.. sorry about that, not a nice feeling at all :(

No worries. We're able to shed a lot of implementation complexity this way.

whereswaldon commented 11 months ago

@benoitkugler We could just consolidate this and #85 ? Or do you see value in keeping them separate?

benoitkugler commented 11 months ago

@benoitkugler We could just consolidate this and #85 ? Or do you see value in keeping them separate?

No need to keep them separate indeed !

andydotxyz commented 11 months ago

Sorry I missed the discussion, great conclusion. Multiple indexes avoiding the complexity of multiple version parsing works nicely :)

go-text / typesetting

Script search #87