snarfed / granary

💬 The social web translator
https://granary.io
Creative Commons Zero v1.0 Universal

is AS1 tag.length characters/graphemes? or UTF-8 bytes? or...? #705

Open snarfed opened 5 months ago

snarfed commented 5 months ago

I think in practice we treat it as Unicode characters/graphemes, but I don't think we're consistent about it. E.g. I think right now we convert it to Bluesky byteStart/byteEnd directly, without encoding to bytes first.
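To make the mismatch concrete, here is a minimal sketch of converting a character-based tag span to UTF-8 byte offsets of the kind Bluesky facets expect. It is an illustration only, not granary's actual code: the helper name `char_span_to_byte_span` is hypothetical, the span is assumed to be given as a start index plus a length, and "characters" here means Python `str` indices, which are Unicode code points and not necessarily full grapheme clusters.

```python
def char_span_to_byte_span(text: str, start_index: int, length: int) -> tuple[int, int]:
    """Convert a code point span into (byte_start, byte_end) UTF-8 offsets.

    Hypothetical helper for illustration. Python str indices count code
    points, so this assumes start_index and length are code point counts.
    """
    # Bytes taken up by everything before the span.
    byte_start = len(text[:start_index].encode("utf-8"))
    # Bytes taken up by the span itself.
    span = text[start_index:start_index + length]
    byte_end = byte_start + len(span.encode("utf-8"))
    return byte_start, byte_end


text = "héllo #wörld"
tag = "#wörld"
start = text.index(tag)  # code point index of the tag

# Character indices and byte offsets diverge once non-ASCII text appears.
print((start, start + len(tag)))
print(char_span_to_byte_span(text, start, len(tag)))
```

For pure ASCII text the two spans are identical, which is probably why passing character indices straight through as byte offsets can go unnoticed.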

It doesn't matter much in practice, because most of our data sources (microformats2 HTML, fediverse AS2 activities, etc.) don't have explicit tag indices. Still, even internally, this will bite us eventually.

snarfed commented 4 months ago

We've been interpreting it as Unicode characters, i.e. graphemes. Let's standardize on that.