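I think in practice we treat it as Unicode characters/graphemes, but I don't think we're consistent about it. E.g. I think right now we copy those indices into Bluesky byteStart/byteEnd directly, without encoding to UTF-8 bytes first, so they'd be off whenever the text has non-ASCII characters before or inside the tag.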
It doesn't matter much in practice, because most of our data sources (microformats2 HTML, fediverse AS2 activities, etc.) don't have explicit tag indices. Still, even internally, this will bite us eventually.
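
For illustration, here's a rough sketch of what the conversion could look like. It assumes the indices count Unicode code points (which is what Python `str` slicing uses, not graphemes), and `char_to_byte_range` is a made-up helper name, not something that exists in our code today:

```python
def char_to_byte_range(text: str, start: int, length: int) -> tuple[int, int]:
    """Convert a code point start/length into UTF-8 byte offsets.

    Hypothetical helper. Assumes start and length count Unicode code
    points. Returns (byte_start, byte_end), with byte_end exclusive.
    """
    # Bytes taken up by everything before the tag.
    byte_start = len(text[:start].encode('utf-8'))
    # Bytes taken up by the tag text itself.
    byte_end = byte_start + len(text[start:start + length].encode('utf-8'))
    return byte_start, byte_end


# 'é' is one code point but two UTF-8 bytes, so the offsets shift by one.
print(char_to_byte_range('café #tag', 5, 4))  # (6, 10)
```

If the indices turn out to be graphemes instead, the slicing would need a grapheme-aware segmenter rather than plain `str` indexing.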