mustache / spec

The Mustache spec.
MIT License

Added UTF8 tests #146

Open Krinkle opened 1 year ago

Krinkle commented 1 year ago

Upstreaming additional test cases written by @jbboehr. These were helpful in developing https://github.com/jbboehr/libmustache.

Original location: https://github.com/jbboehr/mustache-spec/commits/b96be9fd4c6d6984828d93169fe7e86d8a8aec2f

agentgt commented 8 months ago

Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.

I don't think it needs to be in the spec or have tests, but I believe this is possible in some languages. I'm not going to link to where I discussed this, as I got rather embarrassingly unhinged, but apparently if you write, say, a PHP lambda in a file saved as latin1 (the encoding the file is saved in is what the string literals become), then take a section that is UTF-8 (because the template is) and return a new template, there is a chance of corruption.

While the probability of that happening in 2023 is pretty darn low, back in the 2000s when I started my career it happened often (albeit usually with XML...).
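
To make the latin1 scenario concrete, here is a minimal Python sketch (Python standing in for PHP). The strings and the `decodes_as_utf8` helper are made up for illustration; the only assumption is that a lambda's literal bytes and the template's section bytes get concatenated without any re-encoding.

```python
# Hypothetical illustration: the lambda's literal comes from a source file
# saved as Latin-1, while the section text comes from a UTF-8 template.
section = "naïve {{name}}".encode("utf-8")   # bytes taken from the UTF-8 template
literal = "café: ".encode("latin-1")         # bytes of a literal in a Latin-1 file

# The lambda returns a "new template" by joining the raw bytes.
combined = literal + section


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(section))    # True
print(decodes_as_utf8(combined))   # False: the Latin-1 byte 0xE9 is not valid here
print(combined.decode("latin-1"))  # café: naÃ¯ve {{name}}
```

Whichever single encoding is used to read the result, one half comes out wrong: reading it as UTF-8 fails on the Latin-1 literal, and reading it as Latin-1 garbles the UTF-8 section.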

Also, if an implementation naively assumes ASCII and parses a UTF-8 template, you will get corruption, but that case is pretty obvious (you can concat ASCII bytes to UTF-8, but you cannot split UTF-8 strings as though they were ASCII).
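
A small Python sketch of that asymmetry, with a made-up string and a hypothetical `is_valid_utf8` helper: prepending ASCII bytes keeps the data well-formed, while cutting at an arbitrary byte offset can land in the middle of a character.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo': 6 bytes for 5 characters

# Concatenating ASCII bytes onto UTF-8 bytes is always fine.
joined = b"abc" + data
print(is_valid_utf8(joined))     # True

# Splitting at byte offset 2 cuts "é" (0xC3 0xA9) in half.
head, tail = data[:2], data[2:]
print(is_valid_utf8(head))       # False: ends with a lone lead byte
print(is_valid_utf8(tail))       # False: starts with a lone continuation byte
```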

That is, you cannot just naively parse byte by byte looking for the { (0x7b), as the second byte in a multi-byte character could very well be 0x7b. (I think. I'm fairly sure that UTF-8 does not say the bytes after the first in a multi-byte character have to be above 127, but maybe I'm mistaken.)

I guess it is not possible. I thought there were fringe cases with the BOM, so yes, I guess you're safe parsing byte by byte.
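
For anyone who wants to check this rather than take it on faith, here is a brute-force Python sketch (the function name is made up). It encodes every non-ASCII code point and counts the encodings that contain a byte below 0x80. UTF-8 uses 0xC2 to 0xF4 for lead bytes and 0x80 to 0xBF for continuation bytes, so the count should be zero. The loop covers about 1.1 million code points and may take a second or two.

```python
def multibyte_sequences_containing_ascii() -> int:
    """Count non-ASCII code points whose UTF-8 encoding contains an ASCII byte."""
    count = 0
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        encoded = chr(cp).encode("utf-8")
        if min(encoded) < 0x80:
            count += 1
    return count


offenders = multibyte_sequences_containing_ascii()
print(offenders)  # 0

# So a byte-level search for "{{" cannot match inside a multi-byte character.
template = "ü{{x}}".encode("utf-8")
print(template.find(b"{{"))  # 2, because "ü" occupies bytes 0 and 1
```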

However, if someone sets a delimiter above 127, then there will be problems. Maybe that is something that could be specified: delimiters must be ASCII (127 or below). But again, probably not worth it.