Open Krinkle opened 1 year ago
Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.
I don't think it needs to be in the spec or have tests, but I believe this is possible in some languages. I'm not going to reference where I discussed this, as I got rather embarrassingly unhinged, but apparently if you write, say, a PHP lambda in a file saved as latin1 (string literals take on whatever encoding the file is saved in), and that lambda takes a section that is UTF-8 because the template is UTF-8 and returns a new template, there is a chance of corruption.
While the probability of that happening in 2023 is pretty darn low, back in the 2000s when I started my career it happened often (albeit usually with XML...).
Also, if an implementation naively assumes ASCII and parses a UTF-8 template, you will get corruption, but that case is pretty obvious (you can concatenate ASCII bytes onto UTF-8, but you cannot split UTF-8 strings as though they were ASCII).
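To make the latin1 scenario concrete, here is a minimal sketch in Python rather than PHP. The strings and the `decodes_as_utf8` helper are made up for illustration; the only assumption is that a literal's bytes follow the encoding of the source file, as described above.

```python
# Hypothetical scenario: a lambda's string literal comes from a latin1 source
# file, while the section text it receives comes from a UTF-8 template.
section_utf8 = "na\u00efve".encode("utf-8")        # b'na\xc3\xafve'
literal_latin1 = "caf\u00e9: ".encode("latin-1")   # b'caf\xe9: '

# The lambda naively glues the two together and returns it as a new template.
mixed = literal_latin1 + section_utf8


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Read back as latin1, the UTF-8 part of the result turns into mojibake.
as_latin1 = mixed.decode("latin-1")
```

The mixed bytes are no longer valid UTF-8, and reading them as latin1 garbles the section text instead. A pure-ASCII literal would not have this problem, because ASCII bytes mean the same thing in both encodings.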
That is, you cannot just naively parse byte by byte looking for the { (0x7b), as the second byte in a multi-byte character very well could be 0x7b. (I think. I'm fairly sure that UTF-8 does not say the bytes after the first in a multi-byte character have to be above 127, but maybe I'm mistaken.)
I guess it is not possible. I thought there were fringe cases with the BOM, so yes, I guess you're safe parsing byte by byte.
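For anyone who wants to check this themselves, here is a small sketch that brute-forces the question. In UTF-8, every byte of a multi-byte character has its high bit set (0x80 or above), so an ASCII octet such as 0x7b can only ever be a standalone character. The function names are made up for illustration.

```python
def multibyte_chars_containing(byte_value: int) -> int:
    """Count non-ASCII code points whose UTF-8 encoding contains byte_value."""
    count = 0
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if byte_value in chr(cp).encode("utf-8"):
            count += 1
    return count


def find_open_brace_offsets(template: bytes) -> list:
    """Byte-by-byte scan for 0x7b, with no UTF-8 decoding at all."""
    return [i for i, b in enumerate(template) if b == 0x7B]


# A template that starts with a two-byte character followed by a tag.
template = "\u00fc{{x}}".encode("utf-8")   # b'\xc3\xbc{{x}}'
offsets = find_open_brace_offsets(template)
```

`multibyte_chars_containing(0x7B)` comes out as 0, and the scan above finds the braces at byte offsets 2 and 3, right after the two bytes of the first character.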
However, if someone sets a delimiter above 127, then there will be problems. Maybe that is something that could be specified: delimiters must be below 128. But again, probably not worth it.
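If an implementation did want to enforce that, the check could be as small as the sketch below. This is hypothetical: `is_ascii_delimiter` is a made-up name, and nothing like it is in the spec.

```python
def is_ascii_delimiter(delimiter: str) -> bool:
    """Return True if delimiter is non-empty and every character is below 128."""
    return len(delimiter) > 0 and delimiter.isascii()


# Example: the default delimiters pass, a guillemet does not.
results = {d: is_ascii_delimiter(d) for d in ["{{", "}}", "<%", "\u00ab"]}
```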
Upstreaming additional test cases written by @jbboehr. These were helpful in developing https://github.com/jbboehr/libmustache.
Original location: https://github.com/jbboehr/mustache-spec/commits/b96be9fd4c6d6984828d93169fe7e86d8a8aec2f