Closed andrico1234 closed 2 years ago
We had this issue in the past as well, and we worked around it by manually encoding the content using the utf8
package.
To this day I am not sure why it solved the issue. I consider it a workaround because it adds another dependency and means you cannot use streaming.
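// Excerpt from inside a function that receives `_content` and already has a `parser`.
const utf8 = require('utf8');
// Encode to a byte string so that every character is exactly one byte.
const content = utf8.encode(_content);
parser.eventHandler = (ev, _data) => { /* ... */ };
parser.write(Buffer.from(content));
parser.end();
// Apply the edits using the positions reported by sax-wasm (e.g. startCharacter, endCharacter, ...).
const newContent = adjustContent(content, locationDataFromSaxWasm);
// Decode the byte string back into a regular string.
return utf8.decode(newContent);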
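To make the workaround a little less mysterious, here is a minimal sketch of what the encode/decode round trip does to a string. It uses Node's built-in `Buffer` instead of the utf8 package; `toByteString` and `fromByteString` are hypothetical names, and treating them as stand-ins for `utf8.encode` / `utf8.decode` on well-formed strings is an assumption. The point it illustrates: after encoding, every character occupies exactly one string index, so an emoji can no longer make character counts and string indices disagree. Whether that is the actual reason the workaround helped is not confirmed in this thread.

```javascript
// Hypothetical stand-in for utf8.encode: one string character per UTF-8 byte.
function toByteString(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// Hypothetical stand-in for utf8.decode: reinterpret the byte string as UTF-8.
function fromByteString(byteString) {
  return Buffer.from(byteString, 'latin1').toString('utf8');
}

// Sample input invented for illustration.
const original = '<a href="/x">📚</a>';
const encoded = toByteString(original);

// In the encoded form, the code-point count and the UTF-16 length are the same.
const everyCharIsOneUnit = Array.from(encoded).length === encoded.length;

// The round trip restores the original text.
const roundTrip = fromByteString(encoded);

console.log(original.length, encoded.length, everyCharIsOneUnit, roundTrip === original);
```

The cost is the one already mentioned: the whole document has to be encoded up front, which rules out streaming.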
Maybe this helps pin the issue down?
Let us know if there is any other way we can help 🤗
The parser should absolutely take all utf-8 graphemes into account when determining start and end positions.
I'll look into this shortly.
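As a rough illustration of why start and end positions can drift, the sketch below measures the same emoji three ways and then replaces an attribute value using offsets in two different units. Everything here is an assumption for illustration: the sample HTML is invented, `codePointToUtf16Index` is a hypothetical helper, the `replaceBetween` shown is a guess at a simple slice-based implementation (the one from the report is not shown in this thread), and the idea that the reported positions count code points is not confirmed by the maintainer.

```javascript
// The same text measured in three different units.
const book = '📚';
const utf16Units = book.length;                    // 2 UTF-16 code units
const codePoints = Array.from(book).length;        // 1 code point
const utf8Bytes = Buffer.byteLength(book, 'utf8'); // 4 UTF-8 bytes

// Hypothetical helper: convert a code-point offset into a UTF-16 string index.
function codePointToUtf16Index(text, codePointOffset) {
  let index = 0;
  let seen = 0;
  for (const ch of text) { // iterates by code point
    if (seen === codePointOffset) break;
    index += ch.length;    // 1 for BMP characters, 2 for surrogate pairs
    seen += 1;
  }
  return index;
}

// Hypothetical slice-based replaceBetween; start and end are UTF-16 indices.
function replaceBetween(text, start, end, replacement) {
  return text.slice(0, start) + replacement + text.slice(end);
}

// Invented sample: the emoji appears before the value we want to replace.
const html = '<a href="/x">📚 docs</a><a href="/y">more</a>';

// Simulate a parser that reports positions in code points (an assumption).
const startCp = Array.from(html.slice(0, html.indexOf('/y'))).length;
const endCp = startCp + '/y'.length;

// Using code-point offsets directly as string indices is off by one after the emoji.
const naive = replaceBetween(html, startCp, endCp, '234');

// Converting to UTF-16 indices first replaces exactly the attribute value.
const fixed = replaceBetween(
  html,
  codePointToUtf16Index(html, startCp),
  codePointToUtf16Index(html, endCp),
  '234'
);

console.log(naive); // <a href="/x">📚 docs</a><a href=234y">more</a>
console.log(fixed); // <a href="/x">📚 docs</a><a href="234">more</a>
```

The shift grows by one index for every astral character (such as most emoji) that precedes the edited range, which is why input without emoji would not show the problem.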
@andrico1234 - please confirm that #58 resolves your issue. If so, I'll merge and publish a patch.
Thank you for the bug report!
@justinwilaby yes that fixed it 🎉
you are an amazing maintainer 🤗 friendly, correct, and fast 💪
is there a way we can buy you a coffee or so? ☕
+1 to that, I appreciate the fast work too!
Thank you, and you're welcome!
v2.1.3 has been published with this fix.
@daKmoR - Anytime you're in the Seattle area we can have coffee at one of the millions of coffee shops here!
Seattle is almost the other side of the world for me 🙈
but if fate ever puts me in that part of the world, coffee will be yours ☕
Describe the bug
After parsing an HTML page that uses Unicode characters like emojis, the length of the parsed Unicode character is incorrect.
To Reproduce
Steps to reproduce the behavior:
node test.js
The repro case takes the following HTML as input:
and aims to replace the value of href with another string: 234.
The expected output would be:
Instead, the output is:
I don't know specifically why this is the case, but my gut reaction is that the sax-parser doesn't recognise that Unicode characters like 📚 may have a length greater than 1. Because this particular emoji has a length of 2, it's causing the replaceBetween function to incorrectly calculate where to replace the string.
Additional context
This is made more clear when