Parsed emojis' lengths aren't calculated consistently

justinwilaby / sax-wasm

The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.

MIT License

168 stars 8 forks source link

Parsed emojis' lengths aren't calculated consistently #57

Closed andrico1234 closed 2 years ago

andrico1234 commented 2 years ago

Describe the bug After parsing an html page that uses unicode characters like emojjis, the length of the parsed unicode character is incorrect.

To Reproduce Steps to reproduce the behavior:

Go to this repro repo
Clone and install the dependencies
run node test.js
Observe error

The repro case takes the following html as input:

📚<div href="./123/123">hey there</div>

and aims to replace the value of href with another string: 234.

The expected output would be :

📚<div href="234">hey there</div>

instead, the output is:

📚<div href=2343">hey there</div>

I don't know specifically why this is the case, but my gut reaction is that the sax-parser doesn't recognise unicode characters like 📚 may have a length greater than 1. Because this particular emoji has a length of 2, it's causing the replaceBetween function to incorrectly calculate where to replace the string.

Additional context This is made more clear when

daKmoR commented 2 years ago

We had this issue in the past as well and we worked around it by manually encoding it using the utf8 package...

I am to this day not sure why it solved the issue... I consider it a workaround as it means another dependency and you can not use streaming

const utf8 = require('utf8');

const content = utf8.encode(_content);
parser.eventHandler = (ev, _data) => { /* ... */ };
parser.write(Buffer.from(content));
parser.end();

const newContent = adjustContent(content, locationDataFromSaxWasm); // e.g. startCharacter, endCharater, ...
return utf8.decode(newContent);

Maybe this helps pin the issue down?

Let us know if there is any other way we can help 🤗

justinwilaby commented 2 years ago

The parser should absolutely take all utf-8 graphemes into account when determining start and end positions.

I'll look into this shortly.

justinwilaby commented 2 years ago

@andrico1234 - please confirm that #58 resolves your issue. If so, I'll merge and publish a patch.

Thank you for the bug report!

daKmoR commented 2 years ago

@justinwilaby yes that fixed it 🎉

you are an amazing maintainer 🤗 friendly, correct, and fast 💪

is there a way we can buy you a coffee or so? ☕

andrico1234 commented 2 years ago

+1 to that, i appreciate the fast work too!

justinwilaby commented 2 years ago

Thank you and you're are welcome!

justinwilaby commented 2 years ago

v2.1.3 has been published with this fix.

@daKmoR - Anytime you're in the Seattle area we can have coffee at one of the millions of coffee shops here!

daKmoR commented 2 years ago

Seattle is almost the other side of the world for me 🙈

but if fate puts me ever in this spot of the world - coffee will be yours ☕