Closed cangyuyao closed 2 years ago
@cangyuyao Thanks for catching this, it was mighty foolish on my part 🤦 I've just published version 2.1.3 which fixes this bug.
@cangyuyao Thanks for catching this, it was mighty foolish on my part 🤦 I've just published version 2.1.3 which fixes this bug.
Somehow I didn't get npm package updated yet but I saw your commit. I'm afraid \b
will cause matching failure on all tags containing CJK characters.
Also I'm not sure if this works for all non-CJK tags since Twitter has a complex syntax defining the boundary before and after a valid hashtag, mention, etc.
https://github.com/twitter/twitter-text/blob/master/js/pkg/twitter-text-3.1.0.js I haven't worked with TypeScript before but I've used this script to parse twitter text in another tool. Perhaps this can also be implemented here.
@cangyuyao Thanks for following up with more information. Let me do some more testing!
@cangyuyao I've re-written the hashtag, mention, and cashtag replacement function to use indices instead of regex. It now successfully handles CJK hashtags without issue, and should handle all hashtags in any script. I've added in some tests for this, and I've just published version 2.1.4.
Thanks again!
Thanks for the update! It works perfectly most time but I just found the length of emojis are different in JavaScript. A random example here: https://twitter.com/FollowBackAddic/status/1565545829942845440
I modify the substring
function to this:
for (const entity of allEntities) {
let chars = [...text];
text =
chars.slice(0, entity.start).join("") +
entity.replacement +
chars.slice(entity.end).join("");
}
@cangyuyao I'm learning so much about the failings of JavaScript with non-Latin characters 😓 Thanks for the suggested fix! Version 2.2.1 is live now.
Found another bug: replaceEntities
needs to run before decode_1
since the indices are based on the raw text, such as &
... Not sure if decoding will affect urls though.
@cangyuyao Thank you again! I've added some new tests covering these cases. Please try out version 2.2.2!
If a tweet contains hashtags like this:
abc
ab
The parsing result will be:
Each tag link is added to the first matched phrase it came across, which may not be the correct one. I think this is a similar issue to the @mention bug #9.