kbravh / tweet-to-markdown

A command line tool to convert Tweets to Markdown.
MIT License
159 stars 9 forks source link

#hashtag parsing bug #11

Closed cangyuyao closed 2 years ago

cangyuyao commented 2 years ago

If a tweet contains hashtags like this:

abc

ab

The parsing result will be:

[[#ab](https://twitter.com/hashtag/ab) c](https://twitter.com/hashtag/abc)  
#ab

Each tag link is added to the first matched phrase it came across, which may not be the correct one. I think this is a similar issue to the @mention bug #9.

kbravh commented 2 years ago

@cangyuyao Thanks for catching this, it was mighty foolish on my part 🤦 I've just published version 2.1.3 which fixes this bug.

cangyuyao commented 2 years ago

@cangyuyao Thanks for catching this, it was mighty foolish on my part 🤦 I've just published version 2.1.3 which fixes this bug.

Somehow I didn't get npm package updated yet but I saw your commit. I'm afraid \b will cause matching failure on all tags containing CJK characters.

Also I'm not sure if this works for all non-CJK tags since Twitter has a complex syntax defining the boundary before and after a valid hashtag, mention, etc.

https://github.com/twitter/twitter-text/blob/master/js/pkg/twitter-text-3.1.0.js I haven't worked with TypeScript before but I've used this script to parse twitter text in another tool. Perhaps this can also be implemented here.

kbravh commented 2 years ago

@cangyuyao Thanks for following up with more information. Let me do some more testing!

kbravh commented 2 years ago

@cangyuyao I've re-written the hashtag, mention, and cashtag replacement function to use indices instead of regex. It now successfully handles CJK hashtags without issue, and should handle all hashtags in any script. I've added in some tests for this, and I've just published version 2.1.4.

Thanks again!

cangyuyao commented 2 years ago

Thanks for the update! It works perfectly most time but I just found the length of emojis are different in JavaScript. A random example here: https://twitter.com/FollowBackAddic/status/1565545829942845440

I modify the substring function to this:

for (const entity of allEntities) {
        let chars = [...text];
        text =
            chars.slice(0, entity.start).join("") +
            entity.replacement +
            chars.slice(entity.end).join("");
}
kbravh commented 2 years ago

@cangyuyao I'm learning so much about the failings of JavaScript with non-Latin characters 😓 Thanks for the suggested fix! Version 2.2.1 is live now.

cangyuyao commented 2 years ago

Found another bug: replaceEntities needs to run before decode_1 since the indices are based on the raw text, such as & ... Not sure if decoding will affect urls though.

kbravh commented 2 years ago

@cangyuyao Thank you again! I've added some new tests covering these cases. Please try out version 2.2.2!