misskey-dev / mfm.js

An MFM parser implementation with TypeScript.
MIT License

reconsider which code points can be in hashtags #106

Open Johann150 opened 2 years ago

Johann150 commented 2 years ago

I have often seen hashtags being recognized in something that was definitely not intended to be a hashtag.

For example:

T=EÔ"º8øzÄ?k=ÿ#Ô¥ÄÀ¬"Üpz_µkýQ<)ÝÑIµë|`®ÿfeäÁ©¶Æ×çcDØ6=²Áå7À¾l|<à¾,3;«V(Cµ#ÒlP;Â0·¶R»ÛW篻ø6®9ÊëDa+¼ôà¬WG´w¾½Èírs¡Ò+p\z¿L9ÊGÞ7îR image

Also, some spacing characters can apparently be part of hashtags, which is definitely incorrect. For example, a non-breaking space (U+00A0) is recognized as part of a hashtag. https://genau.qwertqwefsday.eu/notes/901diers1g

marihachi commented 2 years ago

Are the examples you gave ones you made up yourself? What matters is how often this actually happens.

Johann150 commented 2 years ago

Neither example is mine. The first one is probably the less common case; it was taken from https://genau.qwertqwefsday.eu/notes/8zvhj6kdbj

syuilo commented 2 years ago

It might be fine to treat # as the start of a hashtag only when it is preceded by whitespace or sits at the start of a line.

syuilo commented 2 years ago

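However, this could be inconvenient in some cases, mainly for languages that are not written with spaces between words, such as Japanese. None of the following would be recognized as hashtags any more.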
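To make the trade-off concrete, here is a minimal sketch of the proposed rule. `canStartHashtag` is a hypothetical helper, not part of mfm.js, and the Japanese sample string is an illustrative assumption rather than one of the examples referred to above. The `k=ÿ#Ô¥` sample is a fragment of the binary-looking text from the original report.

```typescript
// Hypothetical helper, not part of mfm.js: decides whether the "#" at
// `index` may start a hashtag under the proposed rule (start of input,
// start of a line, or preceded by whitespace).
function canStartHashtag(text: string, index: number): boolean {
  if (text.charAt(index) !== "#") return false;
  if (index === 0) return true; // start of input
  // JavaScript's \s matches line terminators as well as spaces such as
  // U+00A0 and U+3000, so "start of a line" is covered by the same test.
  return /\s/.test(text.charAt(index - 1));
}

// Illustrative samples (assumptions, not taken from the thread unless noted).
const samples = ["#tag", "see #tag", "line\n#tag", "k=ÿ#Ô¥", "日本語#タグ"];
for (const s of samples) {
  console.log(JSON.stringify(s), canStartHashtag(s, s.indexOf("#")));
}
```

Under this rule the first three samples are accepted and the last two are rejected. That removes the false positive inside the garbled text, but it also rejects a hashtag written directly after Japanese text, which is the drawback described above.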

Johann150 commented 2 years ago

As another example, I often see people on other Fediverse software trying to separate a hashtag from the rest of a word when they only want part of the word to be the hashtag, e.g. #hash|tag. See for example https://genau.qwertqwefsday.eu/notes/8zwzta88ki

marihachi commented 2 years ago

Cases where a hashtag is misrecognized are rare to begin with, so this does not seem to be a high priority.

It might be fine to treat # as the start of a hashtag only when it is preceded by whitespace or sits at the start of a line.

Even if we went with this proposal, the drawbacks would be significant.

Also, some spacing characters can apparently be part of hashtags, which is definitely incorrect. For example, a non-breaking space (U+00A0) is recognized as part of a hashtag. https://genau.qwertqwefsday.eu/notes/901diers1g

This one does seem worth fixing.
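One possible shape for that fix, as a rough sketch only: end the hashtag body at the first whitespace code point. `readHashtagBody` is a hypothetical name, and this is not how mfm.js actually scans hashtags. It relies on JavaScript's `\s`, which matches U+00A0 and U+3000 as well as the ASCII space.

```typescript
// Hypothetical scanner, not the mfm.js implementation: returns the hashtag
// body that follows the "#" at `hashIndex`, stopping at the first
// whitespace character.
function readHashtagBody(text: string, hashIndex: number): string {
  let end = hashIndex + 1;
  // \s in JavaScript includes U+00A0 (no-break space) and U+3000
  // (ideographic space), so both terminate the hashtag here.
  while (end < text.length && !/\s/.test(text.charAt(end))) {
    end++;
  }
  return text.slice(hashIndex + 1, end);
}

// The body stops before the no-break space instead of swallowing it.
console.log(readHashtagBody("#foo\u00A0bar", 0));
```

This only addresses whitespace. Which other code points should end a hashtag (for example the `|` in `#hash|tag`) is left open here, as it is in the discussion.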