theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

Hashtagging and Umlaut #2158

Open birgander2 opened 8 years ago

birgander2 commented 8 years ago

If a hashtag contains a non-ASCII character, the tag is cut off at that character. Example:

Abseilöse -> [#Abseil]öse

Mdemaillard commented 4 years ago

Seems unrelated to utf8mb4 encoding; still the case on the beta server.

scd commented 4 years ago

This is definitely not related to the four byte utf issue.

It will be a regex that is not defined properly somewhere. When I find it I may just expose it in this issue so that it is transparent, and if it breaks somewhere else then somebody else can provide me with an updated regex.

OK, I have had a quick look and the relevant code appears to be in Markdown.pm:

$CIDS::Application::Markdown::StartBoundary = '(?:^|[\s\n\.,\(]|\<p\>|\<li\>|<span\>)';
$CIDS::Application::Markdown::EndHashLookaheadBoundary = '(?=[^-_0-9A-Za-z]|$)';

$text =~ s/($CIDS::Application::Markdown::StartBoundary)#([-_0-9]*[A-Za-z][-_0-9A-Za-z]*)$CIDS::Application::Markdown::EndHashLookaheadBoundary/$1<a class=\"tags\">#$2<\/a>/g;

But this also has to match what is parsed on the server when a text containing a hashtag is saved. For my reference, this is in API/Tools.pm:

sub parseHashTags  {
  my ( $string ) = @_;
  return () unless defined $string;
  return ($string =~ m/(?:^|\s)#([-_0-9]*[A-Za-z][-_0-9A-Za-z]*)(?=[^-_0-9A-Za-z]|$)/g);
}

The start boundary for Markdown includes extra alternatives because HTML may already have been substituted into the text by that point.
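To make the failure easy to reproduce, here is a minimal sketch that runs the parseHashTags regex quoted above against the example from the original report. The sample sentence is made up for illustration; only the sub body is taken from this thread.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains UTF-8 source text

# Copy of parseHashTags as quoted above from API/Tools.pm.
sub parseHashTags {
    my ($string) = @_;
    return () unless defined $string;
    return ($string =~ m/(?:^|\s)#([-_0-9]*[A-Za-z][-_0-9A-Za-z]*)(?=[^-_0-9A-Za-z]|$)/g);
}

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample text: one tag with an umlaut, one plain ASCII tag.
my @sample_tags = parseHashTags("Route notes #Abseilöse and #trad-classic");
print join(', ', @sample_tags), "\n";    # prints "Abseil, trad-classic"
```

The end lookahead accepts any character outside `[-_0-9A-Za-z]` as a boundary, so "ö" is treated like punctuation and the tag ends at "Abseil".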

Anybody who wants to take some dev cycles off me, please feel free to expand this regex.

lordyavin commented 4 years ago

@scd I dug into it a little. I fear there is no simple fix, because the problem is not limited to German special letters. As theCrag is an international site, we need to support the word (\w) characters of every language.

A quick fix would be (?:^|\s)#([\wöäü]+), but all other special word characters would still not match.
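A small sketch of why listing letters by hand does not scale. The `/a` flag is an assumption added here: it pins `\w` to ASCII so the result does not depend on how the input string was decoded. The helper name `quick_fix_tag` and the sample tags are hypothetical.

```perl
use strict;
use warnings;
use utf8;    # the literals below are UTF-8 source text

# Quick-fix pattern from the comment above, with /a added (assumption)
# so that \w means [A-Za-z0-9_] only.
my $quick_fix = qr/(?:^|\s)#([\wöäü]+)/a;

# Hypothetical helper: return the first tag found, or undef.
sub quick_fix_tag {
    my ($text) = @_;
    return $text =~ $quick_fix ? $1 : undef;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $text ('#Abseilöse', '#Dülfer', '#tête', '#Ängelholm') {
    printf "%s -> %s\n", $text, quick_fix_tag($text) // '(none)';
}
```

The first two tags match in full, "#tête" is cut to "t", and "#Ängelholm" does not match at all because the uppercase umlaut is not in the list.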

Searching the web for a solution, I learned that there are Unicode issues with both JavaScript and Perl and that people use nasty workarounds. I found a thread on Stack Overflow that states you can rely on Perl's Unicode support (see perlunicode), so that \w matches all Unicode word characters just as it does for ASCII. But you must convert your non-ASCII, non-UTF-8 Perl scripts to UTF-8.
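As a rough sketch of that direction, the variant below widens only the letter classes of the parseHashTags regex to `\p{L}` (any letter) plus `\p{M}` (combining marks) and keeps the ASCII digits, hyphen and underscore. The name `parse_hash_tags_unicode` is hypothetical, and the sketch assumes the input reaches Perl as UTF-8 bytes that still need decoding. I have not checked how the real code path decodes its input. `use utf8` is left out because this snippet contains no non-ASCII literals; it would only be needed if the source file did.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical Unicode-aware variant of parseHashTags:
# \p{L} replaces [A-Za-z]; \p{M} additionally allows combining marks.
sub parse_hash_tags_unicode {
    my ($string) = @_;
    return () unless defined $string;
    return ($string =~ m/(?:^|\s)#([-_0-9]*\p{L}[-_0-9\p{L}\p{M}]*)(?=[^-_0-9\p{L}\p{M}]|$)/g);
}

# Raw UTF-8 bytes (as they might arrive from a request) must be decoded
# into characters before the regex sees them.
my $bytes   = "#Abseil\xC3\xB6se #tr\xC3\xA8s-dur";
my @decoded = parse_hash_tags_unicode(decode('UTF-8', $bytes));

binmode(STDOUT, ':encoding(UTF-8)');
print join(', ', @decoded), "\n";
```

Both tags come back whole here. The Markdown.pm boundary patterns would need the same change so that rendering and server-side parsing stay in agreement.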

Of course we could limit hashtags to ASCII characters only, but then we would need to warn the user whenever a hashtag contains non-ASCII characters.
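If the ASCII-only route were chosen, the check behind such a warning could look roughly like this. The helper name `non_ascii_hashtags` and the sample text are hypothetical, and treating every run of non-whitespace after "#" as a candidate is a simplification.

```perl
use strict;
use warnings;

# Hypothetical helper: return the '#...' runs that contain at least one
# non-ASCII character, so the UI could warn before the text is saved.
sub non_ascii_hashtags {
    my ($string) = @_;
    return () unless defined $string;
    # A candidate is '#' followed by a run of non-whitespace characters.
    my @candidates = ($string =~ m/(?:^|\s)#(\S+)/g);
    return grep { /[^\x00-\x7F]/ } @candidates;
}

my @flagged = non_ascii_hashtags("Nice #Abseil\x{f6}se near #slab");
print scalar(@flagged), " hashtag(s) need attention\n";    # prints "1 hashtag(s) need attention"
```

Because the test is simply "any character above 0x7F", it behaves the same on decoded strings and on raw UTF-8 bytes.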