languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[global] URL not correctly detected #6240

Open tiff opened 2 years ago

tiff commented 2 years ago

https://winrwth.qualtrics.com/jfe/form/SV_5zKV0ltPdncOQ6?Q_DL=SMu6dkwrzrDD6PT_5zKV0ltbPdncOQ6_MLRP_6gp9vW6&Q_CHL=email

This URL is not detected as a URL:

(Screenshot: Bildschirmfoto 2022-01-11 um 16 35 17)

tiff commented 2 years ago

Another case

https://webtranslateit.com/en/projects/19484-Website-languagetool-org/locales/en..de/strings/21631111

danielnaber commented 2 years ago

In the first case, the sentence detection is already wrong: it adds a sentence boundary at the ?. This might be an easy fix by extending the character set of this rule in segment.srx (I don't have time to work on it now, though):

<rule break="no"><!-- URLs without "www."-->
<beforebreak>\b(https?|ftp|file|chrome|chromium|android|(chrome|moz)\-extension):///?[A-Za-z0-9\-]+\.</beforebreak>
<afterbreak>[A-Za-z0-9\-]+(\.|\b)</afterbreak>
</rule>
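
To illustrate why this rule does not protect the ? in the first URL, here is a minimal Python sketch. It is not LanguageTool code and only approximates how an SRX engine applies a rule: a position is covered when beforebreak matches the text ending there and afterbreak matches the text starting there. The two patterns are copied from the rule above; the function name `protected_positions` is made up for this example.

```python
import re

# Patterns copied verbatim from the <beforebreak>/<afterbreak> elements above.
# \Z anchors beforebreak to the end of the text preceding the candidate position.
BEFORE = re.compile(
    r"\b(https?|ftp|file|chrome|chromium|android|(chrome|moz)\-extension)"
    r":///?[A-Za-z0-9\-]+\.\Z"
)
AFTER = re.compile(r"[A-Za-z0-9\-]+(\.|\b)")


def protected_positions(text):
    """Return the offsets at which the no-break rule would apply.

    Hypothetical helper: an approximation of SRX matching, not the real engine.
    """
    positions = []
    for i in range(1, len(text)):
        # beforebreak must match text ending at i, afterbreak text starting at i
        if BEFORE.search(text[:i]) and AFTER.match(text, i):
            positions.append(i)
    return positions


url = (
    "https://winrwth.qualtrics.com/jfe/form/SV_5zKV0ltPdncOQ6"
    "?Q_DL=SMu6dkwrzrDD6PT_5zKV0ltbPdncOQ6_MLRP_6gp9vW6&Q_CHL=email"
)

print(protected_positions(url))                         # [16]
print(url.index("?") + 1 in protected_positions(url))   # False
```

Under these assumptions, the rule only covers the first dot of the host name (right after "https://winrwth."). The position after the ? is not covered, so a separate break rule is free to split the sentence there.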

D0LLYNH0 commented 2 years ago

Taking advantage of this issue, here are a few more cases:

http://foo.com/blah_blah_(wikipedia)_(again)
http://✪df.ws/123
http://➡.ws/䨹
http://⌘.ws
http://⌘.ws/
http://foo.com/unicode_(✪)_in_parens
http://foo.com/(something)?after=parens
https://i❤.ws/emojidomain/emoj💥i

(Screenshot: Snap 022 • 15 01 2022 01h 19m 00s • chrome)