thunderbird / thunderbird-android

K-9 Mail – Open Source Email App for Android
https://k9mail.app/
Apache License 2.0

Improve URL detection to support unencoded non-ASCII characters #4080

Open JsBergbau opened 5 years ago

JsBergbau commented 5 years ago

K9-Mail builds links incorrectly. That's very annoying when there is a mail from url watch like "CHANGED: Google (https://www.google.de/maps)". You can click that link, but the URL won't be found because a closing bracket is appended to the end of the URL, and that can't work.

Expected behavior

The link is built correctly

Actual behavior

A closing bracket is appended to the link

Steps to reproduce

  1. Send a text-only mail like "See here (https://www.google.de/maps)"
  2. View that email in K9-Mail and click the link. You'll get a "not found" error because it tries to open "https://www.google.de/maps)"
  3. Interestingly, sending "See here (https://www.google.de)" works correctly, so as soon as there is a slash (/) in the URL, K9-Mail no longer builds the URL correctly. Thunderbird handles it correctly, by the way. You can send such a text-only mail with: echo "See here (https://www.google.de/maps)" | mail -s test yourmail@company.com

Environment

K-9 Mail version: 5.600
Android version: 8.0.0
Account type (IMAP, POP3, WebDAV/Exchange): IMAP

Please take some time to retrieve logs and attach them here:

cketti commented 5 years ago

Parentheses are allowed characters in the path of an http URL. In emails URLs are typically enclosed in angle brackets for this reason, e.g. <https://domain.example/some/path>.
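
To illustrate why angle brackets help, here is a minimal Python sketch. The pattern and the function name `extract_bracketed_urls` are assumptions made for this example, not K-9 Mail's code. With explicit delimiters, the extractor takes everything between `<` and `>` and never has to guess whether a `)` belongs to the path.

```python
import re

# Hypothetical pattern for this sketch: an http(s) URL wrapped in angle brackets.
# Everything between "<" and ">" is taken as the URL, so ")" in the path is unambiguous.
ANGLE_URL = re.compile(r"<(https?://[^<>\s]+)>")


def extract_bracketed_urls(text):
    """Return the URLs that the sender enclosed in angle brackets."""
    return ANGLE_URL.findall(text)


print(extract_bracketed_urls("See here (<https://domain.example/some/(path)>)"))
# ['https://domain.example/some/(path)']
```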

JsBergbau commented 4 years ago

OK, then the detection should work like this: if the URL is preceded by an opening bracket, then the closing bracket should not be considered part of the URL.

Flexmaen commented 4 years ago

I ran some tests with different mail clients on tricky URLs and am posting my results here. K9 is the one with the black background.

K9 always assumes ) to be part of the URL. It gets even trickier if the URL contains brackets.

K9 also fails when the URL contains umlauts (e.g. "ü" in this case), although they can be part of a URL. So I'd consider this a bug rather than an enhancement.

[Screenshots: url_test, url_test_gmx, url_test_k9]

mikini commented 4 years ago

I just had a similar issue receiving a link from translatewiki.net containing square brackets, which K-9 also didn't parse correctly (an opening square bracket ends link parsing).

Intended url: https://translatewiki.net/w/i.php?title=Translating_talk:OpenStreetMap&offset=20200416225206&lqt_mustshow=57558#About_[[Osm:Browse.in_changeset/de]]_57558

Parsed url: https://translatewiki.net/w/i.php?title=Translating_talk:OpenStreetMap&offset=20200416225206&lqt_mustshow=57558#About_

K-9 version: 5.708 (latest from F-Droid)

Screenshot: Screenshot_20200417-131740

cketti commented 3 years ago

URL detection in text is great fun. GitHub could be better, too :)

This is a boring URL: https://domain.example/path
Text <https://domain.example/path>
Text (https://domain.example/path)
Text (https://domain.example/path).

This is a URL containing parentheses: https://domain.example/(path)
Text <https://domain.example/(path)>
Text (https://domain.example/(path))
Text (https://domain.example/(path)).

This is a URL containing unmatched parentheses: https://domain.example/(path))
Text <https://domain.example/(path))>
Text (https://domain.example/(path)))
Text (https://domain.example/(path))).

This is a URL ending in a dot: https://domain.example/path.
Text <https://domain.example/path.>
Text (https://domain.example/path.)
Text (https://domain.example/path.).

This is a URL ending in a question mark: https://domain.example/path?
Text <https://domain.example/path?>
Text (https://domain.example/path?)
Text (https://domain.example/path?).

Pull request #4996 will improve detection for URLs wrapped in parentheses and/or ending in punctuation that probably signifies the end of the sentence rather than being part of the URL. Especially when unmatched parentheses are part of the URL things get tricky and reasonable people can disagree on what should be done. I opted to include as much as possible and only remove one closing parenthesis if the URL is preceded by an opening parenthesis.
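
The heuristic described above can be sketched roughly as follows. This is a Python illustration of one possible reading of that description, not the code from the pull request. The candidate pattern, the set of trailing punctuation characters, the handling of angle brackets, and the name `find_urls` are all assumptions of this sketch.

```python
import re

# Candidate: scheme followed by anything except whitespace and angle brackets.
CANDIDATE = re.compile(r"https?://[^\s<>]+")
# Characters assumed to end a sentence rather than a URL (an assumption of this sketch).
TRAILING_PUNCTUATION = ".,;:!?"


def find_urls(text):
    """Return the URLs found in text after trimming likely non-URL trailers."""
    urls = []
    for match in CANDIDATE.finditer(text):
        start, end = match.span()
        url = match.group()
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        if before == "<" and after == ">":
            # Explicitly delimited by the sender: keep the URL verbatim.
            urls.append(url)
            continue
        url = url.rstrip(TRAILING_PUNCTUATION)
        if before == "(" and url.endswith(")"):
            # Drop exactly one closing parenthesis, then any punctuation it exposed.
            url = url[:-1].rstrip(TRAILING_PUNCTUATION)
        urls.append(url)
    return urls


print(find_urls("Text (https://domain.example/(path))."))
# ['https://domain.example/(path)']
print(find_urls("Text https://domain.example/(path)) more"))
# ['https://domain.example/(path))']
```

Only one closing parenthesis is removed, and only when the URL is directly preceded by an opening one, so unmatched parentheses inside the URL are kept.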


In the cases where K-9 Mail doesn't detect the whole URL it is technically right. Those characters are not allowed in unencoded form in URLs. However, copying such URLs to the address bar of a browser does the right thing. So we should probably extend the URL detection to also allow such "display URLs".
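
To make the difference concrete, here is a small Python sketch comparing a strict ASCII-only pattern with a lenient one that also accepts unencoded non-ASCII characters. Both patterns and the sample text are assumptions made for illustration. They are not K-9 Mail's actual detection rules.

```python
import re

# ASCII characters that may appear in a URL (RFC 3986 unreserved, reserved, and "%").
ASCII_URL_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# Strict: stops at the first character outside the ASCII set.
STRICT = re.compile(r"https?://[" + ASCII_URL_CHARS + r"]+")
# Lenient ("display URL"): additionally accepts any non-ASCII, non-whitespace character.
LENIENT = re.compile(r"https?://(?:[" + ASCII_URL_CHARS + r"]|[^\x00-\x7F\s])+")

text = "See https://domain.example/müller for details"

print(STRICT.search(text).group())
# https://domain.example/m
print(LENIENT.search(text).group())
# https://domain.example/müller
```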

Flexmaen commented 3 years ago

Sure, URLs get tricky, especially when they end with . or ). However, umlauts (äöüß) can be part of a URL, so there is no need to stop at them; there is nothing to guess.

sicherist commented 3 years ago

I can confirm that Umlauts break URL rendering.

paulchen commented 1 year ago

This issue affects not only German umlauts but all languages that do not use the Latin alphabet.

I'm subscribed to the daily-image-l@lists.wikimedia.org mailing list. As the images which are chosen to be "Picture of the Day" on Wikimedia Commons are taken at locations and by people around the world, every other day I receive an email containing a link that I cannot click in K9. However, in Thunderbird all links work.

Unfortunately, the web archive of that mailing list doesn't handle encodings correctly. Therefore, example links cannot be taken from there.

Here are some examples of links from recent weeks in different languages:

cketti commented 1 year ago

Special characters in URLs need to be encoded, otherwise it's not a valid URL. Browsers decode special characters when displaying the URL in the address bar. But when copying the URL to the clipboard, special characters are properly encoded. There's no reason why a "display URL" should end up in the plain text part of an email. If it does, that should be considered a mistake the sender should fix on their side.

Whether we'll add support for display URLs remains to be seen. But it will always be a way to support broken emails, not the right thing to do.

Ask the senders of such emails to fix their code so only properly encoded URLs are included in their emails.
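
For senders who want to check their output, here is a minimal Python sketch of percent-encoding a display URL before it goes into the plain text part. The helper name `encode_display_url` and the chosen `safe` sets are assumptions of this sketch. It only handles path, query, and fragment, and assumes the host name is already ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" is included so already-encoded input is not encoded twice.
SAFE_PATH = "/%:@!$&'()*+,;="
SAFE_QUERY = SAFE_PATH + "?"


def encode_display_url(url):
    """Percent-encode non-ASCII characters (as UTF-8) in path, query, and fragment."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumed to be ASCII already; no IDNA handling in this sketch
        quote(parts.path, safe=SAFE_PATH),
        quote(parts.query, safe=SAFE_QUERY),
        quote(parts.fragment, safe=SAFE_QUERY),
    ))


print(encode_display_url("https://domain.example/müller?q=straße"))
# https://domain.example/m%C3%BCller?q=stra%C3%9Fe
```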

Flexmaen commented 1 year ago

> Special characters in URLs need to be encoded, otherwise it's not a valid URL.

Are you sure? I remember a discussion on Chromium where they didn't want to encode the URLs, since they said the URLs are valid anyway and wrong recognition is the other side's problem... I think they now do decode some things, but opinions were divided on that, and there are still too many URLs with umlauts etc.

JsBergbau commented 1 year ago

This problem also occurs with text-only emails, where there is normally no special encoding for URLs. Nevertheless, K9-Mail is helpful enough to turn the URL into a tappable link, so you don't have to copy the text to open it in your browser.

Just have a look at how GitHub does it here: the link with brackets in the first post is built correctly, and K9-Mail should also use this behaviour.

yaomtc commented 7 months ago

What's wrong with the email address in this tweet? https://twitter.com/RideDDOT/status/1753109640299385062