google / physical-web

The Physical Web: walk up and use anything
http://physical-web.org
Apache License 2.0
5.99k stars 665 forks source link

Android - non-ASCII URl issue #517

Open edent opened 9 years ago

edent commented 9 years ago

Added the following URl https://莎士比亚.org/

screenshot_2015-09-25-15-53-30

When displayed in the app, it is corrupt.

screenshot_2015-09-25-15-53-54

Clicking on the link goes to to the wrong location.

screenshot_2015-09-25-15-54-08

If I use the punycode representation xn--jlQ54W7yPemW.org everything works correctly.

screenshot_2015-09-25-16-15-43

Although the Title and Favicon don't load.

g-ortuno commented 9 years ago

Well per the Eddystone Spec: URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 00-20 and 7F-FF hexadecimal are not used. See “Excluded US-ASCII Characters” in RFC 2936.

So I believe this is Working as intended. If you want to use non ascii characters you need to open an issue on the Eddystone repo.

cco3 commented 9 years ago

Or, you can use a URL shortener.

Maybe what we should be doing here is automatically detecting these cases and sending it off to a shortener.

dinhvh commented 9 years ago

I think punnycode / IDN encoding (https://en.wikipedia.org/wiki/Internationalized_domain_name) is the correct way to have a internationalized URL to be encoded into US-ASCII characters. I think what the app fails to do here is: 1/ to encode the internationalized URL to US-ASCII when programming the beacon. 2/ to decode the encoded URL / domain name to a more readable string 3/ I'm not sure why the punnycode encoded URL won't load the metadata.

1/ and 2/ are features that needs to be implemented. 3/ probably needs more investigation.

dinhvh commented 9 years ago

The URL seems to be resolved properly by the PWS:

$ curl -d '{"objects":[{"url":"https://xn--jlq54w7ypemw.org/"}]}' https://url-caster.appspot.com/resolve-scan
{
    "metadata": [
        {
            "description": "Terence Eden Terence Eden \u6cf0\u7279\u65af\u00b7\u5b89\u5fb7\u6d1b\u5c3c\u514b\u65af Titus Andronicus \u79d1\u5229\u5965\u5170\u7eb3\u65af Coriolanus \u4ea8\u5229\u56db\u4e16\u4e0a\u7bc7 Henry IV - Part I \u88d8\u529b\u65af\u00b7\u51ef\u6492 Julius Caesar \u7686\u5927\u6b22\u559c As You Like It \u7ec8\u6210\u7737\u5c5e All's Well That Ends Well \u5965\u745f\u7f57 Othello \u7ea6\u7ff0\u738b King John \u96c5\u5178\u7684\u6cf0\u95e8 Timon of Athens \u4ef2\u590f\u591c\u4e4b\u68a6 A Midsummer Night's Dream \u4ea8\u5229\u516d\u4e16\u4e2d\u7bc7 Henry VI - Novella \u8f9b\u767d\u6797 Cymbeline \u7f57\u5bc6\u6b27\u4e0e\u6731\u4e3d\u53f6 Romeo and Juliet \u4ea8\u5229\u516d\u4e16\u4e0b\u7bc7 Henry VI - Part II \u51ac\u5929\u7684\u6545\u4e8b A Winter's Tale \u4ea8\u5229\u4e94\u4e16 Henry V \u9ea6\u514b\u767d Macbeth \u7406\u67e5\u4e09\u4e16 Richard III \u7406\u67e5\u4e8c\u4e16 Richard II \u4ea8\u5229\u56db\u4e16\u4e0b\u7bc7 Henry IV - Part II \u66b4\u98ce\u96e8 The Tempest \u4ea8\u5229\u516b\u4e16 Henry VIII \u6e29\u838e\u7684\u98ce\u6d41\u5a18\u513f\u4eec The Merry Wiv",
            "displayUrl": "https://xn--jlq54w7ypemw.org/",
            "icon": "https://url-caster.appspot.com/favicon?url=https%3A%2F%2Fxn--jlq54w7ypemw.org%2Fapple-touch-icon-iphone.png",
            "id": "https://xn--jlq54w7ypemw.org/",
            "rank": 1000,
            "title": "\u838e\u58eb\u6bd4\u4e9a.org / Readable Shakespeare Plays In Chinese",
            "url": "https://xn--jlq54w7ypemw.org/"
        }
    ]
}
edent commented 9 years ago

@dinhviethoa yes, I'd agree with that. The app should be normalising & decoding.

@cco3 automatic shortening would also be helpful - it currently just throws an error message.

nondebug commented 9 years ago

IMO there are three changes to make.

First, we should detect non-ASCII URLs and automatically send those to the shortener as cco3 suggests. We could try to puny-encode them ourselves but chances are the URL will be too long anyway.

Second, PWS should detect punycode URLs and send the punycode version as siteUrl (PWS calls this field "url") and the Unicode version as displayUrl. Currently, the punycode version is sent for both fields.

Third, the error message when the auto-shortener chokes on a unicode URL needs to be updated. It says "URL is too long. Please use a shortener" which is confusing if the URL is clearly short enough. Fortunately we should never see this since goo.gl is excellent with international URLs.

mmocny commented 9 years ago

I like the plan, but I'm thinking that PWS should not return Unicode is displayUrl.

Choosing to display the unicode version sounds like a client decision, and I'm not sure the PWS should be making client UX choices.

It's also not always going to be clear to PWS when a URL is punycode (or to the client for that matter), so perhaps we can just leave the URL in punycode?

edent commented 9 years ago

I'd like to politely challenge that notion.

No one is going to want to see the punycode version. It's impossible for anyone to decipher what it is supposed to represent. I would suggest that displaying the Unicode URl makes the most sense - it will display in the correct character set for the user.

A URl is punycode if it starts with xn-- http://www.atm.tut.fi/list-archive/ietf-announce/msg13572.html

The standard is relatively simple http://tools.ietf.org/html/rfc3492

Overall, adding this will aid adoption for people in non-ASCII territories.

dinhvh commented 9 years ago

I tried the unicodes URL both in Chrome and Safari on the desktop.

I'd say the experience was more enjoyable to me in Safari.

mmocny commented 9 years ago

I don't propose we enforce showing punycode, I'm just suggesting that the client can detect and convert if it so chooses, without the PWS making the choice for it. By returning unicode in displayUrl, we would be assuming all clients are able to display/use it.

It think the intention of displayUrl was originally meant to represent "the url which will open in browser", as opposed to "the url which we actually fire intent on". It was used to hide our use of a redirector (which we have long since dropped) -- so displayUrl is currently "unused".

So displayUrl was never meant to actually make UX level decisions about what clients should display, and I wonder if we shouldn't remove the value altogether.

Also, I think clients will want to actually process displayUrl in all sorts of ways. E.g. if displayUrl is http://www.foobarbaz.com/path/page.html#somevariable clients may choose to just show www.foobarbaz.com or just http://www.foobarbaz.com/path/page.html#... etc.

So I guess I agree our clients shouldn't show punycode, but I think we should not update PWS displayUrl as was previously suggested.

Thanks for pointing out that punycode is unambiguous, I only read the encoding format (suffix), and not the spec for prefix.