michaelklishin / urly

A tiny Clojure library that parses and attempts to unify URIs, URLs and relative values found in real world HTML anchors

Parsing URLs with IDN. #9

Open residentsummer opened 10 years ago

residentsummer commented 10 years ago

I was trying to use urly to convert Unicode domain names to Punycode (using .mutateHost), and here is what I found:

(require '[clojurewerkz.urly.core :as urly])
(import '[java.net URL URI IDN])

(let [my-idn-url "http://фитомаркет-онлайн.рф/test.html"
      url-like (urly/url-like my-idn-url)
      url (URL. my-idn-url)
      uri (URI. my-idn-url)
      all [url-like uri url]]
  (doall (map println all))
  ; #<UrlLike http:/test.html>
  ; #<URI http://фитомаркет-онлайн.рф/test.html>
  ; #<URL http://фитомаркет-онлайн.рф/test.html>
  (doall (map #(println (.getHost %)) all))
  ; nil
  ; nil
  ; фитомаркет-онлайн.рф
  (doall (map #(println (.getAuthority %)) all))
  ; фитомаркет-онлайн.рф
  ; фитомаркет-онлайн.рф
  ; фитомаркет-онлайн.рф
  (doall (map (comp println urly/url-like) [uri url]))
  ; #<UrlLike http:/test.html>
  ; #<UrlLike <malformed URI>>  <-- That's weird
  ;
  ; And here is a workaround: build the UrlLike from the java.net.URL instance
  (let [correct-url-like (urly/url-like url)
        host (.getHost correct-url-like)]
    (println correct-url-like)
    ; #<UrlLike <malformed URI>>  <-- Double weird
    (println host)
    ; фитомаркет-онлайн.рф
    (->
      correct-url-like
      (.mutateHost (IDN/toASCII host))
      (println))
    ; #<UrlLike http://xn----7sbbsnkdkeodcfy0agz.xn--p1ai/test.html>
    ; Hooray!
    ))

I'm not sure whether this is a bug in urly; more likely it's in java.net.URI. Can you confirm?

Versions:

Mac OS X 10.8.5
===
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
===
REPL-y 0.2.0
Clojure 1.5.1

michaelklishin commented 10 years ago

java.net.URI almost certainly doesn't handle IDNs. To solve this and several other limitations, Urly needs to use its own forgiving URI parser.
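
Until such a parser exists, one possible workaround is to convert the host to its ASCII (Punycode) form before java.net.URI sees it. The sketch below is only an illustration and not part of Urly: `idn->ascii-uri` is a hypothetical helper name. It relies on java.net.URL accepting the Unicode host, as the output in the report above shows. It also assumes the input path is not already percent-encoded, because the multi-argument URI constructor always quotes `%`.

```clojure
(import '[java.net URL URI IDN])

(defn idn->ascii-uri
  "Hypothetical helper (not part of Urly): parses s with java.net.URL,
  converts the host to Punycode with java.net.IDN, and rebuilds a
  java.net.URI from the individual components."
  [^String s]
  (let [url  (URL. s)
        ;; IDN/toASCII encodes each non-ASCII label as xn--...
        host (IDN/toASCII (.getHost url))]
    ;; Components missing from the input come back as nil (or -1 for the
    ;; port), which this URI constructor treats as undefined.
    (URI. (.getProtocol url)
          (.getUserInfo url)
          host
          (.getPort url)
          (.getPath url)
          (.getQuery url)
          (.getRef url))))

(let [uri (idn->ascii-uri "http://фитомаркет-онлайн.рф/test.html")]
  (println (.getHost uri))
  (println (str uri)))
```

With an ASCII-only host, `.getHost` on the resulting URI should no longer return nil, so the value can be passed on to code that expects a server-based authority.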

residentsummer commented 10 years ago

While searching today for an IDN-capable URL library, I found galimatias (https://github.com/smola/galimatias). It may be helpful if you consider moving away from the built-in Java URI/URL parser.

michaelklishin commented 10 years ago

@residentsummer thanks, that's helpful!