jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Can't scrape Tor onion sites on Android #1174

Closed NandanDesai closed 5 years ago

NandanDesai commented 5 years ago

I've set up a SOCKS proxy using Tor and am trying to scrape a Tor onion site.

// Tor's SOCKS listener on localhost (port 7150 in this setup)
Proxy proxy = new Proxy(Proxy.Type.SOCKS, new InetSocketAddress("127.0.0.1", 7150));
Jsoup.connect("http://uj3wazyk5u4hnvtk.onion").proxy(proxy).get();

It works fine on desktop (x64 Linux), but on Android it throws java.net.UnknownHostException: Unable to resolve host "uj3wazyk5u4hnvtk.onion": No address associated with hostname.

After a little research, I found a comment in this code that says:

// Perform explicit SOCKS4a connection request. SOCKS4a supports remote host name resolution
// (i.e., Tor resolves the hostname, which may be an onion address).
// The Android (Apache Harmony) Socket class appears to support only SOCKS4 and throws an
// exception on an address created using INetAddress.createUnresolved() -- so the typical
// technique for using Java SOCKS4a/5 doesn't appear to work on Android
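For anyone unfamiliar with what an "explicit SOCKS4a connection request" looks like on the wire, here is a minimal sketch of building the request bytes by hand. It is not taken from the code quoted above; the class and method names (Socks4aRequest, build) are made up for illustration. The layout follows the SOCKS4a convention: version 4, command 1 (CONNECT), the port in network byte order, the placeholder IP 0.0.0.1 that tells the proxy to resolve the name itself, an empty user ID, and then the null-terminated hostname.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Socks4aRequest {

    // Builds the bytes of a SOCKS4a CONNECT request for host:port.
    static byte[] build(String host, int port) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x04);               // SOCKS protocol version 4
        out.write(0x01);               // command: CONNECT
        out.write((port >> 8) & 0xFF); // destination port, high byte
        out.write(port & 0xFF);        // destination port, low byte
        // Placeholder IP 0.0.0.1: signals that a hostname follows
        // and that the proxy should do the name resolution.
        out.write(0);
        out.write(0);
        out.write(0);
        out.write(1);
        out.write(0);                  // empty user ID, null-terminated
        byte[] name = host.getBytes(StandardCharsets.US_ASCII);
        out.write(name, 0, name.length);
        out.write(0);                  // hostname terminator
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] req = build("uj3wazyk5u4hnvtk.onion", 80);
        StringBuilder hex = new StringBuilder();
        for (byte b : req) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

These bytes would be written to a plain socket connected to the Tor SOCKS port. The proxy answers with an 8-byte reply whose second byte is 0x5A when the request is granted, after which the same socket carries the HTTP traffic. Because the hostname never goes through the local resolver, this avoids the UnknownHostException path entirely.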

Here is my stack trace:

W/System.err: java.net.UnknownHostException: Unable to resolve host "uj3wazyk5u4hnvtk.onion": No address associated with hostname
W/System.err:     at java.net.Inet6AddressImpl.lookupHostByName(Inet6AddressImpl.java:125)
W/System.err:     at java.net.Inet6AddressImpl.lookupAllHostAddr(Inet6AddressImpl.java:74)
                  at java.net.InetAddress.getAllByName(InetAddress.java:752)
                  at com.android.okhttp.internal.Network$1.resolveInetAddresses(Network.java:29)
                  at com.android.okhttp.internal.http.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:187)
                  at com.android.okhttp.internal.http.RouteSelector.nextProxy(RouteSelector.java:156)
                  at com.android.okhttp.internal.http.RouteSelector.next(RouteSelector.java:98)
                  at com.android.okhttp.internal.http.HttpEngine.createNextConnection(HttpEngine.java:346)
                  at com.android.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:329)
                  at com.android.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:247)
                  at com.android.okhttp.internal.huc.HttpURLConnectionImpl.execute(HttpURLConnectionImpl.java:457)
                  at com.android.okhttp.internal.huc.HttpURLConnectionImpl.connect(HttpURLConnectionImpl.java:126)
                  at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:746)
                  at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:722)
                  at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:306)
                  at org.jsoup.helper.HttpConnection.get(HttpConnection.java:295)
                  at com.github.torrentfetcher.sources.ThePirateBay.parsePirateBay(ThePirateBay.java:117)

jhy commented 5 years ago

That seems like an issue with Android and outside the scope of jsoup. We don't deal with name resolution. It looks like it's not passing DNS resolution off to the SOCKS proxy.
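
To illustrate what "passing DNS off to the SOCKS proxy" means in plain java.net terms, here is a rough sketch of the usual desktop-Java technique: give the socket an unresolved address so that no local lookup happens and the proxy receives the hostname. The class and method names (UnresolvedTarget, target, torProxy) are made up for this example, and port 7150 is just the one from the report above. Per the comment quoted earlier in the thread, this approach reportedly throws on Android, so treat it as a description of the desktop behaviour rather than a fix.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, for illustration only.
class UnresolvedTarget {

    // An unresolved address carries only the hostname and port;
    // creating it performs no DNS lookup.
    static InetSocketAddress target(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // SOCKS proxy on localhost; an IP literal needs no DNS lookup either.
    static Proxy torProxy(int socksPort) {
        return new Proxy(Proxy.Type.SOCKS,
                new InetSocketAddress("127.0.0.1", socksPort));
    }

    public static void main(String[] args) {
        InetSocketAddress onion = target("uj3wazyk5u4hnvtk.onion", 80);
        Proxy proxy = torProxy(7150);
        System.out.println("unresolved: " + onion.isUnresolved());
        System.out.println("proxy type: " + proxy.type());
        // With Tor running, a desktop JVM would then connect like this:
        //   Socket socket = new Socket(proxy);
        //   socket.connect(onion);
        // The hostname is handed to the proxy instead of the local resolver.
    }
}
```

The stack trace above shows the lookup happening in InetAddress.getAllByName before any proxy traffic, which is the opposite of what this sketch does.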