Closed xjaphx closed 11 years ago
Could you please create a gist with the input.html file you're using?
Here you are: https://gist.github.com/4630682 Oh, one thing I forgot to mention: I'm testing on Android when parsing this data.
I've tried reproducing this, with no issues. "h3 > a" returns 89 hits, "div.nav li a" gives 5. The parse tree looks fine.
Can you show us doc.html() and compare that to the html() produced by fetching from a Jsoup.connect(url).get() ? I'm trying to ID if the issue is with how you've fetched the content, or saved it, or the file load.
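A minimal sketch of that comparison (the file name and URL here are assumptions based on the thread; the selector is the one from the original report):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class CompareSources {
    public static void main(String[] args) throws IOException {
        // Parse the locally saved copy of the page
        Document fromFile = Jsoup.parse(new File("input.html"), "UTF-8");
        // Fetch the same page live
        Document fromWeb = Jsoup.connect("https://stackoverflow.com/").get();

        // If the selector counts differ, the two sources hold different HTML,
        // which points at the fetch or the save step rather than the parser
        System.out.println("file: " + fromFile.select("h3 > a").size()
                + ", web: " + fromWeb.select("h3 > a").size());
    }
}
```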
Hi John, thanks for the reply. In order to ID the issue, I think it's better to start over from where it begins.
I created two projects: one Java, the other Android. The parsing functions are the same, but the outputs are not.
I've created a sample repo you might want to check out: https://github.com/xjaphx/JSoupSample
OK. Thanks. The issue is that you are not specifying a user-agent when you fetch the URL, and so you are sending default user-agents from Java and from Android. And StackOverflow is sending you different HTML responses in return, one for desktops / crawlers (Java), and one for mobile agents (Android). The mobile version doesn't have anything that matches your selector.
I suggest using your browser's UA and setting it with http://jsoup.org/apidocs/org/jsoup/Connection.html#userAgent(java.lang.String)
Also you might like to use a debugging proxy like http://www.charlesproxy.com/ to watch HTTP traffic your apps are making.
Please give it a go and let me know your results.
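The suggestion above can be sketched like this (the UA string is just an example desktop browser string, and the timeout value is an arbitrary choice):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class FetchWithUa {
    public static void main(String[] args) throws IOException {
        // Send a desktop browser user-agent so the server returns
        // the desktop HTML instead of a mobile variant
        Document doc = Jsoup.connect("https://stackoverflow.com/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36")
                .timeout(10000) // 10s; set an explicit timeout as suggested
                .get();
        System.out.println(doc.select("h3 > a").size());
    }
}
```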
Oh wow, that's right. I'd never thought of it. After setting the User-Agent in Jsoup before parsing, the responses match. I've checked the documentation, but this isn't mentioned anywhere. It would be nice if you could add a note about this under Parser or Jsoup.connect().
Problem solved! Thanks John.
Cool -- glad we found it. Yep I'll mention in the .connect() method that it's a good idea to set the UA and a timeout. It might be a good idea to create a default UA based on a desktop browser.
I want to show text and a URL in a ListView, parsed using jsoup. Please help; I've tried a lot but haven't succeeded yet. Here is the link to my code on Stack Overflow: http://stackoverflow.com/questions/15307970/listview-of-jsoup-parsed-data-in-android
Hi. I am using Jsoup to parse a URL with the .connect(), timeout() and userAgent() methods, but I am still not able to fetch the entire page; some tags are missing.
Try:
Document doc = Jsoup.connect(url)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0) // no body size limit, so the full document is downloaded
        .timeout(600000)
        .get();
http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/
Hi. I have already implemented the suggestion from cobr123, but it did not work. However, when I try to get the page on try.jsoup.org, it retrieves the complete HTML page. Do you have any suggestions?
inohtaf, can you post the URL which did not work?
Hi cobr123, for example http://kbbi.web.id/mempelajari; I have experienced this on other sites before too. On that site I would like to retrieve:
Element content = doc.select("div.content").first();
Element desc = content.select("div#desc").first();
Element descDetail = desc.select("div#d1").first();
Unfortunately the content of "div#d1" cannot be found. But I am pretty sure it should work with Jsoup, since it can be retrieved perfectly on try.jsoup.org. Hope for your suggestions :)
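Each select(...).first() in a chain like the one above returns null when nothing matches, so a single missing level throws a NullPointerException before you can see which selector failed. A defensive sketch (the inline HTML is a made-up stand-in for the real page, with the thread's selectors):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeSelect {
    public static void main(String[] args) {
        // Stand-in for the fetched page; the real page structure may differ
        Document doc = Jsoup.parse(
                "<div class=content><div id=desc><div id=d1>ajar</div></div></div>");
        // Check each level for null before chaining further
        Element content = doc.select("div.content").first();
        Element desc = content == null ? null : content.select("div#desc").first();
        Element d1 = desc == null ? null : desc.select("div#d1").first();
        System.out.println(d1 == null ? "div#d1 not found" : d1.text());
    }
}
```

This narrows a failure down to the first selector that returns nothing, instead of crashing on the next call.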
this works well
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        String url = "http://kbbi.web.id/mempelajari";
        Document doc = Jsoup.connect(url)
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .get();
        Element content = doc.select("div.content > div#desc").first();
        Element desc = content.select("div#desc").first();
        Element descDetail = desc.select("div#d1").first();
        System.out.println(descDetail);
    }
}
and prints:
<div id="d1">
<div id="info"></div>
<b>ajar</b>
..
~100 lines of text
..
<b>~ mikro</b> teknik pelatihan mengajar yang jumlah muridnya dibatasi, misalnya 5—10 orang;
<b>~ remedial</b> pengajaran yang diberikan khusus untuk memperbaiki kesulitan belajar yang dialami murid
</div>
Hi cobr123. Yes, it works well. It seems I made a mistake in my code yesterday. Btw, thank you very much for your help :)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.get();
This does not work; I have tested it with SoundCloud and it returns null or an error.
Give an example of a selector. SoundCloud auto-loads page content when scrolling.
Well, I got it working, but only after converting the website to a txt file, as the output is absolutely different from what you see in inspect element... The HTML I received in the txt file is now used to grab data from the URL directly, which partly works for me :)
Document document = Jsoup.connect("https://www.instagram.com/p/BRdRSJtgABK/")
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)
        .timeout(600000)
        .get();
Hi @cobr123 @jhy, I used this configuration but it still doesn't work. I want to select the src attr of img tags, but only the src from script tags gives me a result. When I print the HTML using .html(), it doesn't give me the whole page. Please give me some advice on how to get the whole HTML.
.header gives "cannot resolve" in Android Studio; what should I do?
I saved stackoverflow.com into a file, input.html, and load it like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8");
// first query
Elements resultsA = doc.select("h3 > a");
// second query
Elements resultsB = doc.select("div.nav li a");
"resultsA" has no elements while "resultsB" contains 6 found elements. Wondering about this, I extracted the HTML content from the "doc" variable; wow, it contains just part of the HTML, where the "resultsB" content can be found, but not the content for "resultsA".
I've tried parsing several URLs (even google.com), and it all results in the same way: Jsoup.parse() doesn't return the whole HTML content.