Closed xjaphx closed 11 years ago
Could you please create a gist with the input.html file you're using?
Here you are: https://gist.github.com/4630682 Oh, one thing I forgot to mention: I'm testing on Android when parsing this data.
I've tried reproducing this, with no issues. "h3 > a" returns 89 hits, "div.nav li a" gives 5. The parse tree looks fine.
Can you show us doc.html() and compare that to the html() produced by fetching from a Jsoup.connect(url).get() ? I'm trying to ID if the issue is with how you've fetched the content, or saved it, or the file load.
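A minimal sketch of that comparison (the file name and URL here are assumptions based on the thread; the selector is the one from the original report):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class CompareSources {
    public static void main(String[] args) throws IOException {
        // Parse the locally saved copy of the page
        Document fromFile = Jsoup.parse(new File("input.html"), "UTF-8");
        // Fetch the same page live
        Document fromWeb = Jsoup.connect("https://stackoverflow.com/").get();

        // If the selector counts differ, the two sources hold different HTML,
        // which points at the fetch or the save step rather than the parser
        System.out.println("file: " + fromFile.select("h3 > a").size()
                + ", web: " + fromWeb.select("h3 > a").size());
    }
}
```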
Hi John, thanks for the reply. In order to ID the issue, I think it's better to start over from where it begins.
I created two projects: one Java, the other Android. The parsing functions are the same, but the outputs are not.
I've created a sample repo you might want to check out: https://github.com/xjaphx/JSoupSample
OK. Thanks. The issue is that you are not specifying a user-agent when you fetch the URL, and so you are sending default user-agents from Java and from Android. And StackOverflow is sending you different HTML responses in return, one for desktops / crawlers (Java), and one for mobile agents (Android). The mobile version doesn't have anything that matches your selector.
I suggest using your browser's UA and setting it with http://jsoup.org/apidocs/org/jsoup/Connection.html#userAgent(java.lang.String)
Also you might like to use a debugging proxy like http://www.charlesproxy.com/ to watch HTTP traffic your apps are making.
Please give it a go and let me know your results.
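The suggestion above can be sketched like this (the UA string is just an example desktop browser string, and the timeout value is an arbitrary choice):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class FetchWithUa {
    public static void main(String[] args) throws IOException {
        // Send a desktop browser user-agent so the server returns
        // the desktop HTML instead of a mobile variant
        Document doc = Jsoup.connect("https://stackoverflow.com/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36")
                .timeout(10000) // 10s; set an explicit timeout as suggested
                .get();
        System.out.println(doc.select("h3 > a").size());
    }
}
```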
Oh wow, that's right. I'd never thought of it. After setting the User-Agent in Jsoup before parsing, the responses match. I've checked the documentation, but this isn't mentioned anywhere. It would be nice if you could add a note about this under Parser or Jsoup.connect().
Problem solved! Thanks John.
Cool -- glad we found it. Yep I'll mention in the .connect() method that it's a good idea to set the UA and a timeout. It might be a good idea to create a default UA based on a desktop browser.
I want to show text and a URL in a ListView, parsed using jsoup. Please help; I've tried a lot but haven't succeeded yet. Here is the link to my code on Stack Overflow: http://stackoverflow.com/questions/15307970/listview-of-jsoup-parsed-data-in-android
Hi. I am using Jsoup to parse a URL with the .connect(), timeout() and userAgent() methods, but I am still not able to fetch the entire page; some tags are missing.
Try:
Document doc = Jsoup.connect(url)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0) // no body size limit, so the full document is downloaded
        .timeout(600000)
        .get();
http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/
Hi. I have already implemented the suggestion from cobr123, but it did not work. However, when I try to get the page on try.jsoup.org, it retrieves the complete HTML page. Do you have any suggestions?
inohtaf, can you post the URL which did not work?
Hi cobr123, for example http://kbbi.web.id/mempelajari; I have experienced this on other sites before too. On that site I would like to retrieve:
Element content = doc.select("div.content").first();
Element desc = content.select("div#desc").first();
Element descDetail = desc.select("div#d1").first();
Unfortunately the content of "div#d1" cannot be found. But I am pretty sure it should work with Jsoup, since it can be retrieved perfectly on try.jsoup.org. Hope for your suggestions :)
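Each select(...).first() in a chain like the one above returns null when nothing matches, so a single missing level throws a NullPointerException before you can see which selector failed. A defensive sketch (the inline HTML is a made-up stand-in for the real page, with the thread's selectors):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeSelect {
    public static void main(String[] args) {
        // Stand-in for the fetched page; the real page structure may differ
        Document doc = Jsoup.parse(
                "<div class=content><div id=desc><div id=d1>ajar</div></div></div>");
        // Check each level for null before chaining further
        Element content = doc.select("div.content").first();
        Element desc = content == null ? null : content.select("div#desc").first();
        Element d1 = desc == null ? null : desc.select("div#d1").first();
        System.out.println(d1 == null ? "div#d1 not found" : d1.text());
    }
}
```

This narrows a failure down to the first selector that returns nothing, instead of crashing on the next call.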
this works well
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        String url = "http://kbbi.web.id/mempelajari";
        Document doc = Jsoup.connect(url)
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .get();
        Element content = doc.select("div.content > div#desc").first();
        Element desc = content.select("div#desc").first();
        Element descDetail = desc.select("div#d1").first();
        System.out.println(descDetail);
    }
}
and prints:
<div id="d1">
<div id="info"></div>
<b>ajar</b>
..
~100 lines of text
..
<b>~ mikro</b> teknik pelatihan mengajar yang jumlah muridnya dibatasi, misalnya 5—10 orang;
<b>~ remedial</b> pengajaran yang diberikan khusus untuk memperbaiki kesulitan belajar yang dialami murid
</div>
Hi cobr123. Yes, it works well. It seems I made a mistake in my code yesterday. Btw, thank you very much for your help :)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.get();
This does not work; I have tested it with SoundCloud and it returns null or an error.
Give an example of a selector. SoundCloud auto-loads page content when scrolling.
Well, I got it working, but only after converting the website to a txt file, as the output is absolutely different from what you see in inspect element... The HTML I received in the txt file is now used to grab data from the URL directly, which partly works for me :)
Document document = Jsoup.connect("https://www.instagram.com/p/BRdRSJtgABK/")
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)
        .timeout(600000)
        .get();
Hi @cobr123 @jhy, I used this configuration but it still doesn't work. I want to select the src attr of img tags, but only the src from script tags gives me a result. When I print the HTML using .html(), it doesn't give me the whole page. Please give me some advice on how to get the whole HTML.
.header gives "cannot resolve" in Android Studio; what should I do?
I saved stackoverflow.com into a file, input.html, and load it like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8");
// first query
Elements resultsA = doc.select("h3 > a");
// second query
Elements resultsB = doc.select("div.nav li a");
"resultsA" has no elements while "resultsB" contains 6 found elements. Wondering about this, I extracted the HTML content from the "doc" variable; wow, it contains just part of the HTML, where the "resultsB" content can be found, but not the content for "resultsA".
I've tried parsing several URLs (even google.com), and it all results in the same way: Jsoup.parse() doesn't return the whole HTML content.