Closed GoogleCodeExporter closed 9 years ago
Can you provide a URL that contains the things you want to scrape?
Original comment by avrah...@gmail.com
on 17 Dec 2014 at 10:26
Thanks for the reply, avrah! Here is my aim: I want to crawl the domain
"http://www.sakshi.com" and extract all the iframe codes, base64 codes, etc.,
whenever they are present. I am quite sure that this domain contains iframes,
but I am not sure about the rest (base64, embed codes).
Original comment by yenumula...@gmail.com
on 17 Dec 2014 at 12:06
With the way you are trying to do it, it seems the iframe URLs will be taken
and added to the page's list of URLs. That seems OK, but I am not sure it is
what you want.
I think the best way for you to do it (if I understand your requirement) is to
use the visit() method, where you can get the HTML code of every visited page,
and extract the iframe code from the HTML string.
Does this help?
Original comment by avrah...@gmail.com
on 17 Dec 2014 at 12:26
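As an illustration of the suggestion above, here is a minimal sketch of pulling iframe blocks out of an HTML string using only the Java standard library. In crawler4j the string would typically come from inside visit(), via ((HtmlParseData) page.getParseData()).getHtml(). The class name IframeExtractor, the method extractIframes and the sample HTML are made up for this example. A regex is only a rough approach and can miss unusual markup, such as a '>' inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j.
class IframeExtractor {
    // Matches <iframe ...>...</iframe>, ignoring case and spanning line breaks.
    private static final Pattern IFRAME = Pattern.compile(
            "<iframe\\b[^>]*>.*?</iframe\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every iframe block found in the given HTML, in document order.
    static List<String> extractIframes(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IFRAME.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample HTML standing in for the string obtained inside visit().
        String html = "<html><body><p>news</p>"
                + "<iframe src=\"http://example.com/ad\" width=\"300\"></iframe>"
                + "<IFRAME src=\"http://example.com/video\">fallback</IFRAME>"
                + "</body></html>";
        for (String iframe : extractIframes(html)) {
            System.out.println(iframe);
        }
    }
}
```

The returned strings can then be written to a text file, one per line, as in the attached approach.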
Exactly! Extracting iframes from the HTML string is what I tried before
posting the issue, and I have attached the code that extracts the iframes and
saves the iframe code into a text file. The problem is this: I know an iframe
starts with an <iframe tag and ends with an </iframe> tag, but for base64 code,
VB scripts and embed codes I do not understand how they start and end in the
HTML. That is why I am trying the HtmlContentHandler class. Can you please help
with that?
Original comment by yenumula...@gmail.com
on 17 Dec 2014 at 12:44
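On how these start and end: as a rough guide, base64 content inside HTML usually shows up as a data URI of the form data:<mediatype>;base64,<payload>, an <embed> is a single tag with no closing tag, and VBScript normally sits in a <script> block whose opening tag names vbscript in its type or language attribute. The sketch below searches an HTML string for those shapes with the Java standard library. The class name EmbeddedContentFinder, the patterns and the sample HTML are assumptions for illustration only, and raw base64 strings that are not wrapped in a data URI would not be caught.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the patterns are heuristics, not a full HTML parser.
class EmbeddedContentFinder {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Base64 payloads inlined as data URIs, e.g. data:image/png;base64,....
    static final Pattern BASE64_DATA_URI = Pattern.compile(
            "data:[\\w.+/-]*(?:;[\\w=.-]+)*?;base64,[A-Za-z0-9+/]+={0,2}", FLAGS);

    // <embed> is a void element: one tag, no closing tag.
    static final Pattern EMBED_TAG = Pattern.compile("<embed\\b[^>]*>", FLAGS);

    // <object ...>...</object> blocks, often used alongside <embed>.
    static final Pattern OBJECT_TAG = Pattern.compile(
            "<object\\b[^>]*>.*?</object\\s*>", FLAGS);

    // <script> blocks whose opening tag mentions vbscript.
    static final Pattern VBSCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*vbscript[^>]*>.*?</script\\s*>", FLAGS);

    // Returns every match of the pattern in the HTML, in document order.
    static List<String> findAll(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample HTML made up for this example.
        String html = "<img src=\"data:image/png;base64,iVBORw0KGgo=\">"
                + "<embed src=\"movie.swf\" type=\"application/x-shockwave-flash\">"
                + "<script language=\"VBScript\">MsgBox 1</script>";
        System.out.println(findAll(BASE64_DATA_URI, html));
        System.out.println(findAll(EMBED_TAG, html));
        System.out.println(findAll(OBJECT_TAG, html));
        System.out.println(findAll(VBSCRIPT_BLOCK, html));
    }
}
```

Without a real page that contains such code, these patterns are untested against live markup, so they should be checked against an actual example before being relied on.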
To parse iframes, see these:
http://stackoverflow.com/questions/13646163/how-to-get-body-holding-the-content-of-iframe-in-java
http://stackoverflow.com/questions/26515383/jsoup-not-parsing-iframe-out-of-html
To parse anything else I need a solid example scenario: give me a URL that
contains that code and I will see how to parse it.
Without an example you can't even check whether it works.
Original comment by avrah...@gmail.com
on 17 Dec 2014 at 12:49
Invalid, as the discussion stopped and the need is probably gone.
Original comment by avrah...@gmail.com
on 22 Jan 2015 at 11:42
Original issue reported on code.google.com by
yenumula...@gmail.com
on 17 Dec 2014 at 10:06