tasfe / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawl JSON content instead of HTML #216

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Parser.java uses HtmlParser with HtmlContentHandler (line 83)
to extract outgoing links (href attributes).

This does not work for the RESTful API services that many sites offer (e.g., the
Facebook Graph API, where I'd like to crawl a user's friends), because they return JSON rather than HTML.

What is the expected output? What do you see instead?
Parser should expose a setContentHandler method so that a custom content handler can be plugged in.
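
A minimal sketch of what such a pluggable handler could look like. This is not crawler4j's actual API: `LinkExtractor`, `JsonLinkExtractor`, `SketchParser` and `setLinkExtractor` are hypothetical names introduced only for illustration, and the JSON sample is made up. The extractor uses a naive regex that treats any string value starting with http:// or https:// as an outgoing link; a real implementation would more likely use a proper JSON parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension point: turns a response body into outgoing URLs.
interface LinkExtractor {
    List<String> extractLinks(String body);
}

// Naive JSON strategy (assumption): every quoted string that looks like an
// absolute http(s) URL is treated as an outgoing link. Escaped slashes (\/)
// and relative URLs are not handled.
class JsonLinkExtractor implements LinkExtractor {
    private static final Pattern URL_VALUE =
            Pattern.compile("\"(https?://[^\"\\\\\\s]+)\"");

    @Override
    public List<String> extractLinks(String body) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_VALUE.matcher(body);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the URL without the quotes
        }
        return links;
    }
}

// Stand-in for a parser with a swappable handler (not the real Parser.java).
class SketchParser {
    // Default handler finds no links until a real one is set.
    private LinkExtractor extractor = body -> new ArrayList<>();

    void setLinkExtractor(LinkExtractor extractor) {
        this.extractor = extractor;
    }

    List<String> parse(String body) {
        return extractor.extractLinks(body);
    }
}

class JsonLinkDemo {
    public static void main(String[] args) {
        // Made-up response body, only loosely shaped like a REST search result.
        String json = "{\"items\":[{\"login\":\"a\","
                + "\"url\":\"https://api.github.com/users/a\"}]}";
        SketchParser parser = new SketchParser();
        parser.setLinkExtractor(new JsonLinkExtractor());
        System.out.println(parser.parse(json));
    }
}
```

The point of the setter is that the crawler core stays unchanged while the link-extraction strategy is chosen per content type.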

What version of the product are you using?
3.5

Please provide any additional information below.
Feel free to reach out.

Original issue reported on code.google.com by Kaminsky...@gmail.com on 27 Apr 2013 at 6:11

GoogleCodeExporter commented 9 years ago
I forgot to mark this as 'Enhancement' rather than 'Defect'.

Original comment by Kaminsky...@gmail.com on 27 Apr 2013 at 6:12

GoogleCodeExporter commented 9 years ago
Please provide an example URL so I can test it.

Original comment by avrah...@gmail.com on 11 Aug 2014 at 2:40

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:38

GoogleCodeExporter commented 9 years ago
I think I just fixed this issue in the latest commit.

But it would be best if you could supply me with an example.

I am closing this issue for now and will reopen it if the need arises.

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:09

GoogleCodeExporter commented 9 years ago
This was more than a year ago, my friend :)
Why not move to GitHub?

Original comment by Kaminsky...@gmail.com on 23 Sep 2014 at 4:40

GoogleCodeExporter commented 9 years ago
Well, better late than never :-)

We thought about GitHub, but ruled it out for now because of several minor 
technical issues - we might still move over eventually though.

Original comment by avrah...@gmail.com on 24 Sep 2014 at 8:48

GoogleCodeExporter commented 9 years ago
True that, Avi. Well, I pretty much described everything in the ticket,
so you can either test your solution and close the ticket or ask someone from
the QA team to do that.

Anyway, I've dropped my project that used c4j, so I can't do it for you.
See you around.

Original comment by Kaminsky...@gmail.com on 24 Sep 2014 at 9:09

GoogleCodeExporter commented 9 years ago
See you around.

And thank you.

Original comment by avrah...@gmail.com on 24 Sep 2014 at 9:28

GoogleCodeExporter commented 9 years ago
Here's a sample URL for a REST API: GET
https://api.github.com/search/users?q=Java. Let me know if you need more
information.

Original comment by davidak...@gmail.com on 25 Sep 2014 at 3:40

GoogleCodeExporter commented 9 years ago

Thank you David.

Original comment by nil...@gmail.com on 27 Sep 2014 at 8:21

GoogleCodeExporter commented 9 years ago
Re-opened the issue

Original comment by avrah...@gmail.com on 29 Sep 2014 at 1:50

GoogleCodeExporter commented 9 years ago
Checked, and it works now: my latest commits solve this one.

To crawl JSON, though, you need to enable crawling of binary content.

Revi: 0a9abddd67db

Original comment by avrah...@gmail.com on 2 Oct 2014 at 12:53