Configuration to set what type of links to crawl - SCRIPT,LINK,IMG etc.,

mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Configuration to set what type of links to crawl - SCRIPT,LINK,IMG etc., #109

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Add a configuration option to set which types of links to crawl (SCRIPT, LINK, IMG, etc.).

What is the expected output? What do you see instead?
A comma-separated setting in CrawlConfig that restricts which link types the 
crawler follows. This functionality is useful and is available in crawlers like 
Nutch. It should be easy to implement, by the way.
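
To make the request concrete, here is a minimal sketch of how such a comma-separated value could be parsed and applied. `LinkTypeFilter` and its methods are hypothetical names used only for illustration; they are not part of CrawlConfig or any crawler4j release.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper (not a crawler4j class): models the requested setting.
public class LinkTypeFilter {
    private final Set<String> allowedTags;

    // Parses a value such as "SCRIPT,LINK,IMG" into a set of upper-case tag names.
    // Blank entries and surrounding whitespace are ignored.
    public LinkTypeFilter(String commaSeparated) {
        this.allowedTags = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toUpperCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // Returns true if links found in the given HTML element type should be crawled.
    public boolean allows(String tagName) {
        return allowedTags.contains(tagName.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        LinkTypeFilter filter = new LinkTypeFilter("A, IMG");
        System.out.println(filter.allows("img"));
        System.out.println(filter.allows("script"));
    }
}
```

A crawler could consult `allows(...)` for each element type before queuing the URLs it finds there.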

What version of the product are you using?

Please provide any additional information below.

Original issue reported on code.google.com by w3engine...@gmail.com on 19 Jan 2012 at 3:00

GoogleCodeExporter commented 9 years ago

This will be included in the next release.

-Yasser

Original comment by ganjisaffar@gmail.com on 19 Jan 2012 at 7:09

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

As of version 3.3, how do I crawl JavaScript files? 
Thanks for any response.

Original comment by wanxiang.xing@gmail.com on 12 Mar 2012 at 1:25

GoogleCodeExporter commented 9 years ago

I encountered the same problem. I found that the HTML parser (Apache Tika) does 
not extract the src attribute of script elements.

Is there an easy way to work around this?
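
One possible workaround, sketched under the assumption that the raw HTML of a fetched page is available as a string: scan it for script tags and pull out the src values yourself. `ScriptSrcExtractor` is a hypothetical helper, not something crawler4j or Tika provides. A regular expression is not a full HTML parser, and this one handles only quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a crawler4j or Tika class).
public class ScriptSrcExtractor {
    // Matches a <script ...> start tag with a single- or double-quoted src
    // attribute. Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the src values of all script elements, in document order.
    // Inline scripts without a src attribute are skipped.
    public static List<String> extractScriptSrcs(String html) {
        List<String> srcs = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            srcs.add(m.group(2));
        }
        return srcs;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<script type=\"text/javascript\" src=\"/js/app.js\"></script>"
                + "<script>var inline = 1;</script>"
                + "</head></html>";
        System.out.println(extractScriptSrcs(html));
    }
}
```

The extracted values may be relative, so they would still need to be resolved against the page URL before being scheduled for crawling.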

best wishes~

Original comment by cf.wfwei@gmail.com on 25 Apr 2012 at 5:19