momzi / crawler4j

Automatically exported from code.google.com/p/crawler4j

Upgrade the Pattern constant on the crawler examples #301

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
In the crawler examples, the "shouldVisit" method uses a Pattern to avoid 
visiting URLs with certain extensions.

This is a great way to keep the crawler from visiting irrelevant pages.

This pattern can be improved further in two ways:
1. Add more extensions.
2. Allow each of the existing extensions to be followed by an optional question 
mark and the characters after it (a query string).

This way, URLs such as the following will also be filtered:
http://example.com/avi.mp3?key=value....
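
A minimal sketch of what such a pattern could look like, assuming a shortened, illustrative extension list (not the exact list from the crawler examples) and a hypothetical helper class named BinaryFileFilter. The trailing `(\\?.*)?` group is one way to accept an optional question mark followed by a query string; it uses "zero or more" characters after the question mark, which is slightly more permissive than the wording above.

```java
import java.util.regex.Pattern;

// Hypothetical standalone helper; in a real crawler this logic would sit in shouldVisit.
class BinaryFileFilter {

    // Illustrative extension list (an assumption, not the list from the examples).
    // "(\\?.*)?" allows an optional "?" plus query string after the extension.
    private static final Pattern BINARY_FILES = Pattern.compile(
            ".*\\.(css|js|bmp|gif|jpe?g|png|ico|tiff?|mp3|mp4|wav|avi|mov|mpeg|pdf|zip|rar|gz)"
            + "(\\?.*)?$");

    // Returns true when the URL should be crawled, false when it looks like a binary file.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return !BINARY_FILES.matcher(href).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://example.com/avi.mp3?key=value")); // false
        System.out.println(shouldVisit("http://example.com/logo.ico"));          // false
        System.out.println(shouldVisit("http://example.com/index.html"));        // true
    }
}
```

Note that a pattern of this shape also filters URLs where the extension appears only inside the query string (for example a parameter value ending in ".png"), which may or may not be desirable.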

Original issue reported on code.google.com by avrah...@gmail.com on 2 Sep 2014 at 11:57

GoogleCodeExporter commented 9 years ago
Add the ico extension.

Original comment by avrah...@gmail.com on 2 Sep 2014 at 12:04

GoogleCodeExporter commented 9 years ago
Rename FILTERS to BINARY_FILES

Original comment by avrah...@gmail.com on 23 Sep 2014 at 11:17

GoogleCodeExporter commented 9 years ago
Fixed in revision: 028dd360a054 

Original comment by avrah...@gmail.com on 5 Dec 2014 at 10:20