Crawling over disallowed paths from robots.txt

What steps will reproduce the problem?
1. Only crawl pages with prefix http://fano.ics.uci.edu/
2. have robotstxtConfig enabled
3. crawl from seed http://fano.ics.uci.edu/

What is the expected output? What do you see instead?
# fano.ics.uci.edu

User-Agent: *
Disallow: /ca/rules/

Should not be crawling /ca/rules/

These are being crawled
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g1.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g2.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g3.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g4.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g5.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g6.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g7.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g8.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g9.html

What version of the product are you using?

3.5

Please provide any additional information below.

Original issue reported on code.google.com by Dave.Hir...@gmail.com on 21 Jan 2015 at 9:06

pgyami / crawler4j

Crawling over disallowed paths from robots.txt #334