jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawl Duplicate URLs #275

Closed. GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Add two URLs to the seed list:
   http://www.example.com/WW/Sample.html
   http://www.example.com/ww/sample.html
2. Run crawler
3. Review Log

What is the expected output?
It should crawl only one URL, because both refer to the same page
What do you see instead?
It is crawling both URLs

What version of the product are you using?
3.5

Please provide any additional information below.
Both URLs lead to the same page (the server treats paths as case insensitive)

Original issue reported on code.google.com by edgar.ri...@gmail.com on 13 Aug 2014 at 2:35

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 13 Aug 2014 at 6:14

GoogleCodeExporter commented 8 years ago
OK, I tried your scenario, but the bug doesn't reproduce.

1. Your example is not usable, as your URLs (on example.com) return 404.
2. I used the following scenario to try to reproduce the bug:
controller.addSeed("http://www.w3.org/TR/WD-html40-970708/htmlweb.html");
controller.addSeed("http://www.w3.org/Tr/Wd-html40-970708/hTmlweb.html");

As you can see, the second URL differs from the first only in the case of 
several letters.

The URL is crawled only once; the second one is skipped (it doesn't reach the 
"visit" method a second time).

Please send me a specific scenario (with real URLs that show the problem). I 
will run and test them, because in my own attempt the bug didn't reproduce.

As a side note: we should reconsider the crawler's functionality, as the DNS 
name is case insensitive BUT the path should actually be case sensitive.
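
To illustrate the side note, here is a minimal sketch of a normalization that lowercases only the scheme and host and leaves the path untouched. It is not crawler4j's own code: the class name UrlCaseNormalizer and the method normalize are hypothetical, and it uses only java.net.URI from the standard library.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of crawler4j.
public class UrlCaseNormalizer {

    // Lowercases the scheme and host, which are case insensitive,
    // and copies user info, port, path, query and fragment unchanged.
    public static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url; // not an absolute URL with a host; leave it alone
        }
        StringBuilder sb = new StringBuilder();
        sb.append(scheme.toLowerCase(Locale.ROOT)).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host.toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath()); // path case is preserved
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Under this rule the two seeds from the original report would still count as two different URLs, because they differ only in the path.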

Original comment by avrah...@gmail.com on 13 Aug 2014 at 7:45

GoogleCodeExporter commented 8 years ago
Hi,
In your example both URLs 
http://www.w3.org/TR/WD-html40-970708/htmlweb.html
http://www.w3.org/Tr/Wd-html40-970708/hTmlweb.html

Redirect to 
http://www.w3.org/TR/WD-html40-970708/htmlweb.html

But in my case, 
http://www.example.com/WW/Sample.html
http://www.example.com/ww/sample.html

don't redirect to a single canonical URL :( Each one stays at the URL that was 
requested, so the response URL is the same as the request URL.

Original comment by edgar.ri...@gmail.com on 19 Aug 2014 at 5:22

GoogleCodeExporter commented 8 years ago
The URLs that I'm working with are hosted on a SharePoint site

Original comment by edgar.ri...@gmail.com on 19 Aug 2014 at 5:24

GoogleCodeExporter commented 8 years ago
After some investigation, it seems the problem is that the URL Rewrite 
functionality is not available on these IIS servers. I don't know what would 
be the best way to fix this issue :(
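
One possible client-side workaround, sketched under assumptions: keep a list of hosts that are known to ignore path case (such as the SharePoint/IIS server here) and de-duplicate their URLs on a lowercased key. The class CaseInsensitiveSeenFilter and its methods are hypothetical and not part of crawler4j; how to wire such a check into the crawler (for example from a shouldVisit override, or before adding seeds) is not shown and would need to be verified.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of crawler4j.
public class CaseInsensitiveSeenFilter {

    // Hosts whose paths are assumed to be case insensitive.
    private final Set<String> caseInsensitiveHosts = new HashSet<>();
    // Concurrent set, because crawlers usually run several threads.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    public CaseInsensitiveSeenFilter(String... hosts) {
        for (String host : hosts) {
            caseInsensitiveHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    // For listed hosts the whole URL is lowercased (an assumption: the
    // query string is treated as case insensitive too); other URLs are
    // used as the key unchanged.
    String keyFor(String url) {
        String host = URI.create(url).getHost();
        if (host != null
                && caseInsensitiveHosts.contains(host.toLowerCase(Locale.ROOT))) {
            return url.toLowerCase(Locale.ROOT);
        }
        return url;
    }

    // Returns true only the first time a given key is seen.
    public boolean firstTimeSeen(String url) {
        return seenKeys.add(keyFor(url));
    }
}
```

With "www.example.com" listed, the first seed from the report would pass and the second would be rejected as a duplicate, while URLs on unlisted hosts would still be compared case sensitively.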

Original comment by edgar.ri...@gmail.com on 19 Aug 2014 at 7:22

GoogleCodeExporter commented 8 years ago
Sorry mate, I don't think I can help you in this scenario...

Original comment by avrah...@gmail.com on 20 Aug 2014 at 8:31

GoogleCodeExporter commented 8 years ago
Thanks, it sounds like that is expected behavior.

Original comment by edgar.ri...@gmail.com on 21 Aug 2014 at 3:31

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 21 Aug 2014 at 8:39