guorouda / crawler4j

Automatically exported from code.google.com/p/crawler4j

Meta refresh does not work correctly? #225

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Add the website smeagol.nl (which redirects to smeagol.nl/home.nl).
2. Start the crawler.
3. The crawler does not visit all the pages.

What is the expected output? What do you see instead?
The crawler should follow the redirect given in the meta element, and all 
pages should be visited.

What version of the product are you using?
3.5

Please provide any additional information below.

I am using Tika version 1.3.
It seems that when you replace the following lines
     (HtmlContentHandler.java, line 118):

     String equiv = attributes.getValue("http-equiv");
     String content = attributes.getValue("content");
     if (equiv != null && content != null) {
         equiv = equiv.toLowerCase();

with:

     String equiv = attributes.getValue("name");
     String content = attributes.getValue("content");
     if (equiv != null && content != null) {
         equiv = equiv.toLowerCase();

the problem is resolved.
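Swapping "http-equiv" for "name" would presumably stop the handler from seeing pages that declare the refresh the standard way. A minimal sketch of a lookup that accepts either attribute is below. It is only an illustration, not the code in crawler4j's HtmlContentHandler: the class MetaRefreshSketch and the methods refreshTarget and attrs are hypothetical names, and "home.html" is a made-up target.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class MetaRefreshSketch {

    // Matches the "url=..." part of a refresh content value such as "0; url=home.html".
    private static final Pattern URL_PART =
            Pattern.compile("url\\s*=\\s*(.+)", Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target of a meta refresh element, or null if there is none. */
    public static String refreshTarget(Attributes attributes) {
        // Prefer the standard http-equiv attribute; fall back to name,
        // which is the attribute the workaround in this report reads.
        String key = attributes.getValue("http-equiv");
        if (key == null) {
            key = attributes.getValue("name");
        }
        String content = attributes.getValue("content");
        if (key == null || content == null) {
            return null;
        }
        if (!key.trim().equalsIgnoreCase("refresh")) {
            return null;
        }
        Matcher m = URL_PART.matcher(content);
        if (!m.find()) {
            return null;
        }
        String target = m.group(1).trim();
        return target.isEmpty() ? null : target;
    }

    /** Hypothetical helper: builds the attributes of a meta element for the demo below. */
    public static Attributes attrs(String attrName, String attrValue, String content) {
        AttributesImpl a = new AttributesImpl();
        a.addAttribute("", attrName, attrName, "CDATA", attrValue);
        a.addAttribute("", "content", "content", "CDATA", content);
        return a;
    }

    public static void main(String[] args) {
        System.out.println(refreshTarget(attrs("http-equiv", "refresh", "0; url=home.html")));
        System.out.println(refreshTarget(attrs("name", "Refresh", "0; URL=home.html")));
        System.out.println(refreshTarget(attrs("name", "description", "a page")));
    }
}
```

With this approach both spellings resolve to the same target, and an unrelated meta element such as name="description" is ignored.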

Original issue reported on code.google.com by zwart.sj...@gmail.com on 21 Jun 2013 at 8:56

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:40

GoogleCodeExporter commented 9 years ago
Which pages exactly aren't seen?

Can you please post one example of a page that should be crawled but isn't?

Original comment by avrah...@gmail.com on 26 Aug 2014 at 12:10

GoogleCodeExporter commented 9 years ago
Oh, the main page is a good example.

Original comment by avrah...@gmail.com on 26 Aug 2014 at 1:17

GoogleCodeExporter commented 9 years ago
For reference, it seems that Mackd tries to solve the same problem:
https://code.google.com/r/mackd2-tweaks/source/detail?r=9f553c635c910df1d43f67c64700e7a4694524f7

Original comment by avrah...@gmail.com on 26 Aug 2014 at 1:22

GoogleCodeExporter commented 9 years ago
Fixed in revision: b9a8b5cadaed 

Original comment by avrah...@gmail.com on 1 Sep 2014 at 7:30