wakori / googlesitemapgenerator

Automatically exported from code.google.com/p/googlesitemapgenerator

Erroneous handling of response codes #13

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Enable Webserver filter
2. Access a file with your browser that does not exist
3. Generate sitemap

What is the expected output? What do you see instead?
Sitemap SHOULD be unchanged. The requested file which could not be found
MUST NOT show up in the sitemap.

What version of the product are you using? On what operating system?
sitemap_linux-x86_64-beta1.tar.gz
Debian etch with Apache 2.2

Please provide any additional information below.
This error also occurs on rewrites (301/302). Both the old and the new
resource show up in the sitemap.

Original issue reported on code.google.com by laurens....@gmail.com on 4 Feb 2009 at 10:24

GoogleCodeExporter commented 8 years ago
What rewrite mechanism (or module) is used?

What do you mean by "the old" and "the new resources"?

Thanks!

Original comment by ma...@google.com on 16 Feb 2009 at 11:03

GoogleCodeExporter commented 8 years ago
I'm using mod_rewrite on Apache/2.2.3 64-bit.

old = resource being requested
new = resource being redirected to (using 30x)

Example:
http://www.example.com/old.html is redirected to http://www.example.com/new.html
using a 301 HTTP response. Both resources show up in the current version, but
only the new URL should, to avoid unnecessary redirects for crawlers.

Thanks in advance!

Original comment by laurens....@gmail.com on 16 Feb 2009 at 4:00

GoogleCodeExporter commented 8 years ago
I see. Thanks.
What you describe is the expected behavior.
There may be a bug in the code.
We'll try to reproduce it.

Original comment by opensour...@gtempaccount.com on 17 Feb 2009 at 2:38

GoogleCodeExporter commented 8 years ago
In our tests:
1) A nonexistent URL that returns a 404 status does not appear in the sitemap.

2) For a 301/302 redirect, the old URL is excluded and the new URL is included.

Note that the webserver filter only gets URL information when there is a request
to your HTTP server. Simply removing web pages from your disk doesn't work.
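
The two test results above amount to a simple rule on the response status. Here is a minimal sketch of that rule, not the generator's actual code: the record layout (requested URL plus status code) and the function name `urls_for_sitemap` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical record type: (requested URL, HTTP status code of the response).
Request = Tuple[str, int]


def urls_for_sitemap(requests: Iterable[Request]) -> List[str]:
    """Return URLs that were answered with a 2xx status, in first-seen order."""
    seen = set()
    result = []
    for url, status in requests:
        # 404s and 301/302 redirects fall outside the 2xx range and are skipped.
        if 200 <= status < 300 and url not in seen:
            seen.add(url)
            result.append(url)
    return result


log = [
    ("http://www.example.com/old.html", 301),      # redirect: old URL excluded
    ("http://www.example.com/new.html", 200),      # redirect target: included
    ("http://www.example.com/missing.html", 404),  # not found: excluded
]
print(urls_for_sitemap(log))
```

Under this rule the redirect target only appears once it has itself been requested and answered with a 200.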

Original comment by ma...@google.com on 25 Feb 2009 at 5:33

GoogleCodeExporter commented 8 years ago
We built a CMS in PHP where our friendly links (slugs?) are paths that do not
exist on the server and cannot be found by Apache. Our CMS's custom error handler
catches the 404 error and determines whether the friendly link actually exists in
the CMS. If it does, it returns an HTTP 200 OK and loads the page's content
without doing a redirect. Unfortunately, Google Sitemap Generator seems to be
seeing the 404 error before the CMS can correct it. Basically, I would need our
CMS's error handler to process the 404 before GSMG processes it.
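
A minimal sketch of the ordering problem as I understand it, under the assumption that the filter looks at the status before our error handler has run. All names here (`initial_status`, `final_status`, the sample paths) are hypothetical and only model the behavior; this is not code from the CMS or from GSMG.

```python
def initial_status(path, files_on_disk):
    """Status the web server computes before any custom error handler runs."""
    return 200 if path in files_on_disk else 404


def final_status(path, files_on_disk, cms_slugs):
    """Status actually sent to the client after the CMS error handler runs."""
    status = initial_status(path, files_on_disk)
    if status == 404 and path in cms_slugs:
        # The handler recognizes the friendly link and serves it with 200 OK.
        return 200
    return status


files_on_disk = {"/index.php"}
cms_slugs = {"/website_development_experience_since_1995"}
slug = "/website_development_experience_since_1995"

# A filter keyed on the initial status would drop the friendly link,
# while one keyed on the final status would keep it.
print(initial_status(slug, files_on_disk))           # 404
print(final_status(slug, files_on_disk, cms_slugs))  # 200
```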

Example of a friendly link:

http://www.newfangled.com/website_development_experience_since_1995

Please let me know if you need any further information,

thanks!

Original comment by mike.p.b...@gmail.com on 6 Apr 2009 at 6:33

GoogleCodeExporter commented 8 years ago
URLs with a 404 status are included in the sitemap. This really needs to be
fixed, or the sitemap fills up with old pages.

Is this mod still being maintained by anyone at Google?

Original comment by pastordanwalker@gmail.com on 20 Nov 2010 at 4:35