lgraubner / sitemap-generator

Easily create XML sitemaps for your website.
MIT License

robots.txt not parsed on https redirect #2

Closed jhs-s closed 7 years ago

jhs-s commented 8 years ago

The robots.txt parser doesn't work in these cases:

It works if you provide https://example.com and port 443. I think this should either be fixed or added to the docs.

Thanks!

ayenzky commented 8 years ago

How did you fix this issue? I'm also experiencing it when my URL is https.

lgraubner commented 8 years ago

I will look into it in the next few days. The port can default to 443 if https is part of the provided URL. The first case should work. If a redirect to https is set up, the crawler should follow it and proceed crawling, but I will look into this as well.

jhs-s commented 8 years ago

@ayenzky This works:

var generator = new SitemapGenerator("https://example.com", {port: 443});

I had a quick look at the code, and it seems that the robots parser is not given port 443 by default when the protocol is https. I didn't look any deeper, as providing the port worked.

lgraubner commented 8 years ago

Fixed in 97a4622. The port is now set accordingly if https is provided. A user-provided port is still respected.

Please check whether this solves your problem.
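
For readers following along, the behaviour described above (a user-provided port wins, otherwise the protocol decides) could be sketched roughly as follows. This is not the code from commit 97a4622; `resolvePort` and its `options` argument are hypothetical names used only for illustration.

```javascript
// Hypothetical helper, NOT the library's actual implementation.
const { URL } = require("url");

function resolvePort(siteUrl, options = {}) {
  // A user-provided port always takes precedence.
  if (options.port !== undefined) {
    return Number(options.port);
  }

  const parsed = new URL(siteUrl);

  // Next, honour a non-default port written in the URL itself.
  // (The URL parser reports "" when the port is the protocol default.)
  if (parsed.port !== "") {
    return Number(parsed.port);
  }

  // Otherwise fall back to the protocol default.
  return parsed.protocol === "https:" ? 443 : 80;
}
```

With such a helper, `resolvePort("https://example.com")` yields 443 without the caller passing `{port: 443}`, while `resolvePort("https://example.com", {port: 8443})` keeps the explicit value.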

ayenzky commented 8 years ago

@lgraubner, @jhs-s

Thanks for the immediate response.

I tested the new update, and this is the output I get:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://lanarkshirechamber.co.uk:443/</loc>
  </url>
</urlset>

lgraubner commented 8 years ago

You are right. As far as I can see, the crawler adds any port other than 80 to the URL. As the port problem concerns the robots parser, I now set the port only for that part. I published version 4.1.1, please test again. Thanks!
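
One way to keep `:443` out of generated URLs is to omit the port whenever it matches the protocol's default. The sketch below only illustrates that idea; `formatOrigin` is a hypothetical function and does not reflect how the crawler or version 4.1.1 actually builds URLs.

```javascript
// Hypothetical illustration, NOT taken from sitemap-generator or its crawler.
function formatOrigin(protocol, hostname, port) {
  // Default ports per protocol; these never need to appear in a URL.
  const defaults = { http: 80, https: 443 };

  const isDefault =
    port === undefined || defaults[protocol] === Number(port);

  // Append ":port" only when it differs from the protocol default.
  return protocol + "://" + hostname + (isDefault ? "" : ":" + port);
}
```

Under this rule, `formatOrigin("https", "lanarkshirechamber.co.uk", 443)` gives `https://lanarkshirechamber.co.uk`, while a non-default port such as 8443 is preserved.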

ayenzky commented 8 years ago

@lgraubner

It's still the same, but sometimes it returns "null".

lgraubner commented 8 years ago

Are you sure you have the latest version? sitemap-generator -V should return 4.1.1. The port problem should be fixed. On some sites with https it returns null because the crawler can't find the site. See #4. I'm not quite sure what the problem is, as it works fine on some sites. Maybe the underlying package simplecrawler has problems with some certificates.

ayenzky commented 8 years ago

@lgraubner

Yes, I have the latest version. OK, so maybe the problem is in the simplecrawler package. Anyway, thanks for the help. :)

shaicoleman commented 8 years ago

This doesn't work (it adds :80 to the request host):

var generator = new SitemapGenerator("https://forum.colemak.com");

This works:

var generator = new SitemapGenerator("https://forum.colemak.com", {port: 443});

ayenzky commented 8 years ago

I guess in my case it doesn't work because I'm forcing SSL, with http redirected to https.

lgraubner commented 7 years ago

There were some problems with specifying the correct port. Version 5 fixed this. Make sure to specify the port directly in the URL and no longer as an option.