kjam / wswp

Code for the second edition Web Scraping with Python book by Packt Publications

Chapter 1 robots.txt not available, or sitemap.xml #6

Open jadegrave opened 6 years ago

jadegrave commented 6 years ago

Hi. I'm a beginner with rusty Python skills, but I think there may be an issue here: these links don't display the text indicated in the book; they just display the home page:

Apologies if this is already on the errata page. I hadn't checked it yet... Just checked, and I couldn't find the book listed there to post it.

nile649 commented 6 years ago

A sitemap is a file provided by the site owner that lists all of the site's links in one place. Owners may or may not provide one, and that section of the book uses specific examples to show that some websites are simple enough to keep their data organized by ID. You should use only the links given in the book's examples.
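As a rough sketch of the "organized by ID" idea (the URL pattern below is hypothetical, not the book's site), a crawler can simply count through numeric IDs until pages stop resolving:

    import itertools
    import urllib.error
    import urllib.request

    def crawl_by_id(url_template, max_errors=5):
        """Iterate numeric IDs, stopping after several consecutive misses."""
        errors = 0
        for page_id in itertools.count(1):
            url = url_template.format(page_id)
            try:
                html = urllib.request.urlopen(url).read()
            except urllib.error.HTTPError:
                errors += 1
                if errors == max_errors:
                    break  # assume we have run past the last valid ID
            else:
                errors = 0
                print('Downloaded', url, len(html), 'bytes')

    # Hypothetical URL pattern -- substitute a site that really uses numeric IDs.
    # crawl_by_id('http://example.com/view/{}')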

jadegrave commented 6 years ago

Wow! Thank you for the quick response.

For clarification, those are the links provided in the book… I'm using the version that is available through Safari Online Bookshelf.

Thanks,

Jodi


nile649 commented 6 years ago

Use this website for a sitemap example:

https://webscraping.com/sitemap.xml
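A minimal sketch of pulling the links out of that sitemap with a simple regex (the book parses sitemaps in much the same way; a proper XML parser also works):

    import re
    import urllib.request

    def links_from_sitemap(sitemap_url):
        # Download the sitemap and extract the URLs inside its <loc> tags.
        xml = urllib.request.urlopen(sitemap_url).read().decode('utf-8')
        return re.findall(r'<loc>(.*?)</loc>', xml)

    for link in links_from_sitemap('https://webscraping.com/sitemap.xml'):
        print(link)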

jadegrave commented 6 years ago

Thank you! Very helpful.

Do you have one for the robots.txt?


jadegrave commented 6 years ago

I see that https://webscraping.com has a robots.txt, but it doesn't have the 'Bad crawler' and other items in it.

kjam commented 6 years ago

Hi all,

Unfortunately, not all links are still available, and I am working with the original author to get the site back into shape. In the meantime, this robots.txt is not as described in the book. Sorry about that, and I hope you still enjoy working through the other examples!
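For reference, a minimal sketch of checking permissions against whatever robots.txt the site currently serves, using Python's built-in urllib.robotparser:

    import urllib.robotparser

    # Parse whatever robots.txt the site currently serves.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('https://webscraping.com/robots.txt')
    rp.read()

    # Ask whether a given user agent may fetch a given URL.
    print(rp.can_fetch('BadCrawler', 'https://webscraping.com/'))
    print(rp.can_fetch('GoodCrawler', 'https://webscraping.com/'))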

-katharine


MayaMalkoti commented 4 years ago

The sample sitemap I am trying to scrape has .gz file extensions in it. How do I deal with such file types?
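Sitemap files ending in .gz are usually just gzip-compressed XML, so one approach is to decompress them in memory before parsing. A minimal sketch, assuming standard gzip compression (the URL is hypothetical):

    import gzip
    import re
    import urllib.request

    def links_from_gzipped_sitemap(sitemap_url):
        # Download a .gz sitemap, decompress it, and extract its <loc> URLs.
        compressed = urllib.request.urlopen(sitemap_url).read()
        xml = gzip.decompress(compressed).decode('utf-8')
        return re.findall(r'<loc>(.*?)</loc>', xml)

    # Hypothetical URL -- substitute the actual .gz sitemap you are scraping.
    # links_from_gzipped_sitemap('https://example.com/sitemap-1.xml.gz')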