Open mkarpiarz opened 7 years ago
As I mentioned in https://github.com/niqdev/packtpub-crawler/issues/47#issuecomment-293101790, the newsletter parser gets the book title from the URL behind the cover image: https://github.com/niqdev/packtpub-crawler/blob/e604cc1138c5934f7cbe8c210ce1fa6f2caa80b3/script/packtpub.py#L101 This works fine if the link on the landing page points to the main book page, as was the case here: https://www.packtpub.com/packt/free-ebook/amazon-web-services-free
<a href="/networking-and-servers/mastering-aws-development"> <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering AWS Development.jpg" class="bookimage" /> </a>
but it yields unexpected results when the href points to something else, for example a cover image, as here: https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg"> <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" /> </a>
The latter results in the title computed at https://github.com/niqdev/packtpub-crawler/blob/e604cc1138c5934f7cbe8c210ce1fa6f2caa80b3/script/packtpub.py#L102 becoming '5612_Wyntkangular_Ebook_500X617.Jpg' instead of the correct title. A wrong title also breaks the filename under which the book is written to disk, making it '5612_Wyntkangular_Ebook_500X617.Jpg.{pdf,mobi,epub}'.
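The two outcomes above can be reproduced with a small sketch. It assumes the parser takes the last path segment of the href, replaces hyphens with spaces, and title-cases the result. The helper name `title_from_href` is hypothetical, and the linked line of `packtpub.py` is not reproduced here, so treat this as an illustration rather than the project's actual code.

```python
def title_from_href(href):
    # Hypothetical helper (assumption): last path segment, hyphens to spaces,
    # then title-case. str.title() treats digits, '_' and '.' as word breaks,
    # which is why '500x617.jpg' ends up as '500X617.Jpg'.
    return href.split('/')[-1].replace('-', ' ').title()


# href pointing at the main book page
print(title_from_href('/networking-and-servers/mastering-aws-development'))
# Mastering Aws Development

# href pointing at the cover image
print(title_from_href(
    '///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/'
    'nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg'))
# 5612_Wyntkangular_Ebook_500X617.Jpg
```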
An alternative would be to use the string inside the h1 tag of the title-bar-title div, as in https://github.com/mkarpiarz/packtpub-crawler/commit/c583d375d02e3e95d8bb0f3988ebc7615a138440. But this doesn't always seem reliable either, e.g.:
<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>
I would suggest going for the h1 tag and, if for some reason it is missing, using the other approach as a fallback, maybe stripping the numbers with a regexp. The output probably won't be nice, but at least it should work.
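A minimal sketch of that suggestion, assuming Python 3 and only the standard library (the crawler itself may use a different HTML parser). All names here (`title_from_h1`, `title_from_href`, `extract_title`) are hypothetical, and the choice to strip the image extension and underscores along with the numbers is an assumption about what "should work" means for a filename. As noted earlier in the thread, the h1 can hold a promotional heading rather than the book's real title, so this only addresses the missing-h1 case.

```python
import re
from html.parser import HTMLParser


class _TitleBarParser(HTMLParser):
    """Collects the text of the first <h1> inside <div id="title-bar-title">."""

    def __init__(self):
        super().__init__()
        self._div_depth = 0   # > 0 while inside the title-bar-title div
        self._in_h1 = False
        self._done = False    # True once the first matching h1 has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self._div_depth > 0:
                self._div_depth += 1          # nested div inside the title bar
            elif dict(attrs).get('id') == 'title-bar-title':
                self._div_depth = 1
        elif tag == 'h1' and self._div_depth > 0 and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'div' and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == 'h1' and self._in_h1:
            self._in_h1 = False
            self._done = True

    def handle_data(self, data):
        if self._in_h1:
            self.parts.append(data)


def title_from_h1(html):
    """Return the h1 text of the title bar, or None if it is missing or empty."""
    parser = _TitleBarParser()
    parser.feed(html)
    text = ' '.join(''.join(parser.parts).split())
    return text or None


def title_from_href(href):
    """Fallback: derive a rough title from the last path segment of the href."""
    name = href.rstrip('/').split('/')[-1]
    name = re.sub(r'\.(jpe?g|png|gif)$', '', name, flags=re.IGNORECASE)  # image extension
    name = re.sub(r'\d+(x\d+)?', ' ', name)   # numbers such as 5612 or 500x617
    name = re.sub(r'[-_]+', ' ', name)        # separators to spaces
    return ' '.join(name.split()).title()


def extract_title(html, href):
    """Prefer the h1; fall back to the cleaned-up href when it is absent."""
    return title_from_h1(html) or title_from_href(href)


page = '<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>'
cover = ('///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/'
         'nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg')

print(extract_title(page, cover))              # Free Amazon Web Services eBook
print(extract_title('<p>no title</p>', cover))  # Wyntkangular Ebook
```

The fallback output is not pretty, but it no longer carries the image extension or dimensions into the filename.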