lucymhdavies opened this issue 7 years ago
Updated to the latest code from master of this repo. Issue still present.
~ $ python script/spider.py --config config/prod.cfg --claimOnly
__ __ __ __
____ ____ ______/ /__/ /_____ __ __/ /_ ______________ __ __/ /__ _____
/ __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
/ /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ / / /_/ /| |/ |/ / / __/ /
/ .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/ \___/_/ \__,_/ |__/|__/_/\___/_/
/_/ /_/
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[*] 2017-01-31 10:36 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
File "script/spider.py", line 123, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/app/script/packtpub.py", line 160, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done
Can confirm, I'll look into it. For a quick fix, create a file named "lastNewsletterUrl" in the config folder containing "https://www.packtpub.com/packt/free-ebook/practical-data-analysis". This should stop the errors for now.
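Or, equivalently, as a tiny Python one-off (the path is assumed to be relative to the repo root):

```python
# One-off equivalent of the quick fix above: seed config/lastNewsletterUrl
# so the crawler skips re-parsing the already-claimed newsletter page.
with open('config/lastNewsletterUrl', 'w') as f:
    f.write('https://www.packtpub.com/packt/free-ebook/practical-data-analysis')
```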
nice. thanks 👍
Getting this error now after updating, on version 2.2.4:
[+] new download: /home/david/packtpub-crawler//home/david/packtpub-crawler/ebooks/extras/MySQL_for_Python.zip
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
File "script/spider.py", line 123, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/home/david/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/home/david/packtpub-crawler/script/packtpub.py", line 98, in __parseNewsletterBookInfo
title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done
Yes, this is the same error. The fix should hopefully be merged tonight, or you can change the line yourself.
Sorry, just noticed the comment above about creating the "lastNewsletterUrl" file; that worked.
Thanks @juzim, the problem above is fixed, but there is actually another bug; see the log below (I've hidden some variables/paths):
python script/spider.py -c config/prod.cfg -u drive -s firebase -n gmail
__ __ __ __
____ ____ ______/ /__/ /_____ __ __/ /_ ______________ __ __/ /__ _____
/ __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
/ /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ / / /_/ /| |/ |/ / / __/ /
/ .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/ \___/_/ \__,_/ |__/|__/_/\___/_/
/_/ /_/
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[*] 2017-01-31 18:45 - fetching today's eBooks
[*] configuration file: XXX/github/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
[path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
[download_url] UUU
[name] MySQL_for_Python.pdf
[mime_type] application/pdf
[id] III
[+] Stored on firebase: KKK
[+] notified to: ['aaa', 'bbb']
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
[path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
[download_url] https://drive.google.com/uc?id=ZZZ
[name] MySQL_for_Python.pdf
[mime_type] application/pdf
[id] LLL
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
[path] XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
[download_url] https://drive.google.com/uc?id=DDD
[name] Practical_Data_Analysis.pdf
[mime_type] application/pdf
[id] YYY
[+] Stored on firebase: WWW
[+] notified to: ['aaa', 'bbb']
[*] done
Actually, I started the script again and it seems the 3 books are identical: the daily eBook is ignored and I have 3 copies of the same (newsletter) book. Moreover, I checked on Firebase and the uploaded data are mixed up, e.g. different filenames but the same author (from the newsletter).
I think we have to reset the Packtpub instance before handling the newsletter but then this would have been broken for weeks. Could you check?
Also, could you uncomment spider.py:22 so we can see the contents of each handled packtpub.info?
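Judging by the indent-4 JSON dumps below, that debug line is presumably a pretty-printed dump along these lines (the exact statement isn't shown in this thread):

```python
import json

# Assumed shape of the spider.py:22 debug line, inferred from the
# indent-4 JSON output in the logs that follow:
def dump_info(info):
    print(json.dumps(info, indent=4))
```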
Yep, here is the first log (note the author is wrong); I also just noticed that paths is empty:
...
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
"url_source_code": "https://www.packtpub.com/code_download/12891",
"paths": [],
"description": "MySQL for Python is the essential ingredient for building productive and feature-rich Python applications as it provides powerful database support and will also take the burden off your webserver. This eBook shows how to boost the productivity and maintainability of your Python apps by integrating them with the MySQL database server. It will take you from installing MySQL for Python on your platform of choice all the way through to database automation and administration. Every chapter is illustrated with a practical project that you can use during your own app development process. This eBook is free for today only so don\u2019t miss out!",
"title": "MySQL for Python",
"author": "Hector Cuesta",
"filename": "MySQL_for_Python",
"book_id": "12890",
"url_claim": "https://www.packtpub.com/freelearning-claim/5286/21478",
"url_image": "https://dz13w8afd47il.cloudfront.net/sites/default/files/imagecache/dotd_main_image/0189OS_MockupCover_0.jpg"
}
[+] book successfully claimed
...
and the second log:
...
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
"url_source_code": "https://www.packtpub.com/code_download/12891",
"paths": [
"XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf"
],
"description": "Get started in data analysis with this free 360 page eBook guide\nFor small businesses, analyzing the information contained in their data using open source technology could be game-changing. All you need is some basic programming and mathematical skills to do just that. This free data analysis eBook is designed to give you the knowledge you need to start succeeding in data analysis. Discover the tools, techniques and algorithms you need to transform your data into insight.\n\nVisualize your data to find trends and correlations\nBuild your own image similarity search engine\nLearn how to forecast numerical values from time series data\nCreate an interactive visualization for your social media graph",
"title": "Practical Data Analysis",
"author": "Hector Cuesta",
"filename": "Practical_Data_Analysis",
"book_id": "12890",
"url_claim": "https://www.packtpub.com/promo/claim/12891/27564",
"url_image": "https://d1ldz4te4covpm.cloudfront.net/sites/default/files/B02731_Practical Data Analysis.jpg"
}
[+] book successfully claimed
...
About the first question: we probably have to reset everything before a new claim, but is this the first newsletter since they reactivated the free eBook? The other books seem correct; I also checked the logs on Heroku.
Also, the field url_source_code is wrong.
Can you add packtpub = Packtpub(config, args.dev) to spider.py:123 and see if it fixes it? Sorry, but somehow the tests on my machine are broken...
Did you by any chance delete the lastNewsletterUrl file? Because if the script tries to grab an already claimed newsletter book from the archive, it won't find it at the top position and will overwrite the data with whatever book is currently there.
We have to either check if the book was already claimed (would be a nice feature anyways) or find the book in the list by name/id/etc.
I'll try to look into it tomorrow but it shouldn't happen again unless you delete the file.
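A rough sketch of the "find the book in the list" option; the div.product-line selector and the title attribute here are assumptions, not the actual my-ebooks markup:

```python
# Hypothetical: locate the just-claimed book in the my-ebooks archive by
# title instead of assuming it is always the first entry.
def find_book_node(soup, title):
    for node in soup.select('div.product-line'):  # selector is an assumption
        node_title = (node.get('title') or '').strip()
        if node_title.lower().startswith(title.lower()):
            return node
    return None  # not found: the claim may have silently failed
```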
I'm testing now, using Docker, with the change you suggested, i.e. adding at line 123:
packtpub = Packtpub(config, args.dev)
packtpub.runNewsletter(currentNewsletterUrl)
I get the following error:
...
[+] book successfully claimed
[+] created new directory: /packtpub-crawler/ebooks
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[+] new download: /packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
Traceback (most recent call last):
File "script/spider.py", line 124, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/packtpub-crawler/script/packtpub.py", line 105, in __parseNewsletterBookInfo
self.info['url_claim'] = self.__url_base + claimNode[0]['href']
IndexError: list index out of range
[path] /packtpub-crawler/ebooks/MySQL_for_Python.pdf
...
When we reset the whole packtpub we also lose the login information, so it won't work. I added a method to reset the packtpub.info, but this won't solve the other issue.
We should probably just reset this in Packtpub.py:
self.info = {
'paths': []
}
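As a slightly fuller sketch (method names are placeholders, not the actual code):

```python
# Placeholder sketch: clear the parsed metadata between the daily and the
# newsletter claim without recreating Packtpub (which would drop the session).
class Packtpub(object):
    def __init__(self):
        self.info = {'paths': []}

    def reset_info(self):
        self.info = {'paths': []}

    def runNewsletter(self, url):
        self.reset_info()  # fresh metadata, session cookies survive
        # ...existing parsing and claim logic continues here...
```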
What do you think?
No, same problem: we lose the session.
I think the solution is to check whether the book is already claimed before processing it further. The claim response page contains an error message that we can parse. I'll try to submit a patch tomorrow.
About your solution, I just don't like the fact that we have to do another request; we should be able to reset all the fields beforehand. This is just my thought, but this is where mutable state sucks (we are also missing tests) and a purely functional approach would help us a lot. By the way, I'm not gonna rewrite anything... haha
We don't need another request: the claim request returns the archive with the error message (if the book was already claimed). So it's just another check in get_claim that throws an exception to prevent further processing.
This has some small downsides but will fix your issue.
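A minimal sketch of what that check could look like; the selector and the message text here are assumptions, not Packtpub's actual markup:

```python
# Hypothetical: abort further processing when the claim response contains
# an "already claimed" error message instead of a fresh archive entry.
class AlreadyClaimedError(Exception):
    pass

def check_claim_response(soup):
    errors = soup.select('div.messages.error')  # selector is an assumption
    if errors and 'already' in errors[0].text.lower():
        raise AlreadyClaimedError('book was already claimed, skipping')
```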
While I'm not a fan of flow control by exceptions, I think it works just fine in this case. Your code is solid and can surely handle quite some more beating before a rewrite would be necessary.
Fixed in https://github.com/niqdev/packtpub-crawler/pull/50
This fix looks for a specific error message on the claim result page (which curiously only exists for the newsletter, not the dailies) which should work for now. No additional requests are made.
Assuming that the first entry in the archive is always the book we are processing at the moment might cause further trouble (packtpub might switch to alphabetic sorting for example). But the only way to see if the book was already claimed in the archive right now is searching by name which is prone to errors since we parse the title from the claim URL. In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim. Since we can generate the file name from the list entry title instead of the claim URL this way, we can securely match them.
This would also resolve https://github.com/niqdev/packtpub-crawler/issues/23
Any volunteers? :)
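For reference, a rough sketch of the sync idea described above; parse_archive() and download() are hypothetical helpers, not existing functions:

```python
import os

# Sketch: download every book in the my-ebooks archive that is missing from
# the local ebooks folder, matching files by a name derived from the title.
def sync_archive(soup, download_dir):
    for title, download_url in parse_archive(soup):  # hypothetical helper
        filename = title.strip().replace(' ', '_') + '.pdf'
        path = os.path.join(download_dir, filename)
        if not os.path.exists(path):
            download(download_url, path)  # hypothetical helper
```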
In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim.
How would that work if you're running with --claimOnly?
The script would just claim the book, and you could download it later manually or run it with a "downloadAll" parameter that only syncs the archive with the local folder. Notifications etc. are handled on claim, not on download.
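The wiring for such a flag could be as simple as this (hypothetical; the parameter doesn't exist yet):

```python
import argparse

# Hypothetical flag for the proposed archive-sync-only mode:
parser = argparse.ArgumentParser()
parser.add_argument('--downloadAll', action='store_true',
                    help='only sync the archive with the local ebooks folder')
args = parser.parse_args()
```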
@juzim, I'll keep monitoring until the next newsletter. I'll create tag 2.2.5.
Your solution with the local search is fine. As for me, as you have seen, I haven't had much time in this period and I'm working on other projects as well. If you leave it open I may do it, I just don't know when.
Thanks
Just to keep track: this needs further investigation. Sometimes the script is able to download the newsletter, but this week, for example, there is this error:
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
File "script/spider.py", line 125, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "PATH/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "PATH/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range
I bet they are doing A/B tests, which makes this hard to reproduce. I think claiming still works despite the error, can you confirm?
No, unfortunately the claiming is not working either. The div promo-landing-book-picture doesn't exist.
That's it?! I'll try to fix it soon but it might take till next week, sorry.
Looks like some of the divs have been renamed on the newsletter's landing page. I compared the page for an older book:
<div class="book-top-block-wrapper cf">
<div class="cf section-inner">
<div class="float-left promo-landing-book-picture">
<div itemprop="image" itemtype="http://schema.org/URL" itemscope>
<a href="/web/20170113204509/https://dz13w8afd47il.cloudfront.net/networking-and-servers/mastering-aws-development">
<img src="/web/20170113204509im_/https://d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering%20AWS%20Development.jpg" class="bookimage" />
</a>
</div>
<div class="float-left promo-landing-book-info">
<div class="promo-landing-book-body-title">
</div>
<div class="promo-landing-book-body">
<div><h1>Claim your free 416 page Amazon Web Services eBook!</h1>
<p>This book is a practical guide to developing, administering, and managing applications and infrastructures with AWS. With this, you'll be able to create, design, and manage an entire application life cycle on AWS by using the AWS SDKs, APIs, and the AWS Management Console.</p>
</div>
</div>
</div>
with the current one:
<div id="main-book" class="cf nano" itemscope itemtype="http://schema.org/Book">
<div class="book-top-block-wrapper cf">
<div class="cf section-inner">
<div class="float-left nano-book-main-image">
<div itemprop="image" itemtype="http://schema.org/URL" itemscope>
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>
</div>
<div class="float-left nano-book-text">
<h1>What you need to know about Angular 2</h1>
<div><strong>Get to grips with the ins and outs of one of the biggest web dev revolutions of this decade with the aid of this free eGuide! From setting up the very basics of Angular to making the most of Directives and Components you’ll discover everything you need to get started building your own web apps today.</strong></div>
<div id="nano-learn">
<div id="nano-learn-title">
<div id="nano-learn-title-text">
<span id="nano-learn-title-text-inner">
What You Will Learn </span>
</div>
</div>
and came up with this hotfix: https://github.com/niqdev/packtpub-crawler/compare/master...mkarpiarz:fix_newsletter_divs I haven't tested email notifications yet, so I'm not sure what the description will look like, but claiming a newsletter eBook seems to work now. Happy to submit a PR if @juzim hasn't started working on this yet.
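For anyone skimming, the gist of the change, paraphrased from the two markup excerpts above (the exact code is in the linked branch): the old promo-landing-* selectors are swapped for the new nano-* ones.

```python
# Paraphrase of the hotfix's selector change, based on the new markup above:
def parse_newsletter_book(div_target):
    title = div_target.select('div.float-left.nano-book-text h1')[0].text.strip()
    image_url = div_target.select('div.nano-book-main-image img.bookimage')[0]['src']
    return title, image_url
```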
That would be great, thanks!
Hi guys, I'm creating a Google script that parses PacktPub tweets (it comes from @juzim's Google script). I'm not sure, but there is a chance that all books from newsletters will also be published on their Twitter, so there's no need to fix anything :) Joking. It's not finished: it should exclude duplicates and check whether each link is still available or not. If you have time, please look at the output and tell me if it's fine for the crawler or not: https://goo.gl/AXtAC8
The link doesn't work for me, can you create a pull request please?
Also, while there are tons of free books on the feed, they repeat a lot, so we have to make sure the duplication check works.
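Even a minimal order-preserving filter would do for that check (a sketch; links is assumed to be the list of parsed claim URLs):

```python
# Minimal order-preserving duplication check over the feed's claim links:
def unique_links(links):
    seen = set()
    result = []
    for link in links:
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result
```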
Is there a reason for the newsletter spreadsheet being empty even though this week's free ebook is still available under https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2?
@mkarpiarz, before merging the PR can you please confirm that the email notifications are still working? Thanks
@mkarpiarz I removed it to prevent error messages until the issue is fixed
@juzim - that's fine for now since there is an option to self-host the file.
I haven't tested email notifications yet, @niqdev, but I printed out all the variables in the __parseNewsletterBookInfo method and I noticed this in the output:
self.info['title']: u'5612_Wyntkangular_Ebook_500X617.Jpg'
self.info['filename']: '5612_Wyntkangular_Ebook_500X617.Jpg'
This is because the code at https://github.com/niqdev/packtpub-crawler/blob/e604cc1138c5934f7cbe8c210ce1fa6f2caa80b3/script/packtpub.py#L101 extracts book titles from the URL inside the element that holds the book cover, and for this week's free eBook the relevant part looks like this:
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>
So there is no link containing the book title; instead, the URL points to the location of the cover image.
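The mangled title is then just str.title() applied to the image filename; assuming the parser takes the last path segment of that href, the observed output reproduces exactly:

```python
# Reproduces the mangled title above from the cover-image href:
href = ('///d1ldz4te4covpm.cloudfront.net/sites/default/files/'
        'imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg')
print(href.split('/')[-1].replace('-', ' ').title())
# -> 5612_Wyntkangular_Ebook_500X617.Jpg
```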
I'll create a separate thread for this title parsing issue.
It has successfully claimed the book from the newsletter already, but on subsequent days I'm getting the above error.
And it sends an IFTTT notification for the second one :(