niqdev / packtpub-crawler

Download your daily free Packt Publishing eBook https://www.packtpub.com/packt/offers/free-learning
MIT License

Error attempting to claim book from newsletter #47

Open lucymhdavies opened 7 years ago

lucymhdavies commented 7 years ago
~ $ python script/spider.py --config config/prod.cfg --notify ifttt --claimOnly

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 10:30 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[+] notification sent to IFTTT
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/app/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[+] error notification sent to IFTTT
[*] done
~ $

It had already successfully claimed the book from the newsletter, but on subsequent days I'm getting the above error.

And it sends an IFTTT notification for the second one :(
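
Looking at the failing line, it takes the fifth '/'-separated segment of whatever href it parsed, so any href with fewer segments fails exactly like this. For example (this relative href is just a guess at a plausible value, not taken from the actual page):

# hypothetical href with too few segments
urlWithTitle = '/packt/free-ebook/practical-data-analysis'
urlWithTitle.split('/')     # ['', 'packt', 'free-ebook', 'practical-data-analysis']
urlWithTitle.split('/')[4]  # IndexError: list index out of range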

lucymhdavies commented 7 years ago

Updated to the latest code from master of this repo. Issue still present.

~ $ python script/spider.py --config config/prod.cfg --claimOnly

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 10:36 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/app/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done

juzim commented 7 years ago

Can confirm, I'll look into it. For a quickfix, create a file named "lastNewsletterUrl" containing "https://www.packtpub.com/packt/free-ebook/practical-data-analysis" in the config folder. This should stop the errors for now.
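
If you'd rather script it, a one-off snippet like this should do (assuming you run it from the project root and the script reads the file contents verbatim):

# write the quickfix file described above
with open('config/lastNewsletterUrl', 'w') as f:
    f.write('https://www.packtpub.com/packt/free-ebook/practical-data-analysis')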

juzim commented 7 years ago

Fixed in https://github.com/niqdev/packtpub-crawler/pull/48

lucymhdavies commented 7 years ago

nice. thanks 👍

trancen commented 7 years ago

Getting this error now after updating, on version 2.2.4:

[+] new download: /home/david/packtpub-crawler//home/david/packtpub-crawler/ebooks/extras/MySQL_for_Python.zip
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/home/david/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/home/david/packtpub-crawler/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done

juzim commented 7 years ago

Yes, this is the same error. The fix should hopefully be merged tonight, or you can change the line yourself.

trancen commented 7 years ago

Sorry, I just noticed the comment above about creating the "lastNewsletterUrl" file. That worked.

niqdev commented 7 years ago

Thanks @juzim, the problem above is fixed, but there is actually another bug; see the log below (I've hidden some variables/paths).

python script/spider.py -c config/prod.cfg -u drive -s firebase -n gmail

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 18:45 - fetching today's eBooks
[*] configuration file: XXX/github/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
[+] uploading file...
-[+] updating file permissions...
-       [path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
    [download_url] UUU
    [name] MySQL_for_Python.pdf
    [mime_type] application/pdf
    [id] III
[+] Stored on firebase: KKK
[+] notified to: ['aaa', 'bbb']
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
[+] new file upload on Drive:
[+] uploading file...
|[+] updating file permissions...
-       [path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
    [download_url] https://drive.google.com/uc?id=ZZZ
    [name] MySQL_for_Python.pdf
    [mime_type] application/pdf
    [id] LLL
[+] new file upload on Drive:
[+] uploading file...
|[+] updating file permissions...
\       [path] XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
    [download_url] https://drive.google.com/uc?id=DDD
    [name] Practical_Data_Analysis.pdf
    [mime_type] application/pdf
    [id] YYY
[+] Stored on firebase: WWW
[+] notified to: ['aaa', 'bbb']
[*] done

niqdev commented 7 years ago

I started the script again and it seems that the 3 books are identical: the daily eBook is ignored and I have 3 copies of the same (newsletter) book. Moreover, I checked on Firebase and the uploaded data are mixed up, e.g. different filename but same author (that of the newsletter book).

juzim commented 7 years ago

I think we have to reset the Packtpub instance before handling the newsletter but then this would have been broken for weeks. Could you check?

Also, could you comment out spider.py:22 so we can see the contents of each handled packtpub.info?

niqdev commented 7 years ago

Yep. Here is the first log (note that the author is wrong; I also just noticed that paths is empty):

...
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
  "url_source_code": "https://www.packtpub.com/code_download/12891", 
  "paths": [], 
  "description": "MySQL for Python is the essential ingredient for building productive and feature-rich Python applications as it provides powerful database support and will also take the burden off your webserver. This eBook shows how to boost the productivity and maintainability of your Python apps by integrating them with the MySQL database server. It will take you from installing MySQL for Python on your platform of choice all the way through to database automation and administration. Every chapter is illustrated with a practical project that you can use during your own app development process. This eBook is free for today only so don\u2019t miss out!", 
  "title": "MySQL for Python", 
  "author": "Hector Cuesta", 
  "filename": "MySQL_for_Python", 
  "book_id": "12890", 
  "url_claim": "https://www.packtpub.com/freelearning-claim/5286/21478", 
  "url_image": "https://dz13w8afd47il.cloudfront.net/sites/default/files/imagecache/dotd_main_image/0189OS_MockupCover_0.jpg"
}
[+] book successfully claimed
...

and the second log:

...
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
  "url_source_code": "https://www.packtpub.com/code_download/12891", 
  "paths": [
    "XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf"
  ], 
  "description": "Get started in data analysis with this free 360 page eBook guide\nFor small businesses, analyzing the information contained in their data using open source technology could be game-changing. All you need is some basic programming and mathematical skills to do just that. This free data analysis eBook is designed to give you the knowledge you need to start succeeding in data analysis. Discover the tools, techniques and algorithms you need to transform your data into insight.\n\nVisualize your data to find trends and correlations\nBuild your own image similarity search engine\nLearn how to forecast numerical values from time series data\nCreate an interactive visualization for your social media graph", 
  "title": "Practical Data Analysis", 
  "author": "Hector Cuesta", 
  "filename": "Practical_Data_Analysis", 
  "book_id": "12890", 
  "url_claim": "https://www.packtpub.com/promo/claim/12891/27564", 
  "url_image": "https://d1ldz4te4covpm.cloudfront.net/sites/default/files/B02731_Practical Data Analysis.jpg"
}
[+] book successfully claimed
...

niqdev commented 7 years ago

About the first question: we probably have to reset everything before a new claim, but is this the first newsletter since they reactivated the free eBook? The other books seem correct; I also checked the logs on Heroku.

niqdev commented 7 years ago

The url_source_code field is also wrong.

juzim commented 7 years ago

Can you add packtpub = Packtpub(config, args.dev) to spider.py:123 and see if it fixes it? Sorry, but somehow the tests on my machine are broken...

juzim commented 7 years ago

Did you by any chance delete the lastNewsletterUrl file? Because if the script tries to grab an already claimed newsletter book from the archive, it won't find it in the top position and will overwrite the data with whatever book is currently there.

We have to either check if the book was already claimed (would be a nice feature anyways) or find the book in the list by name/id/etc.

I'll try to look into it tomorrow but it shouldn't happen again unless you delete the file.

niqdev commented 7 years ago

I'm testing now, using Docker, with the change you suggested, i.e. adding at line 123:

packtpub = Packtpub(config, args.dev)
packtpub.runNewsletter(currentNewsletterUrl)

I get the following error:

...
[+] book successfully claimed
[+] created new directory: /packtpub-crawler/ebooks
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[+] new download: /packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
\[+] uploading file...
\[+] updating file permissions...
/Traceback (most recent call last):
  File "script/spider.py", line 124, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/packtpub-crawler/script/packtpub.py", line 105, in __parseNewsletterBookInfo
    self.info['url_claim'] = self.__url_base + claimNode[0]['href']
IndexError: list index out of range
    [path] /packtpub-crawler/ebooks/MySQL_for_Python.pdf
...

juzim commented 7 years ago

When we reset the whole Packtpub instance we also lose the login information, so it won't work. I added a method to reset packtpub.info, but this won't solve the other issue.

niqdev commented 7 years ago

We should probably just reset the info in Packtpub.py:

self.info = {
  'paths': []
}

What do you think?

niqdev commented 7 years ago

No, same problem: we lose the session.

juzim commented 7 years ago

I think the solution is to check whether the book is already claimed before processing it further. The claim response page contains an error message that we can parse. I'll try to submit a patch tomorrow.

niqdev commented 7 years ago

About your solution, I just don't like the fact that we have to do another request; we should be able to reset all the fields beforehand. This is just my thought, but this is where mutable state sucks (we are also missing tests) and a purely functional approach would help us a lot. By the way, I'm not gonna rewrite anything... haha

juzim commented 7 years ago

We don't need another request: the claim request returns the archive page with the error message (if the book was already claimed). So it's just another check in get_claim that throws an exception to prevent further processing.

This has some small downsides but will fix your issue.
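
Roughly what I have in mind, as a sketch (the div.messages.error selector and the exception name are placeholders, not the actual markup or patch):

# sketch only: selector and exception name are placeholders
from bs4 import BeautifulSoup

class AlreadyClaimedException(Exception):
    pass

def assert_not_already_claimed(claim_response_html):
    # the claim response already contains the archive page, so no extra request is needed
    soup = BeautifulSoup(claim_response_html, 'html.parser')
    error = soup.select('div.messages.error')  # placeholder selector
    if error:
        raise AlreadyClaimedException(error[0].get_text(strip=True))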

juzim commented 7 years ago

While I'm not a fan of flow control by exceptions, I think it works just fine in this case. Your code is solid and can surely take quite a bit more of a beating before a rewrite becomes necessary.

Fixed in https://github.com/niqdev/packtpub-crawler/pull/50

This fix looks for a specific error message on the claim result page (which curiously only exists for the newsletter, not the dailies), which should work for now. No additional requests are made.

Assuming that the first entry in the archive is always the book we are currently processing might cause further trouble (Packtpub might switch to alphabetical sorting, for example). But right now the only way to tell from the archive whether the book was already claimed is searching by name, which is error-prone since we parse the title from the claim URL. In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim. Since we can generate the file name from the list entry title instead of the claim URL this way, we can reliably match them.

This would also resolve https://github.com/niqdev/packtpub-crawler/issues/23
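
Roughly what I mean, as a sketch (the archive selector, the title attribute, and download_book are placeholders; the real my-ebooks markup needs checking):

# hypothetical: sync the my-ebooks archive with the local ebooks folder
import os

def sync_archive(soup, download_dir, download_book):
    for entry in soup.select('#product-account-list .product-line'):  # selector assumed
        title = entry.get('title', '').strip()
        if not title:
            continue
        # build the file name from the list entry title, not the claim URL
        filename = title.replace(' ', '_') + '.pdf'
        if not os.path.exists(os.path.join(download_dir, filename)):
            download_book(entry)  # only fetch books missing locally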

Any volunteers? :)

lucymhdavies commented 7 years ago

In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim.

How would that work if you're running with --claimOnly?

juzim commented 7 years ago

The script would just claim the book, and you could download it later manually or run it with a "downloadAll" parameter that only syncs the archive with the local folder. Notifications etc. are handled on claim, not on download.

niqdev commented 7 years ago

@juzim, I'll keep monitoring until the next newsletter. I'll create tag 2.2.5. Your solution with the local search is fine; as for me, as you have seen, I don't have much time at the moment and I'm working on other projects as well. If you leave it there I may do it, I just don't know when. Thanks

niqdev commented 7 years ago

Just to keep track: this needs further investigation. Sometimes the script is able to download the newsletter, but this week, for example, there is this error:

[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
  File "script/spider.py", line 125, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "PATH/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "PATH/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
    urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range

juzim commented 7 years ago

I bet they are doing A/B tests, which makes this hard to reproduce. I think claiming still works despite the error, can you confirm?

niqdev commented 7 years ago

No, unfortunately claiming is not working either.

niqdev commented 7 years ago

The div promo-landing-book-picture doesn't exist.

juzim commented 7 years ago

That's it?! I'll try to fix it soon but it might take till next week, sorry.

mkarpiarz commented 7 years ago

Looks like some of the divs have been renamed on the newsletter's landing page. I compared the page for an older book:

    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left promo-landing-book-picture">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a href="/web/20170113204509/https://dz13w8afd47il.cloudfront.net/networking-and-servers/mastering-aws-development">
                        <img src="/web/20170113204509im_/https://d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering%20AWS%20Development.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left promo-landing-book-info">
                <div class="promo-landing-book-body-title">
                                    </div>
                <div class="promo-landing-book-body">
                    <div><h1>Claim your free 416 page Amazon Web Services eBook!</h1>
<p>This book is a practical guide to developing, administering, and managing applications and infrastructures with AWS. With this, you'll be able to create, design, and manage an entire application life cycle on AWS by using the AWS SDKs, APIs, and the AWS Management Console.</p>
</div>
                </div>
                            </div>

with the current one:

<div id="main-book" class="cf nano" itemscope itemtype="http://schema.org/Book">
    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left nano-book-main-image">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
                        <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left nano-book-text">
                <h1>What you need to know about Angular 2</h1>
                <div><strong>Get to grips with the ins and outs of one of the biggest web dev revolutions of this decade with the aid of this free eGuide! From setting up the very basics of Angular to making the most of Directives and Components you’ll discover everything you need to get started building your own web apps today.</strong></div>
                <div id="nano-learn">
                    <div id="nano-learn-title">
                        <div id="nano-learn-title-text">
                            <span id="nano-learn-title-text-inner">
                                What You Will Learn                            </span>
                        </div>
                    </div>

and came up with this hotfix: https://github.com/niqdev/packtpub-crawler/compare/master...mkarpiarz:fix_newsletter_divs I haven't tested email notifications yet, so I'm not sure how the description will look, but claiming a newsletter eBook seems to work now. Happy to submit a PR if @juzim hasn't started working on this yet.
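
In essence, the hotfix swaps the old promo-landing-* selectors for the new nano-* ones, along these lines (simplified here, not a verbatim copy of my branch):

# new markup: the title is a plain <h1> inside div.nano-book-text
title_node = soup.select('div.float-left.nano-book-text h1')
if title_node:
    self.info['title'] = title_node[0].text.strip()

# new markup: the cover link is the fancybox anchor with a protocol-relative href
image_node = soup.select('div.float-left.nano-book-main-image a.fancybox')
if image_node:
    self.info['url_image'] = 'https://' + image_node[0]['href'].lstrip('/')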

juzim commented 7 years ago

That would be great, thanks!

CrazySerGo commented 7 years ago

Hi guys, I'm creating a Google script that parses PacktPub tweets (it builds on @juzim's Google script). I'm not sure, but there is a chance that all newsletter books are also published on their Twitter, so there'd be no need to fix this :) joking. It's not finished: it should exclude duplicates and check whether each link is still available. If you have time, please take a look at the output and see whether it's usable for the crawler: https://goo.gl/AXtAC8

juzim commented 7 years ago

The link doesn't work for me, can you create a pull request please?

Also, while there are tons of free books on the feed, they repeat a lot so we have to make sure the duplication check works.

mkarpiarz commented 7 years ago

Is there a reason for the newsletter spreadsheet being empty even though this week's free ebook is still available under https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2?

niqdev commented 7 years ago

@mkarpiarz, before merging the PR can you please confirm that the email notifications are still working? Thanks

juzim commented 7 years ago

@mkarpiarz I removed it to prevent error messages until the issue is fixed

mkarpiarz commented 7 years ago

@juzim - that's fine for now since there is an option to self-host the file.

I haven't tested email notifications yet, @niqdev, but I printed out all the variables in the __parseNewsletterBookInfo method and noticed this in the output:

self.info['title']: u'5612_Wyntkangular_Ebook_500X617.Jpg'
self.info['filename']: '5612_Wyntkangular_Ebook_500X617.Jpg'

This is because the code at https://github.com/niqdev/packtpub-crawler/blob/e604cc1138c5934f7cbe8c210ce1fa6f2caa80b3/script/packtpub.py#L101 extracts the book title from the URL inside the element that holds the book cover, and for this week's free eBook the relevant part looks like this:

<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
  <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>

So there is no link containing the book title; instead, the URL points to the location of the cover image.
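
The bad value is just Python's str.title() applied to the image filename segment; running the same replace/title chain on it reproduces the output exactly:

# the filename segment of the fancybox href, run through the existing chain
segment = '5612_WYNTKAngular_eBook_500x617.jpg'
print(segment.replace('-', ' ').title())
# 5612_Wyntkangular_Ebook_500X617.Jpg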

I'll create a separate thread for this title parsing issue.