maria-antoniak / goodreads-scraper

A Python scraper for Goodreads books and reviews.
GNU General Public License v3.0

AttributeError: 'NoneType' object has no attribute 'text' #18

Closed andreasvc closed 3 years ago

andreasvc commented 3 years ago
2021-05-11 00:42:52.788691 get_reviews.py: Scraping 35839437-burn-bright...
2021-05-11 00:42:52.788727 get_reviews.py: #9 out of 26 books
Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Traceback (most recent call last):
  File "get_reviews.py", line 313, in <module>
    main()
  File "get_reviews.py", line 287, in main
    reviews = get_reviews_first_ten_pages(driver, book_id, args.sort_order)
  File "get_reviews.py", line 176, in get_reviews_first_ten_pages
    reviews += scrape_reviews_on_current_page(driver, url, book_id, sort_order)
  File "get_reviews.py", line 118, in scrape_reviews_on_current_page
    book_title = soup.find(id='bookTitle').text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
maria-antoniak commented 3 years ago

Hi! It looks like something is breaking for that particular book. You can check the source for that book here. Everything looks fine as far as I can tell (there is indeed an element with id='bookTitle'), so I'm not sure why it's breaking, and sadly I can't investigate further right now. One thought: the script should handle pop-ups, but it may still be worth running it again.
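
For reference, the traceback means that `soup.find(id='bookTitle')` returned `None` for the page that was actually loaded, so `.text` had nothing to be called on. Below is a minimal sketch of a guard for that lookup. The helper name `text_or_none` is hypothetical and not part of the script; it only assumes the parsed element exposes a `.text` attribute, as BeautifulSoup tags do.

```python
from typing import Optional


def text_or_none(element) -> Optional[str]:
    """Return the stripped text of a parsed element, or None if it is missing."""
    # BeautifulSoup's find() returns None when nothing matches, so calling
    # .text on the result directly raises AttributeError.
    if element is None:
        return None
    return element.text.strip()


# Hypothetical usage inside the scraper (not the project's actual code):
#   book_title = text_or_none(soup.find(id='bookTitle'))
#   if book_title is None:
#       ...skip or retry this page...
```

A guard like this only avoids the crash. It does not explain why the title element was missing on that page, so the caller still has to decide whether to skip or retry.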

andreasvc commented 3 years ago

No worries. Scraping is always messy and fussy.

I added a check for None, and now it keeps re-scraping the book:

python3 get_reviews.py --book_ids_path mybooks.txt --output_directory_path /tmp --format csv --browser firefox
2021-05-11 00:59:10.269034 get_reviews.py: Scraping 35839437-burn-bright...
2021-05-11 00:59:10.269099 get_reviews.py: #9 out of 19 books
Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Scraped page 8
🚨 ElementNotInteractableException🚨
🔄 Refreshing Goodreads site and rescraping book🔄
Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Scraped page 8
🚨 ElementNotInteractableException🚨

After 5 iterations of this I manually aborted. I don't have time to investigate either, so I removed this book from the list; scraping the other books did work fortunately.
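
One way to avoid an endless refresh-and-rescrape cycle like the one above is to cap the number of attempts per book. This is a rough sketch under stated assumptions, not how the script currently works: `scrape_with_retries`, `scrape_book`, and `max_attempts` are all hypothetical names, and `scrape_book` stands in for whatever per-book scraping step is being retried.

```python
def scrape_with_retries(scrape_book, book_id, max_attempts=3):
    """Call scrape_book(book_id), giving up after max_attempts failures.

    scrape_book is a hypothetical callable standing in for the script's
    per-book scraping step; it is not part of the project's API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return scrape_book(book_id)
        except Exception as error:  # e.g. Selenium's ElementNotInteractableException
            print(f"Attempt {attempt}/{max_attempts} failed for {book_id}: {error!r}")
    # Every attempt failed: report it and let the caller move on to the next book.
    print(f"Skipping {book_id} after {max_attempts} failed attempts")
    return None
```

With a cap in place, a book that fails on every attempt is reported and skipped, and the remaining books in the list still get scraped.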

maria-antoniak commented 3 years ago

If it's ok to leave it out, sometimes that's the easiest solution! And if you end up figuring it out, we welcome pull requests.

maria-antoniak commented 3 years ago

Closing unless this proves to be a more generalized problem.