ptrstn / dailyblink

Downloads the Audio and Text of the Free Daily book from Blinkist.com
MIT License

Daily Blink Page Layout has changed - IndexError: list index out of range #32

Open ptrstn opened 2 years ago

ptrstn commented 2 years ago

The layout and URL of the free daily page have changed.

New URL: https://www.blinkist.com/en/content/daily

The BeautifulSoup locator attribute values have to be updated accordingly; the previous values no longer match anything and cause an IndexError:

    def _create_blink_info(response_text):
        soup = BeautifulSoup(response_text, "html.parser")
>       daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
E       IndexError: list index out of range
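
Until the new locators are identified, the scraper could at least fail with a clearer message than a bare IndexError. A minimal defensive sketch (the class name below is the old, now-broken locator, and the error message wording is just illustrative):

    from bs4 import BeautifulSoup

    def _create_blink_info(response_text):
        soup = BeautifulSoup(response_text, "html.parser")
        # The old locator; find() returns None now that the layout changed.
        cta = soup.find("a", {"class": "daily-book__cta"})
        if cta is None or not cta.get("href"):
            raise RuntimeError(
                "Could not find the daily book link; "
                "the Blinkist page layout has probably changed again."
            )
        daily_book_href = cta["href"]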
kotzer3 commented 2 years ago

Confirmed, I've been having this too since 2022-05-22; the last folder in my library is '2022-05-21 - Finde den Weg zu deiner inneren Mitte'.

root@banane:~# python3 -m dailyblink
dailyblink v1.2.1, Python 3.9.2, Linux armv7l 32bit ELF
Downloading the free daily Blinks on 2022-06-04 22:47:32...
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 67, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 63, in main
    blinkist_scraper.download_daily_blinks(args.language, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 37, in download_daily_blinks
    self._attempt_daily_blinks_download(languages, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 56, in _attempt_daily_blinks_download
    self._download_daily_blinks(language_code, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 63, in _download_daily_blinks
    blink_info = self._get_daily_blink_info(language=language_code)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 126, in _get_daily_blink_info
    return _create_blink_info(response.text)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 171, in _create_blink_info
    daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
IndexError: list index out of range
root@banane:~#
Erik262 commented 2 years ago

Yep, same here. How to fix this?

NicoWeio commented 2 years ago

I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)

Erik262 commented 2 years ago

> I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. [...] If anyone's interested, I'll post my code tomorrow. :)

Perfect, please let me know!

NicoWeio commented 2 years ago

Here you go. :)

⚠️ Update: I've created a repo with updated code here

Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.

import cloudscraper
from datetime import datetime
from pathlib import Path
import requests
from rich import print
from rich.progress import track

BASE_URL = 'https://www.blinkist.com/'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0',
    'x-requested-with': 'XMLHttpRequest',
}

LOCALES = ['en', 'de']
DOWNLOAD_DIR = Path.home() / 'Musik' / 'Blinkist'

scraper = cloudscraper.create_scraper()

def get_book_dir(book):
    return DOWNLOAD_DIR / f"{datetime.today().strftime('%Y-%m-%d')} – {book['slug']}"

def get_free_daily(locale):
    # see also: https://www.blinkist.com/en/content/daily
    response = scraper.get(
        BASE_URL + 'api/free_daily',
        params={'locale': locale}
    )
    return response.json()

def get_chapters(book_slug):
    url = f"{BASE_URL}/api/books/{book_slug}/chapters"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()['chapters']

def get_chapter(book_id, chapter_id):
    url = f"{BASE_URL}/api/books/{book_id}/chapters/{chapter_id}"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()

def download_chapter_audio(book, chapter_data):
    book_dir = get_book_dir(book)
    book_dir.mkdir(parents=True, exist_ok=True)  # parents=True in case the base dir doesn't exist yet
    file_path = book_dir / f"chapter_{chapter_data['order_no']}.m4a"

    if file_path.exists():
        print(f"Skipping existing file: {file_path}")
        return

    assert 'm4a' in chapter_data['signed_audio_url']
    response = scraper.get(chapter_data['signed_audio_url'])
    assert response.status_code == 200
    file_path.write_bytes(response.content)
    print(f"Downloaded chapter {chapter_data['order_no']}")

for locale in LOCALES:
    free_daily = get_free_daily(locale=locale)
    book = free_daily['book']
    print(f"Today's free daily in {locale} is: “{book['title']}”")

    # list of chapters without their content
    chapter_list = get_chapters(book['slug'])

    # fetch chapter content
    chapters = [get_chapter(book['id'], chapter['id']) for chapter in track(chapter_list, description='Fetching chapters…')]

    # download audio
    for chapter in track(chapters, description='Downloading audio…'):
        download_chapter_audio(book, chapter)

    # write markdown
    # excluded for brevity – just access chapter['text'] etc.
    # markdown_text = download_book_md(book, chapters)
Erik262 commented 2 years ago

@NicoWeio does your code work straight out of the box, or does it need to replace core.py?

WrayOfSunshine commented 2 years ago

Would this approach work on a Windows machine?

NicoWeio commented 2 years ago

> @NicoWeio does your code work straight out of the box, or does it need to replace core.py?

See my earlier comment:

> However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium.

Assuming you have cloudscraper installed, my script works out of the box, and it should download the audio just fine. However, it does not generate a text or cover image file, does not set the audio's metadata, and does not precisely follow dailyblink's naming conventions.
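
If someone wants the text as well, a rough sketch of dumping the chapters to a markdown file might look like the following. It reuses get_book_dir from the script above, and the 'title' and 'text' field names are assumptions about the chapter JSON that I haven't double-checked:

    def write_book_markdown(book, chapters):
        # Hypothetical helper: collects each chapter's text into one markdown
        # file in the book's download directory. The 'title' and 'text' keys
        # are assumed, not verified against the API response.
        book_dir = get_book_dir(book)
        book_dir.mkdir(parents=True, exist_ok=True)
        parts = [f"# {book['title']}\n"]
        for chapter in chapters:
            parts.append(f"## {chapter.get('title', '')}\n")
            parts.append(f"{chapter.get('text', '')}\n")
        (book_dir / "book.md").write_text("\n".join(parts), encoding="utf-8")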

NicoWeio commented 2 years ago

> Would this approach work on a Windows machine?

If dailyblink worked on Windows before, yes. Both my approach using Blinkist's API and the current approach using BeautifulSoup.

Erik262 commented 2 years ago

@ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

ptrstn commented 2 years ago

> @ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

This change requires some refactoring and a bit more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.

Erik262 commented 2 years ago

> @ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

> This change requires some refactoring and a bit more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.

Sure, you're right about that.

rajeshbhavikatti commented 2 years ago

> Here you go. :)
>
> Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.
>
> [full script quoted verbatim from the comment above]

Executing this code on Google Colab, I get a 403 Forbidden error on line 70, when calling get_chapters. After troubleshooting, I found that response.raise_for_status() raises it because the URL can't be accessed. How can I resolve this? @NicoWeio

NicoWeio commented 2 years ago

@rajeshbhavikatti I just published my code here, so we can keep this issue clean from further discussions. Notice the double slash in the URL? That might be the cause, although it didn't cause issues for me. Maybe because of a different requests version? Anyway, I fixed the double slashes in my code. Plus, I've added CI to my repo, and it works just fine there, too.
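
For anyone hitting the same thing: when the base URL already ends in a slash, either drop the leading slash in the path or use urllib.parse.urljoin, which resolves the relative path cleanly. A generic sketch, not necessarily how the published repo handles it (book_slug is a placeholder):

    from urllib.parse import urljoin

    BASE_URL = 'https://www.blinkist.com/'
    book_slug = 'some-book-slug'  # placeholder value

    # An f-string with a leading slash produces a double slash:
    # 'https://www.blinkist.com//api/books/some-book-slug/chapters'
    bad_url = f"{BASE_URL}/api/books/{book_slug}/chapters"

    # urljoin resolves the relative path against the base cleanly:
    # 'https://www.blinkist.com/api/books/some-book-slug/chapters'
    good_url = urljoin(BASE_URL, f"api/books/{book_slug}/chapters")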

kotzer3 commented 1 year ago

> @ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

> This change requires some refactoring and a bit more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.

Hi Peter @ptrstn, do you have any updates on this?

ptrstn commented 1 year ago

> Hi Peter @ptrstn, do you have any updates on this?

I'll be able to work on it at the beginning of October, since I'm still busy with private matters.

kotzer3 commented 1 year ago

> Hi Peter @ptrstn, do you have any updates on this?

> I'll be able to work on it at the beginning of October, since I'm still busy with private matters.

Any news for us?

rajeshbhavikatti commented 1 year ago

Hi, I have made some updates based on this repo; check out my notebook here. Feel free to reach out to me about any changes or updates.

Erik262 commented 1 year ago

@rajeshbhavikatti Nice work, but you don't fetch the mp3 files.

rajeshbhavikatti commented 1 year ago

@Erik262 Yes, since the Notion API doesn't support it yet.