knadh / tg-archive

A tool for exporting Telegram group chats into static websites like mailing list archives.
MIT License
885 stars 134 forks

No Images/Movies Downloaded since Oct 2020 #52

Closed Martin-Furter closed 2 years ago

Martin-Furter commented 2 years ago

I have set up tg-archive to download the channel MARKmobil, just to have a backup of all his work.

The website can be found here: https://markmobil.borg.ch/telegram/2020-10.html#2020-10-06

The web pages generated from the downloaded data look fine until September 2020. The last picture I can see is from October 2020; after that, all movies and pictures are missing.

I can see many error messages like the following, with differing media numbers:

2022-02-24 11:13:23,855: downloading media #2690
2022-02-24 11:13:23,856: Starting direct file download in chunks of 131072 at 0, stride 131072
2022-02-24 11:13:24,030: error downloading media: #2690: The file reference has expired and is no longer valid or it belongs to self-destructing media and cannot be resent (caused by GetFileRequest)

If I look into the channel using telegram-desktop I can still see the latest pictures.

knadh commented 2 years ago

2022-02-24 11:13:24,030: error downloading media: #2690: The file reference has expired and is no longer valid or it belongs to self-destructing media and cannot be resent (caused by GetFileRequest)

Ah, it looks like this has something to do with the Telegram API itself or the Telethon API client. I am afraid I wouldn't be able to debug this.

exactoph commented 2 years ago

Hello there,

I have the same issue here, but only with some groups. However, there is a workaround: when this error message appears, you can download the media by running a sync for that specific message ID.

e.g.:

2022-05-19 12:10:26,307: error downloading media: #117: The file reference has expired and is no longer valid or it belongs to self-destructing media and cannot be resent (caused by GetFileRequest)

$ tg-archive --sync -id 117
2022-05-20 09:16:24,767: downloading media #117
2022-05-20 09:16:24,789: Starting direct file download in chunks of 131072 at 0, stride 131072
2022-05-20 09:16:30,680: finished. fetched 1 messages. last message = 2022-05-15 07:32:18+00:00

So I'll search for failed media IDs after sync and execute the specific sync for each error afterwards. Maybe @knadh can add this logic to the repo?

knadh commented 2 years ago

Maybe @knadh can add this logic to the repo?

Will think about this.

A hacky way to achieve this would be to run something like tg-archive --sync > sync.log, then use shell scripting to grep the IDs of the failed media and run sync on them again.
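A minimal sketch of that approach, assuming the log lines keep the format shown in the error messages earlier in this thread. The helper names failed_media_ids and retry_failed_media are hypothetical and not part of tg-archive; the -id flag is the one used in the workaround above. The capture command in the final comment uses 2>&1 in case the log lines are written to stderr rather than stdout.

```shell
#!/bin/bash

# Hypothetical helper (not part of tg-archive): read sync log text on stdin
# and print the ID of every failed media download, one per line,
# sorted numerically with duplicates removed.
failed_media_ids() {
    sed -n 's/.*error downloading media: #\([0-9][0-9]*\):.*/\1/p' | sort -nu
}

# Hypothetical helper: retry each failed ID found in the given log file.
# Assumes tg-archive is on PATH and is run from the archive directory.
retry_failed_media() {
    local log_file="$1" id
    # IDs are digits only, so word splitting of the command output is safe here.
    for id in $(failed_media_ids < "$log_file"); do
        echo "Retrying failed media #$id ..."
        tg-archive --sync -id "$id" || echo "Retry of media #$id failed" >&2
    done
}

# Intended use (not executed here):
#   tg-archive --sync 2>&1 | tee sync.log
#   retry_failed_media sync.log
```

Matching on the message text instead of a fixed field position keeps the extraction working if the timestamp prefix changes, and de-duplicating avoids retrying the same ID twice.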

exactoph commented 2 years ago

The following works for me. It retries each failed media download; as far as I can see, all previously failed media downloads succeed now.

A better way would probably be to catch the exception in the Python code and react to it.

#!/bin/bash

# Setting up absolute path as cron can't find program ...
tg_archiver_bin='/usr/local/bin/tg-archive'

# Declaring empty array and adding values one by one for better readability.
tg_archive_paths=()

# Add paths one by one. Use absolute paths: ~ is not expanded inside double quotes!
tg_archive_paths+=("/path/to/tg-archive")

# Iterate over each directory, poll new messages and build new html pages
for tg_archive_path in "${tg_archive_paths[@]}"
do
        echo "Updating directory '$tg_archive_path' ..."

        # Switching to directory as tg-archive doesn't work well with parameterized call ...
        cd "$tg_archive_path"
        if [[ $? -ne 0 ]]
        then
                echo -e "ERROR: Directory does not exist, skipping this one!\n"
                continue
        fi

        # Getting new messages and saving output to file
        "$tg_archiver_bin" --sync 2>&1 | tee output.txt

        # Check tg-archive's exit status; after a pipeline, $? would be tee's
        if [[ ${PIPESTATUS[0]} -ne 0 ]]
        then
                rm output.txt
                echo -e "ERROR: Telegram sync on '$tg_archive_path' failed!\n"
                continue
        fi

        # Iterating over media download errors and trying to get them by single download
        error_media_numbers=$(grep "error downloading media" output.txt | awk '{ print $6 }' | tr -d '#:')
        for error_media_number in $error_media_numbers
        do
                echo "Retrying failed media file #$error_media_number ..."
                "$tg_archiver_bin" --sync -id "$error_media_number"

                if [[ $? -ne 0 ]]
                then
                        echo -e "ERROR: Retry of media file #$error_media_number on '$tg_archive_path' failed!\n"
                fi
        done

        rm output.txt

        # Building HTML content
        $tg_archiver_bin --build
        if [[ $? -ne 0 ]]
        then
                echo -e "ERROR: Page build on '$tg_archive_path' failed!\n"
                continue
        fi

        echo -e "Updated directory '$tg_archive_path'.\n"
done