mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites

Format -j like a metadata postprocessor would? #2883

Closed · Twi-Hard closed this 2 years ago

Twi-Hard commented 2 years ago

I'm trying to create a jsonl file for each twitter user, with no media. If I just "download" the account with --no-download it is MUCH slower, because it pauses at each mention of a media file rather than just fetching a full page of results. I'm also getting an error when trying to do it that way (TypeError: MetadataPP.run() takes 2 positional arguments but 3 were given). I just want to quickly export every tweet from an account to a single file. (I download the media with wget so I'm not limited by the API, and it might be higher quality because I get the png version of each image.) I used to use snscrape for this, but more and more stuff is becoming inaccessible with it and the developer will never add authentication. Is this possible? Thanks :)

mikf commented 2 years ago

I don't think this is possible with gallery-dl alone. You probably have to use jq to extract the metadata portion from the entries produced by -j, and also --filter False to ignore any media.

gallery-dl -j --filter False twitter.com/USER | jq '[ .[][1] ]' > data.json
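
If the goal is one JSON object per line (jsonl) instead of one big array, jq's -c flag should get there with the same index:

gallery-dl -j --filter False twitter.com/USER | jq -c '.[][1]' > data.jsonl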

Another possibility could be writing all Tweet data to individual files with a metadata post processor and then somehow combining them.
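
The combining step could then be as simple as concatenating the per-Tweet .json files and letting jq re-emit them one per line — a sketch, with the directory being a placeholder for wherever the post processor wrote its files:

find ./gallery-dl/twitter/USER -name '*.json' -exec cat {} + | jq -c . > combined.jsonl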


I'm also getting an error when trying to do it that way (TypeError: MetadataPP.run() takes 2 positional arguments but 3 were given).

Are you using a metadata PP with event: finalize or how exactly did you get this error?

I download the media with wget so I'm not limited by the api

Could you elaborate a bit further? What do you do differently than gallery-dl?

also it might be higher quality because I get the png version of each image

By using ?format=png for each image URL?

Twi-Hard commented 2 years ago

gallery-dl -j --filter False twitter.com/USER | jq '[ .[][1] ]' > data.json

I need the media links so I can download them, so I'd have to run it without that filter. Why are the links outside of the metadata? Having each tweet be a single JSON object (so it can be a jsonl file) with the media url inside would be great. I can never figure out how to do stuff like that with jq; I just keep guessing.

Are you using a metadata PP with event: finalize or how exactly did you get this error?

I was using "event": "finalize" because I was looking for an event that would put all of the metadata in one file.

Could you elaborate a bit further? What do you do differently than gallery-dl?

I download the images directly with wget at a concurrency of 5. They download extremely fast, and I could download thousands for hours straight without issues (no waiting because of rate limits at all). Getting the metadata without downloading anything in gallery-dl is fast, but trying to download media leaves me waiting most of the time.
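
In essence that's just plain parallel wget. A minimal equivalent with standard tools would be something like this (urls.txt being a hypothetical list of media URLs — the actual script is further down):

xargs -P 5 -n 1 wget -nc < urls.txt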

By using ?format=png for each image URL?

My script uses https://pbs.twimg.com/media/AAAAAAAAAAAAAAA?format=png&name=4096x4096 (replace the "A"s with the image ID). Every single image has a png version. All of the images I tested that were officially uploaded at their original resolution elsewhere had the same resolution and file size (all of the tested images were pngs to start with). There might be a way to detect what the original format was, I don't know.
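
For example, for a single image (using the placeholder ID from above):

wget -nc 'https://pbs.twimg.com/media/AAAAAAAAAAAAAAA?format=png&name=4096x4096' -O AAAAAAAAAAAAAAA.png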

Here's the script I made for this. I added comments in case you wanted to look at it. I renamed the default directory for privacy reasons. This is one of my first scripts so it might be bad, I don't know. It worked perfectly until I couldn't access a large amount of what I was trying to get because snscrape doesn't support authentication.

By the way, if you're able to fetch more metadata than what's currently fetched, that would be great. I'm using twarc2 with the official API to get some additional metadata (twarc2 hydrate input-ids.txt output-jsons.jsonl).

#!/bin/bash

# prerequisites:
# python3 -m pip install --force-reinstall https://github.com/yt-dlp/yt-dlp/archive/master.tar.gz
# pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
# sudo apt install moreutils
# pip install --upgrade twarc
# twarc2 configure

directory="/example/path/"

# remove blank lines from the input list (this fixed an issue I can't remember)
sed '/^$/d' "$1" > "${directory}"input.txt

while read -r account || [[ -n "${account}" ]];
do
handle=$(echo "${account}" | sed -nr 's%((https:\/\/)?(www\.)?twitter\.com\/)?([a-zA-Z0-9_]{4,15})%\4%p')
echo Downloading "${handle}"

date=$(date '+%Y_%m_%d')

# scrape user entity and merge with existing entities
snscrape -v --retry 10 --jsonl --max-results 0 --with-entity twitter-profile "${handle}" >> "${directory}"twitter-"${handle}"-entity.jsonl
cat "${directory}"twitter-"${handle}"-entity.jsonl | sed -r "$ s|^\{|\{\"_date\": \"$date\", |" | sponge "${directory}"twitter-"${handle}"-entity.jsonl

# download new tweets from profile, add only new tweets to existing jsonl, sort by date
if [ -f "${directory}"twitter-"${handle}"-profile.jsonl ]; then
    snscrape -v --retry 10 --jsonl twitter-profile "${handle}" > "${directory}"twitter-"${handle}"-profile.jsonl.temp
    jq -r '.id' "${directory}"twitter-"${handle}"-profile.jsonl.temp > "${directory}""${handle}"-profile-new_tweets.txt
    jq -r '.id' "${directory}"twitter-"${handle}"-profile.jsonl > "${directory}""${handle}"-profile-old_tweets.txt
    grep -vf "${directory}""${handle}"-profile-old_tweets.txt "${directory}""${handle}"-profile-new_tweets.txt > "${directory}"temp-"${handle}"-profile-new_tweets-filtered.txt
    grep -f "${directory}"temp-"${handle}"-profile-new_tweets-filtered.txt "${directory}"twitter-"${handle}"-profile.jsonl.temp >> "${directory}"twitter-"${handle}"-profile.jsonl
    mv "${directory}"twitter-"${handle}"-profile.jsonl "${directory}"twitter-"${handle}"-profile.jsonl.temp &>/dev/null
    jq -s -c 'sort_by(.date)[]' "${directory}"twitter-"${handle}"-profile.jsonl.temp > "${directory}"twitter-"${handle}"-profile.jsonl
    rm "${directory}"twitter-"${handle}"-profile.jsonl.temp &>/dev/null
    rm "${directory}""${handle}"-profile-new_tweets.txt "${directory}""${handle}"-profile-old_tweets.txt "${directory}"temp-"${handle}"-profile-new_tweets-filtered.txt &>/dev/null
else 
    snscrape -v --retry 10 --jsonl twitter-profile "${handle}" > "${directory}"twitter-"${handle}"-profile.jsonl
fi

# download new tweets from search, add only new tweets to existing jsonl, sort by date
if [ -f "${directory}"twitter-"${handle}"-user.jsonl ]; then
    snscrape -v --retry 10 --jsonl twitter-user "${handle}" > "${directory}"twitter-"${handle}"-user.jsonl.temp
    jq -r '.id' "${directory}"twitter-"${handle}"-user.jsonl.temp > "${directory}""${handle}"-user-new_tweets.txt
    jq -r '.id' "${directory}"twitter-"${handle}"-user.jsonl > "${directory}""${handle}"-user-old_tweets.txt
    grep -vf "${directory}""${handle}"-user-old_tweets.txt "${directory}""${handle}"-user-new_tweets.txt > "${directory}"temp-"${handle}"-user-new_tweets-filtered.txt
    grep -f "${directory}"temp-"${handle}"-user-new_tweets-filtered.txt "${directory}"twitter-"${handle}"-user.jsonl.temp >> "${directory}"twitter-"${handle}"-user.jsonl
    mv "${directory}"twitter-"${handle}"-user.jsonl "${directory}"twitter-"${handle}"-user.jsonl.temp &>/dev/null
    jq -s -c 'sort_by(.date)[]' "${directory}"twitter-"${handle}"-user.jsonl.temp > "${directory}"twitter-"${handle}"-user.jsonl
    rm "${directory}"twitter-"${handle}"-user.jsonl.temp &>/dev/null
    rm "${directory}""${handle}"-user-new_tweets.txt "${directory}""${handle}"-user-old_tweets.txt  "${directory}""${handle}"-new_tweets-filtered.txt &>/dev/null
else 
    snscrape -v --retry 10 --jsonl twitter-user "${handle}" > "${directory}"twitter-"${handle}"-user.jsonl
fi
wait

# get tweet ids from the urls, since the numeric tweet ids aren't reliable to use directly (they're so long)
cat "${directory}"twitter-"${handle}"-profile.jsonl "${directory}"twitter-"${handle}"-user.jsonl | jq -r '.url' | sort | uniq | grep -o '[0-9]*$' >> "${directory}"twitter-"${handle}"-ids.txt
cat "${directory}"twitter-"${handle}"-ids.txt | sort | uniq | sponge "${directory}"twitter-"${handle}"-ids.txt

# fetch metadata of all tweets using the official api (different metadata)
twarc2 hydrate "${directory}"twitter-"${handle}"-ids.txt "${directory}"twitter-"${handle}"-twarc.jsonl
wait

# download banner with unique name so old versions aren't overwritten
banner_url=$(tail -1 "${directory}"twitter-"${handle}"-entity.jsonl | jq -r '.profileBannerUrl')
banner_id=$(echo "${banner_url}" | awk -F '/' '{print $NF}')
wget -nc -O "${directory}"twitter-"${handle}"-banner-"${banner_id}".jpg "${banner_url}"
wait

# download profile pic with unique name so old versions aren't overwritten
profile_pic_url=$(tail -1 "${directory}"twitter-"${handle}"-entity.jsonl | jq -r '.profileImageUrl')
profile_pic_id=$(echo "${profile_pic_url}" | awk -F '/' '{print $5}')
echo "${profile_pic_url}" | sed 's/_normal//' | xargs -I% wget -nc -O "${directory}"twitter-"${handle}"-avatar-"${profile_pic_id}".jpg %
wait

# create media folder for the downloads below
mkdir -p "${directory}"twitter-"${handle}"-media

# extract media links
cat "${directory}"twitter-"${handle}"-profile.jsonl "${directory}"twitter-"${handle}"-user.jsonl | sed -nr 's%(.*https://pbs\.twimg\.com\/media/)([a-zA-Z_0-9]{15}).*%\2%p' | sort | uniq | sed '/^$/d' > "${directory}"temp-"${handle}"-image-ids.txt
cat "${directory}"twitter-"${handle}"-profile.jsonl "${directory}"twitter-"${handle}"-user.jsonl | grep -o 'https[:/a-zA-Z0-9._-]*\.m3u8' > "${directory}"temp-"${handle}"-video-urls.txt
cat "${directory}"temp-"${handle}"-video-urls.txt | sed -nr 's%(.*)(/)([_a-zA-Z0-9-]*)(\.m3u8)$%\3%p' | sort | uniq > "${directory}"temp-"${handle}"-video_ids.txt

# extract ids from yt-dlp archive
cat "${directory}"twitter-"${handle}"-media/_archive.txt | grep -oE '[_a-zA-Z0-9-]{10,20}' | sort | uniq > "${directory}"temp-"${handle}"-ytdl_archive_ids.txt

# extract only new ids/links
grep -vf "${directory}"temp-"${handle}"-ytdl_archive_ids.txt "${directory}"temp-"${handle}"-video_ids.txt > "${directory}"temp-"${handle}"-new_video_ids.txt
grep -f "${directory}"temp-"${handle}"-new_video_ids.txt "${directory}"temp-"${handle}"-video-urls.txt > "${directory}"temp-"${handle}"-new_video_urls.txt

# echo video count
old_video_count=$(cat "${directory}"twitter-"${handle}"-media/_archive.txt | wc -l)
new_video_count=$(cat "${directory}"temp-"${handle}"-video_ids.txt | wc -l)
echo "${old_video_count} videos in _archive.txt"
echo "${new_video_count} videos to be downloaded"

# parallel download maths 
video_lines=$(cat "${directory}"temp-"${handle}"-new_video_urls.txt | wc -l)
video_mod=$(($video_lines % 5))
video_per=$(($video_lines / 5))

image_lines=$(cat "${directory}"temp-"${handle}"-image-ids.txt | wc -l)
image_mod=$(($image_lines % 5))
image_per=$(($image_lines / 5))

# download videos
function download_videos() {
    for ((i = $video_per * $1 + 1  ; i <= ($video_per * $1) + $video_per ; i++)); do
        video_url=$(sed -n ''$i'p' "${directory}"temp-"${handle}"-new_video_urls.txt)
        yt-dlp --write-info-json -o "${directory}twitter-${handle}-media/%(id)s.%(ext)s" "${video_url}" --download-archive "${directory}"twitter-"${handle}"-media/_archive.txt
    done
}

function download_videos_remainder() {
    for ((i = $video_per * 5 + 1 ; i <= ($video_per * 5) + $video_mod ; i++)); do
        video_remainder_url=$(sed -n ''$i'p' "${directory}"temp-"${handle}"-new_video_urls.txt)
        yt-dlp --write-info-json -o "${directory}twitter-${handle}-media/%(id)s.%(ext)s" "${video_remainder_url}" --download-archive "${directory}"twitter-"${handle}"-media/_archive.txt
    done
}

download_videos 0 &
download_videos 1 &
download_videos 2 &
download_videos 3 &
download_videos 4
if [[ $video_mod != 0 ]]; then
    download_videos_remainder
fi
wait

# download images
function download_images() {
    for ((i = $image_per * $1 + 1  ; i <= ($image_per * $1) + $image_per ; i++)); do
        image_id=$(sed -n ''$i'p' "${directory}"temp-"${handle}"-image-ids.txt)
        echo $image_id | sed 's%^%https://pbs.twimg.com/media/%' | sed 's%$%?format=png\&name=4096x4096%' | xargs -I% wget -nc -O "${directory}"twitter-"${handle}"-media/"$image_id".png %
    done
}

function download_images_remainder() {
    for ((i = $image_per * 5 + 1 ; i <= ($image_per * 5) + $image_mod ; i++)); do
        image_id=$(sed -n ''$i'p' "${directory}"temp-"${handle}"-image-ids.txt)
        echo $image_id | sed 's%^%https://pbs.twimg.com/media/%' | sed 's%$%?format=png\&name=4096x4096%' | xargs -I% wget -nc -O "${directory}"twitter-"${handle}"-media/"$image_id".png %
    done
}

download_images 0 &
download_images 1 &
download_images 2 &
download_images 3 &
download_images 4
if [[ $image_mod != 0 ]]; then
    download_images_remainder
fi
wait

# remove temp files
rm "${directory}"twitter-"${handle}"-media/.png &>/dev/null
rm "${directory}"twitter-"${handle}"-banner-.jpg &>/dev/null
rm "${directory}"temp-"${handle}"-image-ids.txt &>/dev/null
rm "${directory}"temp-"${handle}"-video-urls.txt &>/dev/null
rm "${directory}"temp-"${handle}"-profile-new_tweets-filtered.txt &>/dev/null
rm "${directory}"temp-"${handle}"-user-new_tweets-filtered.txt &>/dev/null
rm "${directory}"temp-"${handle}"-ytdl_archive_ids.txt &>/dev/null
rm "${directory}"temp-"${handle}"-video_ids.txt &>/dev/null
rm "${directory}"temp-"${handle}"-new_video_ids.txt &>/dev/null
rm "${directory}"temp-"${handle}"-new_video_urls.txt &>/dev/null
done < "${directory}"input.txt
rm "${directory}"input.txt

Twi-Hard commented 2 years ago

I did some testing and it seems gallery-dl isn't getting the highest quality images. This account only posts high resolution pngs: https://twitter.com/DoshNSFW. gallery-dl downloads nearly all of them as jpg. If you use the url trick I mentioned earlier (?format=png&name=4096x4096) the sizes/quality match the originals that the artist posts on other platforms. I've tested the url trick on many accounts in the past and it seems there's a png version of every single image. I tested a large number of the downloaded images with the cli tool trid and it verified they are actual pngs, not just jpgs with a png extension. If gallery-dl could get the higher quality images I wouldn't need some complicated workaround to get them and wouldn't need one giant json file with all of the tweets.

Here's the size difference

❯ du -cs --si DoshNSFW-gallery-dl DoshNSFW-png
76M     DoshNSFW-gallery-dl
147M    DoshNSFW-png

and the image count (I couldn't get all of the media urls from gallery-dl -j URL for some reason; I only got 147 links from it, so it's not that I had 152 links and it only downloaded 147). This file count is only media files, no jsons.

❯ ls DoshNSFW-gallery-dl | wc -l
152
❯ ls DoshNSFW-png | wc -l
147

mikf commented 2 years ago

Every single image has a png version

It does, but, as it turns out, the encoded pixel data of such a PNG is identical to the JPG version.

For example https://pbs.twimg.com/media/EYoJVsYX0AApnb-?format=jpg&name=orig (63KB) and https://pbs.twimg.com/media/EYoJVsYX0AApnb-?format=png&name=4096x4096 (595KB)

Comparing these two with magick compare shows no difference (comparison attachment a),

... while re-encoding the PNG version as JPG and comparing it with the original JPG obviously yields lots of differences (comparison attachment b).

Converting both JPG and PNG to BMP also produces identical files.

The same is true for all JPG pictures from DoshNSFW (or in general) downloaded by gallery-dl. It would be nice if there were higher quality image versions available, but those are sadly just JPGs encoded as PNG, resulting in bigger files with the same quality.
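
For anyone wanting to reproduce the check, roughly (ImageMagick 7 syntax; file names are placeholders):

magick compare -metric AE sample.jpg sample.png diff.png        # no pixel differences
magick sample.png reencoded.jpg                                 # re-encode the PNG as JPG
magick compare -metric AE sample.jpg reencoded.jpg diff2.png    # plenty of differences now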

If gallery-dl could get the higher quality images I won't need some complicated workaround to get them and wouldn't need one giant json file with all of the tweets.

If you want, I can implement an option to force format=png for all Twitter images, but as I said, it would be kind of pointless and even wasteful.


Regarding your initial question, aka "Format -j like a metadata postprocessor would": Use a metadata post processor with "filename": "-". That will print one JSON object per line and should make it possible to combine it all into one file. Combine this with the url-metadata option to include the download URL into each metadata dict, and disable the default gallery-dl output with output.mode.
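
A rough config sketch along those lines — "file_url" is just an arbitrary key name for the inserted URL, and "null" is my guess at the quietest output.mode value:

{
    "extractor": {
        "twitter": {
            "url-metadata": "file_url",
            "postprocessors": [
                {"name": "metadata", "filename": "-"}
            ]
        }
    },
    "output": {
        "mode": "null"
    }
}

With that, gallery-dl twitter.com/USER > USER.jsonl should leave one JSON object per file on each line of USER.jsonl.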

Twi-Hard commented 2 years ago

I tried using oxipng to make the compression and metadata identical for both twitter's png version of an image and the original png file uploaded elsewhere, and they are still different files. magick compare showed they were different too. This is really disappointing. At least I know what's best now; I will just download everything with gallery-dl normally, I guess. Does gallery-dl no longer have to wait for the API? I remember it stopping all the time to wait to be allowed to do more requests, but that was probably a couple years ago. I just tried downloading an account with ~25,000 images (only about 15,000 were downloadable) and it finished in 1 hour, so I assume it wasn't rate limited. snscrape isn't limited by the API because it uses twitter's search instead of the normal public api, and I know gallery-dl uses search now too. Basically, is gallery-dl doing anything that might get rate limited? The list of accounts I want to download is very large, so I hope to avoid api rate limiting. Also, thanks for all the help :) I wish I could help more, but I don't know how to code and it seems like a beginner coder trying to help would do more harm than good because the code would need to be fixed.

mikf commented 2 years ago

There is still a rate limit for most API requests, but I think Twitter upped the number of possible requests from ~200 to 500 per 15-minute window. It might also have been only a 30-minute window in the past. gallery-dl now uses the results from the media timeline, which can get rate limited, followed by a search starting from the last media Tweet, which has no rate limit. This is basically the same strategy as twMediaDownloader, albeit with slightly different API endpoints and search filters.

Twi-Hard commented 2 years ago

That should be all for this issue. Thanks :)