mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.99k stars 977 forks

Some twitter posts are ignored #2875

Open rivke41levp656 opened 2 years ago

rivke41levp656 commented 2 years ago

There are some twitter posts that are embedded somehow such that gallery-dl does not detect them as media.

2 example video URLs:

https://twitter.com/AyakaOhashi/status/1555841160312025089
https://twitter.com/bang_dream_1242/status/1561548715348746241

Both of these can be downloaded via yt-dlp directly but gallery-dl ignores them with the following output:

```
gallery-dl -v --ignore-config https://twitter.com/AyakaOhashi/status/1555841160312025089
[gallery-dl][debug] Version 1.23.0
[gallery-dl][debug] Python 3.10.6 - Linux-5.19.4-arch1-1-x86_64-with-glibc2.36
[gallery-dl][debug] requests 2.28.1 - urllib3 1.26.12
[gallery-dl][debug] Starting DownloadJob for 'https://twitter.com/AyakaOhashi/status/1555841160312025089'
[twitter][debug] Using TwitterTweetExtractor for 'https://twitter.com/AyakaOhashi/status/1555841160312025089'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): twitter.com:443
[urllib3.connectionpool][debug] https://twitter.com:443 "GET /i/api/graphql/ItejhtHVxU7ksltgMmyaLA/TweetDetail?variables=%7B%22focalTweetId%22%3A%221555841160312025089%22%2C%22with_rux_injections%22%3Afalse%2C%22withCommunity%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withBirdwatchNotes%22%3Afalse%2C%22includePromotedContent%22%3Afalse%2C%22withSuperFollowsUserFields%22%3Atrue%2C%22withBirdwatchPivots%22%3Afalse%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%2C%22withSuperFollowsTweetFields%22%3Atrue%2C%22withClientEventToken%22%3Afalse%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Afalse%2C%22__fs_interactive_text%22%3Afalse%2C%22__fs_dont_mention_me_view_api_enabled%22%3Afalse%7D HTTP/1.1" 200 3893
[twitter][info] No results for https://twitter.com/AyakaOhashi/status/1555841160312025089
```

and here is an image URL, https://twitter.com/bang_dream_1242/status/1561674543323910144, that fails similarly.

Perhaps notably, these posts do not appear under twitter.com/user/media.

nisehime commented 2 years ago

You need to enable cards in your config https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractortwittercards

rivke41levp656 commented 2 years ago

I considered that but didn't realize it had the 'ytdl' option. I had cards set to true in my config. After switching it to 'ytdl' the videos now work, but the image still fails:

downloader.ytdl: ERROR: [twitter] 1561674543323910144: No video formats found!; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
download: Failed to download ytdl:https://twitter.com/i/web/status/1561674543323910144

But there is no video in this tweet, so it shouldn't be passed to yt-dlp.

nisehime commented 2 years ago

> but the image still fails

Yeah, even if you set cards to true it still can't download it.

mikf commented 2 years ago

The card type from 1561674543323910144 is not yet supported by gallery-dl, hence it forwards it to ytdl. (It will be supported the next time I do a git push)

rivke41levp656 commented 2 years ago

OK, sounds good. But I remember now why I didn't have cards set to "ytdl". The problem is some people post youtube embeds so just running gallery-dl twitter.com/user could result in it downloading dozens of hours of youtube videos instead of just a twitter feed. Is there any way to filter these out? I could use --filesize-max but that still downloads some data. I want to download only the cards that have native content on twitter, not those that link elsewhere.

Hrxn commented 2 years ago

What about "cards": true and "videos": "ytdl"?
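For reference, the combination suggested here can be written out as a config fragment. The sketch below builds it in Python and prints the JSON. The option names `cards` and `videos` come from this thread; placing them under `extractor` → `twitter` is an assumption based on the `extractor.twitter.cards` name in the linked configuration docs.

```python
import json

# "cards" and "videos" are the option names discussed in this thread.
# Nesting them under "extractor" -> "twitter" is assumed from the
# extractor.twitter.* naming used in the linked configuration docs.
config = {
    "extractor": {
        "twitter": {
            "cards": True,      # let gallery-dl handle the cards it supports
            "videos": "ytdl",   # hand videos to ytdl
        }
    }
}

# Serialize to JSON text that could be pasted into a config file.
config_text = json.dumps(config, indent=4)
print(config_text)
```

Note that Python's `True` is serialized as JSON `true`, which is the form used in the comment above.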

rivke41levp656 commented 2 years ago

> What about "cards": true and "videos": "ytdl"?

That's my current setup. It won't download the videos in the OP, I guess because they're unsupported so gallery-dl doesn't recognize them as videos to begin with.

mikf commented 2 years ago

I looked a bit more into this and the video from https://twitter.com/bang_dream_1242/status/1561548715348746241 is easy enough to support as well.

https://twitter.com/AyakaOhashi/status/1555841160312025089 on the other hand is a "broadcast"/livestream and requires m3u8/HLS support, so gallery-dl cannot download it without ytdl.

I guess I could implement some sort of card filter option with which it would be possible to ignore unwanted cards like YT embeds.

Hrxn commented 2 years ago

> I guess I could implement some sort of card filter option with which it would be possible to ignore unwanted cards like YT embeds.

Sounds great to me!

mikf commented 2 years ago

General support for all(?) unified cards got added in commit https://github.com/mikf/gallery-dl/commit/4d7cb0bf56e60a9aa93fd3e3430c282f72735a33, meaning it now also downloads image_website, video_website, etc. cards. (https://twitter.com/bang_dream_1242 has quite a lot of them)

There's now also a cards-blacklist option (https://github.com/mikf/gallery-dl/commit/4d78ca89dbea05ee22a8eebbd14d7d0b73ff7fcd). YT embed cards are of type player, but so are probably the cards of a lot of other external video sites.

Hrxn commented 2 years ago

Would it be a good idea to write the URLs blocked by "cards-blacklist": ["player"], for example, to the unsupported file? Or maybe to the log?

mikf commented 2 years ago

gallery-dl does not know which URL a specific card has before applying cards-blacklist, and I don't think there is an easy way of extracting this value. Each specific card type can be very different.

I could implement this functionality for player cards specifically, but there might be several other card types where this would also be necessary. Three lines of code could become 50 or 100 just to get a URL, and I don't know if I want to do that.

Hrxn commented 2 years ago

Oh, okay, I just assumed that the URL would already be known somehow for the cards, and that different card types would be rather similar? Apparently not.

I mean, the primary use case here would only be the player cards anyway, in order to avoid embedded YouTube clips (as mentioned in this comment), because as I've discovered myself, these can get really huge.

Of course, just a suggestion, simply disregard.

biggestsonicfan commented 2 years ago

Be careful with search filters for some users. For example, gallery-dl "https://twitter.com/search?q=from:S_ABOTEN__ filter:links" only grabs a single file, whereas gallery-dl https://twitter.com/S_ABOTEN__ grabs all 30.

mikf commented 2 years ago

@Hrxn I've updated the cards-blacklist option a bit to where it should now be possible to ignore youtube videos by simply specifying "youtube.com" or "player:youtube.com" in the list. It depends on the vanity_url value of a card, which should be present for at least all player cards. (https://github.com/mikf/gallery-dl/commit/e99a9b2afffc673ecc632c9d88ceec3af8fde1da)
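To illustrate the matching described above, here is a rough Python sketch of how blacklist entries such as "player", "youtube.com", or "player:youtube.com" could be checked against a card. This is not gallery-dl's actual code: the function `is_blacklisted` and its parameters are hypothetical, and only the entry formats and the use of `vanity_url` are taken from the comment.

```python
def is_blacklisted(card_type, vanity_url, blacklist):
    """Return True if a card should be skipped.

    Hypothetical helper for illustration only; not gallery-dl's code.
    card_type  -- card type name, e.g. "player"
    vanity_url -- the card's vanity_url value, or None if it has none
    blacklist  -- list of entries like "player", "youtube.com",
                  or "player:youtube.com"
    """
    entries = set(blacklist)

    # A bare card type blocks every card of that type.
    if card_type in entries:
        return True

    # Domain-based entries only apply when the card carries a vanity_url.
    if vanity_url:
        domain = vanity_url.lower()
        if domain in entries or f"{card_type}:{domain}" in entries:
            return True

    return False


blacklist = ["player:youtube.com"]
print(is_blacklisted("player", "youtube.com", blacklist))  # YouTube player card
print(is_blacklisted("player", "example.org", blacklist))  # some other player card
print(is_blacklisted("image_website", None, blacklist))    # card without vanity_url
```

Under these assumptions, "player:youtube.com" skips only YouTube player cards, while a plain "player" entry skips every player card regardless of site.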