mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[twitter] `-o text-tweets=true` now downloads reply target's posts #2712

Closed AlttiRi closed 1 year ago

AlttiRi commented 2 years ago

The bug appeared after the update.

I pass -o text-tweets=true only when I also need to save the text of tweets without media (in effect, to download everything from the passed profile), since that is usually required only for some profiles.

alias gga='gallery-dl --download-archive ~/gallery-dl/gallery-dl.sqlite'
alias ggat='gallery-dl -o text-tweets=true --download-archive ~/gallery-dl/gallery-dl.sqlite'

The conf:

        "twitter":
        {
            "retweets": false,
            "directory": ["[gallery-dl]", "[{category}] {author[name]}"],
            "filename": "[{category}] {author[name]}—{date:%Y.%m.%d}—{retweet_id|tweet_id}—{filename}.{extension}",
            "size": ["orig", "4096x4096", "large", "medium", "small"],
            "fallback": false,
            "cards": false,
            "pinned": true,
            "replies": "self",
            "cookies": {
                "auth_token": "XXX"
            },
            "text-tweets": false,
            "postprocessors": [{
                "name": "mtime",
                "event": "post"
            }, {
                "directory": "metadata",
                "filename": "[{category}] {author[name]}—{date:%Y.%m.%d}—{retweet_id|tweet_id}.html",
                "name": "metadata",
                "event": "post",
                "mtime": true,
                "mode": "custom",
                "archive": "~/gallery-dl/gallery-dl-postprocessors.sqlite",
                "archive-format": "{tweet_id}_{retweet_id}_p1",
                "format": "<div id='{retweet_id|tweet_id}'><h4><a href='https://twitter.com/{author[name]}/status/{retweet_id|tweet_id}'>{retweet_id|tweet_id}</a> by <a href='https://twitter.com/{author[name]}'>{author[name]}</a></h4><div class='content'>{content}</div><hr><div>{date:%Y.%m.%d %H:%M:%S}</div><hr></div><br>"
            }]
        },

Now when I use ggat it downloads the ~retweets~ reply target posts, which is undesirable. gga works as expected (as before).

AlttiRi commented 2 years ago

Wait, it's not retweets.

It's media (and descriptions) from other profiles that the target profile replies to.

I want to download only content of the passed profile, without any content of other profiles.

nisehime commented 2 years ago

This is why I said to set the default value for the replies option to self. Why were my words ignored...

To be precise, I guess it should be set for the timeline/replies extractors only.

Anyway, replies: "self" in your config

AlttiRi commented 2 years ago

I just passed a https://twitter.com/profile link.

https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractortwitterreplies
It seems it's just a bug.

In this case I want to download all of the profile's media posts, and with -o text-tweets=true all of the profile's posts (media and non-media), without any other profile's content.

Something like "related": ["retweets", "replies"] ([]/false — for none) — one key that will define which related profile's content will be downloaded would be more convenient for configuration I think.

nisehime commented 2 years ago

It's not a bug. This is how gallery-dl operates when you pass normal profile links. (https://github.com/mikf/gallery-dl/commit/915dba8345d3d457a80f08fb34d0409b00829444, https://github.com/mikf/gallery-dl/commit/0add1fc0908ae460da173e305bd8659632f6807b)

In your case, when you set text-tweets=true, gallery-dl uses the replies timeline.

nisehime commented 2 years ago

Actually, I just tested and it definitely doesn't work right. For example: https://twitter.com/amanatsu_mikan7/with_replies (NSFW)

[gallery-dl][warning] logfile: missing or invalid path (expected str, bytes or os.PathLike object, not NoneType)
[gallery-dl][debug] Version 1.22.2 - Executable
[gallery-dl][debug] Python 3.7.9 - Windows-8.1-6.3.9600
[gallery-dl][debug] requests 2.28.0 - urllib3 1.26.9
[gallery-dl][warning] unsupportedfile: [Errno 2] No such file or directory: 'C:\\Users\\Madobe\\Desktop\\f\\logs\\unsupported.txt'
[gallery-dl][debug] Starting DownloadJob for 'https://twitter.com/amanatsu_mikan7/with_replies'
[twitter][debug] Using TwitterRepliesExtractor for 'https://twitter.com/amanatsu_mikan7/with_replies'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): twitter.com:443
[urllib3.connectionpool][debug] https://twitter.com:443 "GET /i/api/graphql/7mjxD3-C6BxitPMVQ6w0-Q/UserByScreenName?variables=%7B%22screen_name%22%3A%22amanatsu_mikan7%22%2C%22withSafetyModeUserFields%22%3Atrue%2C%22withSup
[urllib3.connectionpool][debug] https://twitter.com:443 "GET /i/api/graphql/t4wEKVulW4Mbv1P0kgxTEw/UserTweetsAndReplies?variables=%7B%22userId%22%3A%222402630918%22%2C%22count%22%3A100%2C%22withCommunity%22%3Atrue%2C%22incl
[twitter][debug] Using download archive './archive/twitter_db.sqlite3'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): pbs.twimg.com:443
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWV5Ct8VsAAsXCo?format=jpg&name=orig HTTP/1.1" 200 79345
* .\galleries\twitter\tantou_KAI (2839069854)\[22-06-28] 1541771537174646784_p1.jpg
[twitter][debug] Skipping 1541773999071318016 (reply)
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWU_0g7akAAJ5vr?format=jpg&name=orig HTTP/1.1" 200 113915
* .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541708638997598208_p1.jpg
[twitter][debug] Skipping 1541709905710632960 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541708638997598208_p1.jpg
[twitter][debug] Skipping 1541715055061807104 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541708638997598208_p1.jpg
[twitter][debug] Skipping 1541714920911147009 (reply)
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWVGQUuaMAAvq-m?format=jpg&name=orig HTTP/1.1" 200 260149
* .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541715698661355520_p1.jpg
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541708638997598208_p1.jpg
[twitter][debug] Skipping 1541714418962399234 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-28] 1541708638997598208_p1.jpg
[twitter][debug] Skipping 1541708825022976000 (reply)
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWRN1t9aQAAhX6-?format=jpg&name=orig HTTP/1.1" 200 177351
* .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541705469923717128 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541531280139231232 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541520782773211137 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541485179201417216 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541463159822700544 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541458604124798977 (reply)
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWTRtO1UUAEDD_j?format=jpg&name=orig HTTP/1.1" 200 196468
* .\galleries\twitter\Agovitch1 (1353631856877477889)\[22-06-28] 1541591372226248704_p1.jpg
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541455347461681153 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541449072736841734 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541449028252372993 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541449017174794240 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541448759649120256 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541448545836101632 (reply)
# .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541442617300635648_p1.jpg
[twitter][debug] Skipping 1541446637016723456 (reply)
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWRS51CaMAE3JZk?format=jpg&name=orig HTTP/1.1" 200 237222
* .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541448131329822720_p1.jpg
[urllib3.connectionpool][debug] https://pbs.twimg.com:443 "GET /media/FWRS51JakAEM3XE?format=jpg&name=orig HTTP/1.1" 200 369312
  .\galleries\twitter\amanatsu_mikan7 (2402630918)\[22-06-27] 1541448131329822720_p2.jpg
KeyboardInterrupt

1. This tweet https://twitter.com/miyabi_1_2015/status/1541772932879331328 isn't skipped, despite replies: "self" in the config. Probably because it has a quoted tweet.
2. You can see that gallery-dl attempted to download tweets 1541708638997598208 and 1541442617300635648 multiple times. But this problem also occurs in 1.22.1 on the replies timeline.

ponchojohn1234 commented 2 years ago

I've been having similar issues since the update to 1.22.2, and I've found another weird thing: this tweet (https://twitter.com/Anon2000000/status/1538118265335062528) is skipped if I log in (e.g. with -u and -p), but it's handled fine if I'm not logged in. Maybe try running gallery-dl with "replies": "self" and without cookies to see how it behaves.

ponchojohn1234 commented 2 years ago

For example, running this:

gallery-dl https://twitter.com/Anon2000000 --write-metadata

ends with only one folder (screenshot: explorer_bgc8C6T4hd), but running this:

gallery-dl https://twitter.com/Anon2000000 -u xxxxxx -p xxxxxxxx --write-metadata

ends with 213 folders, all of which are images that the user replied to (screenshot: explorer_9POG7xpSdQ). Both used the same config:

"twitter":
        {
            "pinned": true,
            "quoted": true,
            "videos": true,
            "cards": false,
            "conversations": false,
            "replies": "self",
            "retweets": true,
            "twitpic": false,
            "syndication": true,
            "archive-format": "{tweet_id}_{num}",
            "image": {
              "archive-format": "image{filename}"
            }
        }

ponchojohn1234 commented 2 years ago

It also seems similar to the problem in #2713, since all of these problems come back to getting data from users adjacent to the one specified with the link.

nisehime commented 2 years ago

> https://twitter.com/Anon2000000/status/1538118265335062528 this tweet is skipped if i login (eg. with -u and -p) but its handled fine if im not logged in

This tweet would also be skipped on previous versions (although you should confirm it yourself, with retweets enabled in the config). Sorry, it won't, since the tweet is displayed in the normal timeline, but that does not negate my point. I explained why it might happen here. While you're logged out, Twitter does not combine tweets into threads, so the tweet isn't skipped.

> for example running this: gallery-dl https://twitter.com/Anon2000000 --write-metadata ends with only one folder

Same thing. Logged off = non-target user tweets are not included in the replies. Details.

None of these issues are related to the current version, they've been here for a long time.

ponchojohn1234 commented 2 years ago

Good to know. I still don't know why it gets posts from other users while logged in, though. Even if it's treating the replies as retweets or quote tweets, it's still handling them like individual posts, so you get all the folders with them. That is the main problem at hand.

Hrxn commented 2 years ago

> None of these issues are related to the current version, they've been here for a long time.

Are you sure about that? I mean, what's the implication here, that "replies": "self" had this issue even before the changes from #2665?

Strange, because I've never experienced it here. "replies": "self" wouldn't result in "unexpected" replies (i.e. replies actually made by a different user). The only old issue was long threads/conversations, because they would be truncated by Twitter and thus some replies could be missed, as I understood it.

nisehime commented 2 years ago

> I mean, what's the implication here, that "replies": "self" had this issue even before the changes from https://github.com/mikf/gallery-dl/issues/2665?

The issue with "self" before the change was that it would not download the target user's tweet if it was a reply to another user. The issue with unexpected replies occurred with replies: true.

Hrxn commented 2 years ago

> The issue with "self" before the change was that it would not download the target user's tweet if it was a reply to another user.

Yes, but only for long conversations, i.e. long enough to get "truncated" by Twitter, as you've reported in another issue, if I'm not mistaken, right?

ponchojohn1234 commented 2 years ago

> The issue with "self" before the change was that it would not download the target user's tweet if it was a reply to another user. The issue with unexpected replies occurred with replies: true.

Actually, "replies": "self" is leading to unexpected replies now in 1.22.2. replies: true worked fine before, but it missed stuff instead; it didn't add unrelated tweets which the user replied to.

ponchojohn1234 commented 2 years ago

And specifically, it leads to unexpected tweets only when logged in; it doesn't without a login, which is also odd.

nisehime commented 2 years ago

> Actually, "replies": "self" is leading to unexpected replies now in 1.22.2

Yeah, it seems you're right. So it wasn't fixed then.

> Yes, but only for long conversations, i.e. long enough

No, any conversations. Also it affects search, for example, and probably even /media timeline. Basically:

tweet from user1 = {replied to: user1} - downloaded
tweet from user1 = {replied to: user2} - skipped
tweet from user1 = {replied to: user1, user2} - downloaded

Something like that.
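
The three cases above can be written as a small predicate. This is only a sketch of the behavior as described in this comment, not gallery-dl's actual code: the function name and data shapes are made up, and treating a non-reply as "downloaded" is an assumption.

```python
def downloaded_with_replies_self(author, replied_to):
    """Model of the described old behavior of "replies": "self".

    author     -- screen name of the tweet's author
    replied_to -- set of screen names the tweet replies to
    A reply is kept only if the author is among the accounts replied to.
    """
    if not replied_to:
        # Assumption: a tweet that is not a reply is always kept.
        return True
    return author in replied_to


cases = [
    ("user1", {"user1"}),
    ("user1", {"user2"}),
    ("user1", {"user1", "user2"}),
]
results = [downloaded_with_replies_self(a, r) for a, r in cases]
for (author, targets), keep in zip(cases, results):
    print(author, sorted(targets), "downloaded" if keep else "skipped")
```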

ponchojohn1234 commented 2 years ago

It fixed something, because even while logged out I'm getting more images than before (like the ones in threads it was missing before), but for some reason it goes way off the rails if you're logged in and gets some tweets it shouldn't.

mikf commented 2 years ago

The "unexpected tweets only when logged in" are unrelated to "replies": "self" and happen because Twitter expands conversations when logged in, as nisehime explained further up. Those tweets are not replies, so replies doesn't even trigger for them.

I'm very much considering reverting 0add1fc0 and not using user_tweets_and_replies for user urls, since that seems to be the root cause of many problems.

Also, what about an option that basically does the same as --filter "author['id'] == <user id>", so it filters out any tweets not from the actual user?
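
To illustrate what such a filter would do, here is a rough Python sketch that evaluates the same kind of expression against per-tweet metadata. It does not show how gallery-dl evaluates --filter internally. The metadata dictionaries are made up (the ids and names are taken from the debug log above), and storing author['id'] as an integer is an assumption.

```python
# Hypothetical per-tweet metadata; only the "author" field matters here.
tweets = [
    {"tweet_id": 1541708638997598208,
     "author": {"id": 2402630918, "name": "amanatsu_mikan7"}},
    {"tweet_id": 1541771537174646784,
     "author": {"id": 2839069854, "name": "tantou_KAI"}},
    {"tweet_id": 1541591372226248704,
     "author": {"id": 1353631856877477889, "name": "Agovitch1"}},
]

TARGET_USER_ID = 2402630918  # amanatsu_mikan7, as seen in the debug log

# Same shape as the --filter expression, with the user id filled in.
expression = compile(f"author['id'] == {TARGET_USER_ID}", "<filter>", "eval")


def keep(tweet):
    # Evaluate the expression with the tweet's metadata fields as variables.
    return bool(eval(expression, {"__builtins__": {}}, tweet))


kept = [t["tweet_id"] for t in tweets if keep(t)]
print(kept)
```

Only the tweet authored by the target account passes; the two tweets from other accounts are dropped.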

ponchojohn1234 commented 2 years ago

I forgot about filters lol, but yeah, it seems that an option would be optimal, since it can be a headache having 300 unrelated files all of a sudden.

nisehime commented 2 years ago

The problem is that the tweet Anon2000000 is replying to (https://twitter.com/Anon2000000/status/1541508842269425667) is not a reply itself, so gallery-dl's replies behavior is not applied here.

> Also, what about an option that basically does the same as --filter "author['id'] == <user id>", so it filters out any tweets not from the actual user?

This is what I asked for a long time ago. But not as an option; rather, the target user's metadata should always be accessible in the keyword dictionary, so that people can write these filters themselves. That would also fix issues like this one: https://github.com/mikf/gallery-dl/issues/2713

nisehime commented 2 years ago

> I'm very much considering reverting https://github.com/mikf/gallery-dl/commit/0add1fc0908ae460da173e305bd8659632f6807b and not using user_tweets_and_replies for user urls, since that seems to be the root cause of many problems.

I think that's not a good idea. I suppose people expect gallery-dl to gather all the tweets from the timeline, and not using the replies timeline could miss quite a lot, considering that an average user is probably not even aware that using twitter.com/user and twitter.com/user/with_replies would make a difference.

mikf commented 2 years ago

> using twitter.com/user and twitter.com/user/with_replies would make a difference

Since https://github.com/mikf/gallery-dl/commit/0add1fc0908ae460da173e305bd8659632f6807b, gallery-dl uses twitter.com/user/with_replies (+ search) for twitter.com/user as input URL when (retweets or text-tweets) and replies are enabled, and that seems to cause a lot of problems.
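
Spelled out as code, the condition described above looks roughly like this. It is a sketch of the rule as stated in this comment, not the extractor's real implementation: the function name and return strings are invented, and the fallback branch is a placeholder because the comment does not say what is used otherwise.

```python
def timeline_for_user_url(retweets, text_tweets, replies):
    """Timeline chosen for a plain twitter.com/user URL since commit 0add1fc0.

    Returns "with_replies+search" when (retweets or text-tweets) and replies
    are enabled. Anything else returns the placeholder "other".
    """
    if (retweets or text_tweets) and replies:
        return "with_replies+search"
    return "other"


# Settings from the original report: retweets off, replies set to "self".
print(timeline_for_user_url(retweets=False, text_tweets=True, replies="self"))
print(timeline_for_user_url(retweets=False, text_tweets=False, replies="self"))
```

The first call corresponds to the ggat alias (text-tweets=true) and the second to the gga alias, which matches the report that only ggat changed behavior.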

> not using replies timeline could miss quite a lot

You sure about that? I always thought that at least /media had everything up to a certain point in time, including media posted as reply.

ponchojohn1234 commented 2 years ago

/media does seem to do a good job of getting tweets like this one even when logged in, so I guess that could be a workaround: https://twitter.com/Anon2000000/status/1538118265335062528

nisehime commented 2 years ago

> You sure about that? I always thought that at least /media had everything up to a certain point in time, including media posted as reply.

Yes, as long as it's only /media it should be fine, but when people set retweets: true or text-tweets: true, so that gallery-dl is made to use the normal timeline, it would miss tweets. You can of course leave it to users to deal with it, since I'm probably the only person here who noticed all those problems with missing tweets in normal timelines, but some day someone else may notice them too, and we will come full circle.

I also think solutions like using the /media + /with_replies + /tweets timelines are ugly and not really practical, especially when you have quite a large (500+) list of links to download. It creates unnecessary traffic and makes the whole process longer, even with all the skips of already downloaded content. Also, it can trigger Twitter's time-out thing, though I'm not sure about that.

nisehime commented 2 years ago

gallery-dl's keyword dictionary for twitter always contains author and user objects for tweet metadata; however, the only case in which those objects differ is retweets. The rest of the time they're identical, correct me if I'm wrong. So there's room for improvement. Say, the user object could always contain the target user's metadata. Not sure how it should behave with subcategories which don't have a target user, like search, though.

nisehime commented 2 years ago

Also, it won't make much sense without the proposed thread expansion. I've seen you added an expand option and it seems to be working, but something needs to be done about repeated file download attempts. The only idea I have here is to temporarily keep their IDs in RAM.
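
The "keep their IDs in RAM" idea could look something like the sketch below. This is an illustration only, not gallery-dl code: the function name is hypothetical, and the input is a made-up sequence that reuses tweet IDs from the log above.

```python
def skip_repeats(tweets):
    """Yield each tweet once, remembering already-seen tweet IDs in memory."""
    seen = set()
    for tweet in tweets:
        tweet_id = tweet["tweet_id"]
        if tweet_id in seen:
            continue  # already handled earlier in this run
        seen.add(tweet_id)
        yield tweet


# Made-up order in which an expanded conversation might return tweets.
stream = [{"tweet_id": i} for i in (
    1541708638997598208, 1541715698661355520,
    1541708638997598208, 1541442617300635648,
    1541442617300635648,
)]
unique_ids = [t["tweet_id"] for t in skip_repeats(stream)]
print(unique_ids)
```

The set lives only for the duration of one run, so it avoids repeated attempts within a run without touching the download archive.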

mikf commented 2 years ago

Regarding this issue's original topic: I've put out another release that no longer uses the /with_replies endpoint and reverts to the same behavior as v1.22.1.

The changes from https://github.com/mikf/gallery-dl/commit/0add1fc0908ae460da173e305bd8659632f6807b should be re-applied at some point, but it was premature to do so in v1.22.2.

God-damnit-all commented 2 years ago

> > You sure about that? I always thought that at least /media had everything up to a certain point in time, including media posted as reply.
>
> Yes, as long as it's only /media it should be fine, but when people set retweets: true or text-tweets: true, so that gallery-dl is made to use the normal timeline, it would miss tweets. You can of course leave it to users to deal with it, since I'm probably the only person here who noticed all those problems with missing tweets in normal timelines, but some day someone else may notice them too, and we will come full circle.
>
> I also think solutions like using the /media + /with_replies + /tweets timelines are ugly and not really practical, especially when you have quite a large (500+) list of links to download. It creates unnecessary traffic and makes the whole process longer, even with all the skips of already downloaded content. Also, it can trigger Twitter's time-out thing, though I'm not sure about that.

So should I do a pass with retweets/text tweets disabled and then a pass with them enabled?

nisehime commented 2 years ago

> So should I do a pass with retweets/text tweets disabled and then a pass with them enabled?

If you only need the user's media content, then leave both disabled or use a twitter.com/user/media link (but gallery-dl won't automatically perform a search for older tweets with that link).

If you want retweets too and don't want to miss media content, then do it in two passes.

If you want all text tweets too... Well, that's complicated. The changes are reverted now, so you should use a twitter.com/user/with_replies link, be prepared to face the problems caused by it, and keep in mind that some text tweets are still going to be missed. Just like with /media, the automatic search for older tweets won't be performed. Alternatively or additionally, you can also manually apply a search query for all of the user's tweets.

God-damnit-all commented 2 years ago

Is there any way to disable the new search functionality? I already had that handled as part of my script and now it's probably going to lead to a lot of API requests I don't want it to do.

nisehime commented 2 years ago

Use a /tweets link, like twitter.com/user/tweets. /media and /with_replies don't perform a search, as I said.

AlttiRi commented 2 years ago

I'm too lazy to read the messages above, so here is what I would like:

Downloading the content of the passed profile only, by just passing a direct profile link.

Downloading of third-party retweets and commented third-party tweets is enabled with an extra config option.

If it worked this way out of the box, I think it would be convenient and intuitive.

3k2 commented 2 years ago

I just decided to downgrade to gallery-dl 1.21.2-1 for the time being, until this is solved or I find out how to make the folders better structured.

God-damnit-all commented 2 years ago

@mikf All the strategy options include search, but I specifically want to avoid having it search under any circumstance. It's a lot of API calls and I handle searching elsewhere already.

mikf commented 2 years ago

nisehime already explained that in https://github.com/mikf/gallery-dl/issues/2712#issuecomment-1169374407

If you don't want a search, use twitter.com/USER/tweets, twitter.com/USER/media, and twitter.com/USER/with_replies.

The search only happens for direct user URLs (twitter.com/USER) since version 1.22.0 (915dba83) in the hopes of improving the default behavior for users that don't have any extra scripts.

strategy only applies for those direct URLs to give a bit more control and avoid issues like this one here.
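
As a rough illustration of the rule above, the sketch below classifies a URL by whether it would trigger the search. It only encodes what is said in this comment: the function name is hypothetical, other URL forms are deliberately rejected, and the strategy option is not modeled.

```python
from urllib.parse import urlparse

# Timeline suffixes that, per the explanation above, never trigger a search.
NO_SEARCH_SUFFIXES = {"tweets", "media", "with_replies"}


def triggers_search(url):
    """Return True only for a direct user URL such as https://twitter.com/USER."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 1:
        return True   # twitter.com/USER
    if len(parts) == 2 and parts[1] in NO_SEARCH_SUFFIXES:
        return False  # twitter.com/USER/tweets, /media, /with_replies
    raise ValueError(f"URL form not covered by this sketch: {url}")


for url in ("https://twitter.com/USER",
            "https://twitter.com/USER/tweets",
            "https://twitter.com/USER/media",
            "https://twitter.com/USER/with_replies"):
    print(url, triggers_search(url))
```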

God-damnit-all commented 2 years ago

I see, I misunderstood what the new documentation was saying.

Does that mean the workaround for getting all the tweets I described here https://github.com/mikf/gallery-dl/issues/2712#issuecomment-1169247338 is still necessary?

If so, would it be possible for me to somehow retrieve the json of tweets for both with_replies and media, and then feed the list of urls back into gallery-dl? The new 'unique' setting seems like it would make it so it didn't have to check the same tweet twice.
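
One way the merging step might be done outside gallery-dl is sketched below in Python. How the two URL lists are obtained is not shown, and the sample lists are placeholders built from tweet URLs mentioned in this thread. The helper name and the output file name are hypothetical.

```python
import re
import tempfile
from pathlib import Path

STATUS_ID = re.compile(r"/status/(\d+)")


def merge_tweet_urls(*url_lists):
    """Union several lists of tweet URLs, keeping one URL per tweet ID."""
    by_id = {}
    for urls in url_lists:
        for url in urls:
            match = STATUS_ID.search(url)
            if match:
                by_id.setdefault(int(match.group(1)), url)
    # Tweet IDs grow over time, so descending order puts the newest first.
    return [by_id[i] for i in sorted(by_id, reverse=True)]


# Placeholder inputs standing in for the with_replies and media timelines.
with_replies = [
    "https://twitter.com/Anon2000000/status/1541508842269425667",
    "https://twitter.com/Anon2000000/status/1538118265335062528",
]
media = [
    "https://twitter.com/Anon2000000/status/1538118265335062528",
]
merged = merge_tweet_urls(with_replies, media)

# Write one URL per line to a temporary file for this demo.
out = Path(tempfile.gettempdir()) / "tweets.txt"
out.write_text("\n".join(merged) + "\n", encoding="utf-8")
print(merged)
```

The resulting file could then be handed back to gallery-dl, assuming its -i/--input-file option is used for that; each tweet appears only once in the list, so it is not requested twice.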

God-damnit-all commented 2 years ago

@mikf Until such an option to disable search is implemented, I figure I should modify the code for a personal copy, but looking at the code, no easy way to do it really sticks out to me. What should I do?