mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.67k stars, 952 forks

How to generate metadata file for tweets without media? #570

Open: Twi-Hard opened 4 years ago

Twi-Hard commented 4 years ago

How can I download the metadata without there being media in the tweets? This is the kind of data I'm talking about:

    {
        "category": "twitter",
        "content": "Of course, the day I go on Twitter to shame my Secret Santa for not getting me anything, this arrives.",
        "date": "2015-12-28 21:23:04",
        "extension": "jpg",
        "filename": "CXV7BZvUkAEJfH7",
        "num": 1,
        "retweet_id": 0,
        "retweeter": "",
        "subcategory": "media",
        "tweet_id": 681586186498908161,
        "user": "M_A_Larson",
        "user_id": 532975158,
        "username": "M.A.Larson"
    }

mikf commented 4 years ago

Not possible at the moment. Posts/Tweets/etc without media, regardless of the site, are ignored.

Scripter17 commented 3 years ago

Could I be so rude as to ask for this to be looked into?

I'm trying to go through the code to see what changes need to be made, but it's taking a lot of effort.

Edit: I managed to get this barely working by adding the following code to the start of DownloadJob.handle_directory in gallery-dl/jobs.py:

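        import copy, json  # in case the module does not already import them
        # Work on a deep copy so the job's own kwdict is not modified.
        kwdictWriteable = copy.deepcopy(kwdict)
        # datetime objects are not JSON-serializable, so they are replaced for now.
        kwdictWriteable["date"] = "//TODO: THIS"
        kwdictWriteable["author"]["date"] = "//TODO: THIS"
        with open("./" + str(kwdict["tweet_id"]) + ".json", "w", encoding="utf-8") as fp:
            fp.write(json.dumps(kwdictWriteable))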

By "barely working" I mean it writes the contents of tweets without media to a file. This could be completely the wrong way to go about it, but it's progress.
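
For the two "//TODO: THIS" placeholders, one possible approach (a sketch, not what gallery-dl itself does) is to give json.dumps a fallback function through its `default` parameter, so datetime values are converted while serializing and the original dict stays untouched. The `to_jsonable` helper and the sample `kwdict` values below are hypothetical and only serve the illustration:

```python
import datetime
import json


def to_jsonable(value):
    """Fallback for json.dumps: turn datetime objects into strings."""
    if isinstance(value, datetime.datetime):
        # sep=" " yields "YYYY-MM-DD HH:MM:SS" when microseconds are zero
        return value.isoformat(sep=" ")
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Hypothetical stand-in for the kwdict of a tweet without media.
kwdict = {
    "tweet_id": 681586186498908161,
    "date": datetime.datetime(2015, 12, 28, 21, 23, 4),
    "author": {
        "name": "M_A_Larson",
        "date": datetime.datetime(2000, 1, 1, 0, 0, 0),  # placeholder value
    },
}

# json.dumps calls to_jsonable only for values it cannot serialize itself,
# including ones nested inside other dicts.
text = json.dumps(kwdict, default=to_jsonable, indent=4)
print(text)
```

Because the conversion happens during serialization, no copy of the dict is needed and nested dicts are handled without a separate recursive pass.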

Scripter17 commented 3 years ago

Made a slightly less jank proof of concept, to be added after tdata.update(metadata).

It's not clean in the slightest and it doesn't even format the JSON, but it works as a temporary machine-parseable solution. I just hope @mikf can do the proper implementation for me, because I can't make heads or tails of this codebase.

            # TEMPORARY AND JANK MEDIALESS TWEET SOLUTION
            import os, copy, datetime, json
            # Create gallery-dl/twitter/<user name>/ if it does not exist yet.
            outdir = os.path.join("gallery-dl", "twitter", tdata["user"]["name"])
            os.makedirs(outdir, exist_ok=True)
            def deepClean(obj):
                # Recursively replace datetime values with Unix timestamps
                # so that the dict becomes JSON-serializable.
                for key in obj.keys():
                    if isinstance(obj[key], datetime.datetime):
                        obj[key]=obj[key].timestamp()
                    elif isinstance(obj[key], dict):
                        obj[key]=deepClean(obj[key])
                return obj
            # Clean a deep copy so the original tdata keeps its datetime objects.
            tdataWriteable=copy.deepcopy(tdata)
            with open(os.path.join(outdir, str(tdata["tweet_id"])+".json"), "w", encoding="utf-8") as fp:
                fp.write(json.dumps(deepClean(tdataWriteable)))
            # TEMPORARY AND JANK MEDIALESS TWEET SOLUTION

mikf commented 3 years ago

Should now be possible by enabling the text-tweets option (https://github.com/mikf/gallery-dl/commit/724ca61f3600037cc57033891354cded10633079, https://github.com/mikf/gallery-dl/commit/b5affc62aa84847f3ac0c39eda675f3ced761a9f) together with the right postprocessor settings:

    "twitter": {
        "text-tweets": true,
        "postprocessors": [
            {
                "name": "metadata",
                "event": "post",
                "filename": "{tweet_id}.json"
            }
        ]
    }

(see also https://github.com/mikf/gallery-dl/issues/1569#issuecomment-846428927)
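
For anyone who wants to try the settings above without touching their main configuration, here is a rough Python sketch that writes them into a separate file. It assumes the "twitter" block belongs under the top-level "extractor" key, as in gallery-dl's usual config layout, and that the file can be passed with the --config command-line option; the file name and temporary directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Settings from the comment above, nested under "extractor"
# (assumed to match gallery-dl's usual config file layout).
config = {
    "extractor": {
        "twitter": {
            "text-tweets": True,
            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post",
                    "filename": "{tweet_id}.json",
                }
            ],
        }
    }
}

# Hypothetical location; a temporary directory keeps the sketch
# free of side effects on any existing configuration.
config_path = Path(tempfile.mkdtemp()) / "text-tweets.conf"
config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")

# Assumed invocation (not executed here):
#   gallery-dl --config <config_path> <URL of a Twitter timeline>
print(config_path)
```

With "event": "post", the metadata postprocessor should run once per tweet rather than once per downloaded file, which is why tweets without media can produce a {tweet_id}.json file at all.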

Why does this need an extra option?

Because it would cause a lot of needless processing and path generation for data that gets discarded most of the time.

God-damnit-all commented 3 years ago

Text-only seems like a bit of a misnomer; it implies that media wouldn't be downloaded from tweets that do have them. Perhaps "non-media" would be a better term? Or maybe "text-tweets"?

Scripter17 commented 3 years ago

I don't mean to be that guy, but if/when this gets implemented for other websites, it'd make more sense for it to be named include-medialess.

KaMyKaSii commented 3 years ago

I would like this to be implemented on 4chan and similar sites. And I believe that several other people also have the same need on different sites, so maybe it would be good to add an option to save the metadata of posts without media regardless of the site?