timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways
GNU General Public License v3.0
2.4k stars 111 forks source link

Clean up the names and initializations of path variables #115

Closed lenaschimmel closed 1 year ago

lenaschimmel commented 1 year ago

I've noticed that there's a dozen of path variables at the start of main which do not follow a common naming pattern, and the way that their initializations reference each other makes it difficult to understand what is where:

    input_folder = '.'
    output_media_folder_name = 'media/'
    tweet_icon_path = f'{output_media_folder_name}tweet.ico'
    output_html_filename = 'TweetArchive.html'
    data_folder = os.path.join(input_folder, 'data')
    account_js_filename = os.path.join(data_folder, 'account.js')
    log_path = os.path.join(output_media_folder_name, 'download_log.txt')
    output_following_filename = 'following.txt'
    output_followers_filename = 'followers.txt'
    user_id_URL_template = 'https://twitter.com/i/user/{}'
    dm_output_filename_template = 'DMs-Archive-{}.md'

They end in five different postfixes, but they all represent a path (or path template), either pointing to a regular file or a directory:

When I implement #99, I need to work a lot with those paths, and because the initialization code needs to access the old paths to move things over to the new paths, I'll have to add even more of those variables. It's hard to find matching names for them right now.

So what do you think of a common naming pattern like this? I'll replicate the important part of my change here, since it's kinda buried at the bottom of the diff. This change does not affect how the program works, everything should still be in the same place. (So this does not solve #99, not even partially, it's just preparation for a cleaner implementation of #99.)

    dir_path_archive             = '.'                                                              # used to be `input_folder`
    dir_path_output_media        = os.path.join(dir_path_archive,       'media')                    # used to be `output_media_folder_name`
    dir_path_input_data          = os.path.join(dir_path_archive,       'data')                     # used to be `data_folder`
    file_path_output_html        = os.path.join(dir_path_archive,       'TweetArchive.html')        # used to be `output_html_filename`
    file_path_output_following   = os.path.join(dir_path_archive,       'following.txt')            # used to be `output_following_filename`
    file_path_output_followers   = os.path.join(dir_path_archive,       'followers.txt')            # used to be `output_followers_filename`
    file_path_template_dm_output = os.path.join(dir_path_archive,       'DMs-Archive-{}.md')        # used to be `dm_output_filename_template`
    file_path_account_js         = os.path.join(dir_path_input_data,    'account.js')               # used to be `account_js_filename`
    file_path_download_log       = os.path.join(dir_path_output_media,  'download_log.txt')         # used to be `log_path`
    file_path_tweet_icon         = os.path.join(dir_path_output_media,  'tweet.ico')                # used to be `tweet_icon_path`

I'm not always a fan of identifier prefixes like dir_path_ but I think it makes it much easier to see what's the same and what's different between adjacent lines.

The comments like # used to be log_path are there to help others resolve merge problems. Later, we won't need the old names for anything and could delete the comments.

Also, I think these path variables are conceptually immutable and globally valid, so I'm not very happy with the long argument lists in function calls just to pass them around. I'd really like to make them globally accessible, but I'm not sure what's the best way to do that in python. True global variables? Put them in a dict and make that dict global? Or pass that dict around, instead of several individual vars? Singleton?!

Depending on that, I might also change the variable names to screaming snake case, i.e. FILE_PATH_ACCOUNT_JS instead of file_path_account_js.

timhutton commented 1 year ago

For passing around groups of related variables (e.g. paths): a dataclass.

lenaschimmel commented 1 year ago

Okay, I think this is at nice as it gets. I used a frozen dataclass, so that paths cannot be modified accidentally during execution. And I shortened most identifiers, compared to my initial proposal (path.file_path_output_html does not need to have path twice).

BTW, dataclass is included since Python 3.7, and the script checks for version >= 3.6 at the beginning. I think that with the line import_module('dataclasses') it should also work on python 3.6, but I do not have a machine with 3.6 to test it.

lenaschimmel commented 1 year ago

Since I merged the newest commits from main into this branch, paths.file_output_html is no longer used, since parse_tweets now generates its own filenames. I would keep it that way for the moment, and deal with it during #99, ok?

lenaschimmel commented 1 year ago

FYI, I'll be offline the next ~ 8 to 10 hours...

timhutton commented 1 year ago

OK, nice work on this. Thanks!