richfromm / slack2discord

A Discord client that imports Slack-exported JSON chat history to Discord channel(s).
GNU General Public License v3.0
8 stars 3 forks source link

downloader.py -- replace the local file with a default image instead of crashing when an image isn't found #21

Closed ajroberts0417 closed 1 year ago

ajroberts0417 commented 1 year ago

This isn't the cleanest code, I kind of hacked it together, but it worked on my end to fix the script. When I was trying to import my slack archives, I kept running into issues where it couldn't find the file, resulting in an HTTP 404 crash.

richfromm commented 1 year ago

Before I get into the details of reviewing the code, do you have any idea why it wasn't able to find the file? Were some of your images found (the majority?), just not all of them? Is the behavior consistent -- if you run it repeatedly, is it always the same images that aren't found? Can you discern anything about what was or wasn't found, if that is the case? Like is there maybe a cutoff that images older than a certain timespan aren't available? It would be nice to understand any limitations in the export. Slack's documentation is lacking in details. For the ones that aren't found, are they visible when viewing the corresponding message in Slack?

Did you take a look at the export and look at the URLs there, and try to fetch anything manually, to see if perhaps the root cause is an error in my parsing? Within the list of files, for each file there are two possible attributes we could be using, url_private and url_private_download. In all of my (admittedly limited) test cases, I could find no difference between the two, so I'm using url_private. But maybe there are some cases in which it matters? And maybe a better strategy would be to try url_private, and if that fails, try url_private_download instead? Are you able to point me to a specific testcase that's failing that I could experiment with? (And if you're not comfortable pasting such a URL here from your slack that the entire public Internet can see, let me know and we can arrange some other way to communicate it.)

Another possibility is that maybe something is buggy (or just insufficient to capture all possibilities beyond what I personally encountered) in the way I'm unescaping the URL in ParsedMessage.add_file()

Or perhaps slack changed something in their export format. A JSON snippet of the entire "files" list containing a failing case might be helpful for comparing.

richfromm commented 1 year ago

Another reason I'd love to see the actual export JSON, I'm wondering if there's a sign of something different, take a look at https://github.com/richfromm/slack2discord/pull/20 , which shows files with an undocumented tombstone attribute.

Maybe your files are in a state where if you regenerated the export now, this is what you'd get, and slack deleted the files sometime between you making the export and now?

UPDATE: Another possible failure mode I can think of. Each Slack export generates a token. You can see these at https://myslackname.slack.com/services/export?all_tokens=1 in the section "Your Workspace’s Export File Download Tokens". This is part of the download URL, with the t attribute (...?t=xoxe-...). If you revoked a token (or maybe they expire?), that might lead to a 404 (I haven't tested this).

richfromm commented 1 year ago

There are a few separate questions here.

(1) What is the overall way we should react to such an error? The existing code has a fairly blunt raise_for_status() call, which translates HTTP errors into Python exceptions, to intentionally fail fast. I debated a bit over this, but eventually decided to do it b/c I didn't feel I could adequately anticipate or test all possible failure modes, and I'd rather just fail and have the user investigate further.

Note that I do all of the parsing and transforming before posting anything to Discord. Moving somewhat in the direction of idempotency (even if it's not truly idempotent). This allows you to first run with --dry-run and see if everything before the Discord posting is going to work.

So one question is should we:

You could envision providing 2 or 3 of these as options, possibly selectable via command line options (e.g. --file-not-found=fail, --file-not-found=default, --file-not-found=ask. If so, then there's a question of what the default state is. Or maybe that's overkill.

(2) What to do if we want to automatically recover from a failure. Your proposal is to attach a default image. I think my main problem with that is that files can be arbitrary types of files, not just default images. So I think we should probably either:

Discord appears to discern file type purely based on filename extension. While Slack has various related properties (mimetype, filetype, pretty_type), there is no formal way to add a MIME type in Discord, see discord.File. (I tested this by trying a test in Discord of attaching both a text file and an image. In both cases, I did sub tests with three different names, one with a .txt extension, one with a .jpg extension, and one with no extension at all.) The only supported metadata that we could possibly use to convey anything meaningful (other than filename), is description, but even that is only supported for images.

So if we wanted to attach some file, and wanted to have the contents of the file inlined, so that we could convey to the end user of Discord (and not just the runner of the script) what had happened, the only practical way I think is via a text file.

In other words, suppose there's an image called myimage.jpg, and it's not found. We could create a text file with the following contents:

File myimage.jpg not found

or perhaps

File myimage.jpg not found at URL https://files.slack.com/files-pri/...

or maybe with a WARNING: prefix to the above message.

(Downsides to the latter are that the URLs are somewhat long, and they also contain a token which is sort of private information, but not very private since the only thing it gives you access to are the file attachments that correspond to that Slack export, which you're posting into Discord anyway.)

And then when we attached the file to the message posted to Discord, we'd set the filename attribute to something like myimage.jpg--not-found.txt

This way you'd always see in Discord the right number of files attached to a message, and they would contain as much info as possible. Either the right thing if found, and a meaningful placeholder if not.

Thoughts? I'm leaning towards attaching the fake text file, but am undecided as to whether this should always be the case, or should the user have options as to how to handle file failures.

One more thing, regardless of what we do, at the very least I think we should communicate the issue to the runner of the script with a logger.warning() call.

richfromm commented 1 year ago

Apologies for waiting so long to get back to this...

Having finally decided (and implemented, although not yet merged) how to handle the file deleted case (see https://github.com/richfromm/slack2discord/pull/20/commits/3d5c3de428911be7f2582a4a9acc07917d00c611, attn: @waocats), I think I've also come to a decision about how to handle the file not found case.

I think the situation is somewhat different, in that the file deletion case has a clearly understood path as to how it can happen, but I still don't know what the flow is that results in the Slack export having a reference to an attached file, but the file not actually being present. But obviously it can happen, since that's what caused this PR. Maybe if the file deletion happens after the export was made, but before this script is run to import to Discord?

While we do have a little more information in this case (we know the filename), I think it's overkill to create a fake file attachment just to note that. I do still think we need to provide the user the option as to whether or not to fail or not in the 404 case (b/c there's more uncertainty than the known file deletion case), but I think that decision can be on a global basis, and it's overkill to prompt the user every time. That is, we can fail for all 404's, ignore them, but no 3rd ask option. And I'd rather err on the side of caution and make the default be failure, and the user has to specify a command line option to override and ignore.

waocats commented 1 year ago

Having that as a default sounds reasonable to me. That way, if a user leaves it unattended, they will know something is missing and won't be finding out weeks, months(!), etc. ahead of time unless they explicitly specified that they are fine with that.

W.r.t. to a 3rd ask option: I don't think there is any harm in having an option that asks every time, especially if there are images that are more important for them to keep than others (since we can't assume that these files were actually deleted by the users--we really don't know what happened to them). Realistically it would be nice to have another option at the y/N prompt to possibly specify an image to replace it with (either URL or local path), but I think I'm being a little idealistic and trying to indulge a very specific use-case. Just thinking out loud.

richfromm commented 1 year ago

A third option at a prompt wouldn't be too hard, but the point is that my tentative plan (coding it right now) is to not have an interactive prompt.

Basically, in the case of a 404, we'll fail right away, like before, but with a meaningful error message. If you want to investigate what's up then try rerunning again, that's fine. But if you're okay with ignoring the case of not found files, you can just rerun with an option (tentatively named --ignore-file-not-found), and it'll then just be a WARNING, not an ERROR.

I have a local test that I'm using for this, where I deleted a file from Slack after doing the export. That's my guess as to the situation here that originally caused this to happen.

Remember, not all attached files are images, which is partly why I decided not to go with some kind of default not found image.

I think this is enough of a corner case that providing this level of notification and control is adequate. If a user really wants to upload some file as a replacement, they can always just do that in Discord.

richfromm commented 1 year ago

Thank you for bringing this up, and apologies for taking so long to get to this.

For various reasons discussed both here and in https://github.com/richfromm/slack2discord/pull/20, I've taken a somewhat different approach to resolving this, so I am now closing this in favor of https://github.com/richfromm/slack2discord/pull/27

richfromm commented 1 year ago

Huh, I thought I closed this earlier today. Doing so now.