rusq / slackdump

Save or export your private and public Slack messages, threads, files, and users locally without admin privileges.
GNU General Public License v3.0

Export only users participating or part of exported channels or conversations #287

Open cdeszaq opened 1 month ago

cdeszaq commented 1 month ago

Is your feature request related to a problem? Please describe.

My Slack has many more users than I care about, since I only care about a subset of conversations. Exporting all of them is a large waste of time and resources.

Describe the solution you'd like

I would like an option to build up the user information / cache as the desired channels and conversations are exported or dumped, rather than either not cache the information or not export the user information.

Describe alternatives you've considered

  1. Simply not having user information - Just userIDs has a certain "obfuscation" appeal, but still greatly reduces the usability of the export
  2. Manually maintaining the user info file - Not appealing because there are still many users I do care about, and they get added over time, so manual maintenance would be burdensome
  3. A way to update an existing user cache, rather than replace it - This would probably work, though given a large-enough set of users could still be burdensome
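Alternative 3 (updating an existing cache rather than replacing it) boils down to a merge keyed on user ID. A minimal sketch of that bookkeeping, where the `User` struct and `mergeUsers` helper are hypothetical stand-ins, not slackdump's actual cache format:

```go
package main

import "fmt"

// User is a hypothetical, trimmed-down user record; a real cache
// entry would carry many more fields.
type User struct {
	ID   string
	Name string
}

// mergeUsers updates an existing cache with freshly fetched users:
// everyone already known is kept, and newer data overwrites stale
// entries with the same ID.
func mergeUsers(cached, fetched []User) []User {
	byID := make(map[string]User, len(cached)+len(fetched))
	for _, u := range cached {
		byID[u.ID] = u
	}
	for _, u := range fetched {
		byID[u.ID] = u // newer data wins
	}
	merged := make([]User, 0, len(byID))
	for _, u := range byID {
		merged = append(merged, u)
	}
	return merged
}

func main() {
	cached := []User{{"U1", "alice"}, {"U2", "bob"}}
	fetched := []User{{"U2", "bobby"}, {"U3", "carol"}}
	fmt.Println(len(mergeUsers(cached, fetched))) // 3 unique users
}
```

The burden the commenter mentions would remain: the merge itself is cheap, but each update still needs an API pass to produce the `fetched` slice.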

Additional context

None

rusq commented 1 month ago

Hey @cdeszaq, if you can be bothered, try building v3 from the master branch; it has some major improvements. There's no documentation yet, but you can try reading the man page (in the root of the project, type man ./slackdump.1), or refer to this comment which I left in another issue: https://github.com/rusq/slackdump/issues/273#issuecomment-2028014079

The v3 branch was merged into master, so ignore the comment saying "checkout the v3 branch".

cdeszaq commented 1 month ago

Ahh, I see! I'll happily play w/ v3 and report back. I'm currently doing so with a checkout of master, in fact, with this command:

./slackdump export -enterprise -cache-dir /Users/cdeszaq/playground/slackdump/cache -o /Users/cdeszaq/playground/slackdump/exportTest4 -v <chanID1> <chanID2>

(the cache directory doesn't seem to be getting used, but that's not so much "functionality" related, so I don't mind)

But, after downloading the channel messages and files, it seems to also be downloading all the users. I've not dug into the man pages yet, nor looked closely through the linked issue, but I'll try those next.

Otherwise, any pointers to what I'm missing or doing wrong in the incantation above would help as well, I'm sure.

cdeszaq commented 1 month ago

From the head of master, with a vanilla archive command limited to 2 channels on an enterprise Slack, I'm still seeing what looks like "download all the users" behavior. Based on the terminal output alone, the program appears to hang: in non-verbose mode I stop getting output after it has finished downloading the threads. In verbose mode (with the -v flag) I see a regular march of output that at least shows something is happening.

Command:

./slackdump archive -enterprise  <chanID1> <chanID2>

-v trailing output:

archive: 2024/05/06 17:00:04.897276 network.go:136: success
archive: 2024/05/06 17:00:04.912997 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:06.438011 network.go:136: success
archive: 2024/05/06 17:00:06.452961 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:07.920001 network.go:136: success
archive: 2024/05/06 17:00:07.926486 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:08.971212 network.go:136: WithRetry: slack rate limit exceeded, retry after 30s (*slack.RateLimitedError) after 1 attempts
archive: 2024/05/06 17:00:08.971262 network.go:136: got rate limited, sleeping 30s
archive: 2024/05/06 17:00:39.665184 network.go:136: success

That said, I do like the default output directory name: simply stamping the directory name with the execution date/time saves me a step I would inevitably do myself!

rusq commented 1 month ago

That's great! Just checked, and yes, unfortunately you can't skip downloading all the users. This is something I'll look into; I can't promise I'll get to it in the near future due to life circumstances, but it will be on my list when I get back to it. I hope v3 is better suited to what you're trying to do, and that you find it more pleasant to use than v2. It would be great if you could report any issues.

Regarding the rate limit: there have been instances where Slack would severely restrict the user endpoint. You can see the exact endpoint that was limited if you enable tracing (./slackdump archive -enterprise -v -trace=trace.out <chanID1> ...) and then run go tool trace trace.out on it; look under "User defined tasks" and "User defined regions". Luckily, it seems like the retry recovery worked?

cdeszaq commented 1 month ago

Retry recovery seems to work, as it is cranking through the users regardless of being rate-limited.

The newer version indeed seems much nicer. I've yet to try the -member-only option, which is another key desire I have. I'm not sure if it will work with archive or only with export (nor am I very clear on the differences between the two), but I'll play with it.

Any pointers on where/how to start hacking on a flag to "queue users to download when encountered" (and only download those users) would be useful. I may take a whack at hacking it in myself! :-)
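One hypothetical shape for a "queue users to download when encountered" flag: instead of paging through the whole workspace directory up front, collect user IDs as messages stream through, then resolve only that set afterwards. Nothing below is slackdump's actual API; the `Message` type and `userCollector` are invented for the sketch:

```go
package main

import "fmt"

// Message is a stand-in for a Slack message, carrying only the
// field this sketch needs.
type Message struct {
	UserID string
	Text   string
}

// userCollector records each distinct user ID it sees while
// messages are being archived.
type userCollector struct {
	seen map[string]struct{}
}

func newUserCollector() *userCollector {
	return &userCollector{seen: make(map[string]struct{})}
}

// observe notes the author of a message; duplicates are free.
func (c *userCollector) observe(m Message) {
	if m.UserID != "" {
		c.seen[m.UserID] = struct{}{}
	}
}

// IDs returns the set of user IDs encountered, i.e. the only users
// that would need fetching, instead of the entire workspace.
func (c *userCollector) IDs() []string {
	ids := make([]string, 0, len(c.seen))
	for id := range c.seen {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	c := newUserCollector()
	for _, m := range []Message{{"U1", "hi"}, {"U2", "yo"}, {"U1", "again"}} {
		c.observe(m)
	}
	fmt.Println(len(c.IDs())) // 2 distinct users
}
```

For the two-channel, 10-hour workspace described later in this thread, the win is that the final fetch is proportional to the participants of the exported conversations, not the size of the whole org.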

rusq commented 1 month ago

I will have to do a bit of diving here to explain.

V3 has a concept of "chunks": a centralised format that represents one "chunk" of API output, so each endpoint call maps to one chunk type, i.e. WorkspaceInfo, User, ChannelInfo, Messages, ThreadMessages, etc. The Chunk structure, if you look at it, is universal and contains all possible payload types that could be grabbed from the API endpoints. Depending on the API call, the respective chunk type is set, the relevant payload member variables are populated, and then the structure is marshalled into the Writer. One could call it a "native slackdump format", because internally, in v3, everything goes through the chunk format.

"archive" creates a "recording" of the API output, that can even be "replayed" later to mock the actual SlackAPI output. It can be converted to "export" format later, if required with slackdump convert.

"export" actually is generating chunk files in temporary directory, and then converts it to "export" format "on the fly" to the destination directory. The same happens when you run "dump".

cdeszaq commented 1 month ago

To give a sense of the amount of time my Slack's user-data download takes (and the motivation for this overall feature request): that user portion of the archive (of 2 small channels) took more than 10 hours to retrieve more than 4 GB of data (uncompressed). The channel contents themselves (with a few files) took 9 seconds.