iulspop / slack-web-scraper

Puppeteer configured to scrape the posts and threads of any channel on Slack.
MIT License

Slack Web Scraper

A web scraper that navigates to a Slack workspace and saves the posts and threads of a given channel or DM.

It uses the Puppeteer headless browser to load and interact with Slack. It doesn't depend on installing an app in the Slack workspace or acquiring an API key. Instead, it logs in to your Slack account and uses that session to access the channel or DM.

It's helpful for saving information from a channel or DM without needing to ask a workspace administrator to export the data.

For example, if you're in the process of leaving your current company to join another, this tool is a great way to archive everything you've said and done on Slack.

How to collect Slack data?

  1. Run npm install to install the dependencies.
  2. Copy the .example.env file in the project root folder and rename the copy to .env. Then modify the environment variables in .env.
  3. Before starting the scrape, make sure the Slack app language is set to English. You can change it back once the scrape is finished.
  4. Run npm run collect. Unless HEADLESS_MODE is set to true, a browser window opens and you can watch it scrape the data. In headless mode, status updates on the scraping process are printed to the console instead.
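
As an illustration of how a setting such as HEADLESS_MODE might be interpreted, here is a minimal sketch. It is not the project's actual code: parseEnv and isHeadless are hypothetical helpers, and the sample text is made up. The point it shows is that values in a .env file are always strings, so "true" has to be compared explicitly rather than treated as a boolean.

```javascript
// Hypothetical helper (not from the project): parse the text of a .env file
// into a plain object. Handles only simple KEY=value lines, blank lines,
// and lines starting with #.
function parseEnv(text) {
  const env = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=value line, skip it
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value[0];
    if (value.length >= 2 && (first === '"' || first === "'") && value[value.length - 1] === first) {
      value = value.slice(1, -1);
    }
    env[key] = value;
  }
  return env;
}

// Hypothetical helper: .env values are strings, so compare against "true".
function isHeadless(env) {
  return String(env.HEADLESS_MODE || '').toLowerCase() === 'true';
}

// Made-up sample content for illustration only.
const sample = ['# example only', 'HEADLESS_MODE=true'].join('\n');
console.log(isHeadless(parseEnv(sample))); // true
```

With this reading, any value other than "true" (including an unset variable) leaves the browser window visible.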

Tip for collecting data with Windows Subsystem for Linux

You need to configure WSL to connect to a GUI even if the browser launches in headless mode. Use this guide to configure WSL to connect to an X server installed on Windows. Before running the collect script, the X server must be open and WSL correctly configured to connect to it, or Puppeteer will fail to launch the browser.
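
A quick pre-flight check can catch the most common misconfiguration before Puppeteer is launched. The sketch below assumes the WSL setup exposes the X server through the standard DISPLAY environment variable, which is how X clients normally find their server. displayProblem is a hypothetical helper, not part of this project.

```javascript
// Hypothetical pre-flight check (not from the project). X clients locate the
// X server through the DISPLAY environment variable, so an empty or missing
// value is an early sign that the browser will not be able to start.
function displayProblem(env) {
  const display = (env.DISPLAY || '').trim();
  if (display === '') {
    return 'DISPLAY is not set; the browser may fail to launch under WSL.';
  }
  // DISPLAY is set. This does not prove the X server is actually reachable.
  return null;
}

const problem = displayProblem(process.env);
if (problem) console.warn(problem);
```

The check only confirms that a display is configured. The X server on Windows still has to be running for the browser to launch.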

How to parse Slack data?

  1. Once you have run npm run collect, run npm run parse. You will be prompted to select the file to parse from the slack-data/ folder. When parsing is complete, the parsed posts and threads are written to a JSON file in slack-data/ with the same name as the source HTML file (for example, slack-data/x.json).
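
The naming rule for the output file can be sketched as follows. This is an illustration under assumptions rather than the project's own code: jsonOutputPath is a hypothetical helper, and general.html is a made-up file name.

```javascript
const path = require('node:path');

// Hypothetical helper (not from the project): given the path of a collected
// HTML file, return the path of a JSON file with the same base name in the
// same folder.
function jsonOutputPath(htmlPath) {
  const { dir, name } = path.parse(htmlPath);
  // path.format builds the file name from name + ext when base is omitted.
  return path.format({ dir, name, ext: '.json' });
}

// Made-up file name for illustration only.
console.log(jsonOutputPath(path.join('slack-data', 'general.html')));
```

Only the extension changes, so each parsed JSON file sits next to the HTML file it came from.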

Contributors ✨

Thanks go to these wonderful people (emoji key):

Iuliu Pop 🤔 💻 📖 👀 💬
William Desportes 💻 🐛
NotEdwin 🐛 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Contributing

Contributions to this project are very welcome! If you have questions, bug reports, or feature requests, please open an issue. If you want to contribute code, open a pull request and I'll review it ASAP.

License

MIT