Refer to the complete documentation and this series of blog posts for more detailed information.
It is against ProBoards's Terms of Service to scrape content from a ProBoards forum. The code in this repository is purely for educational purposes, i.e., to demonstrate the use of various libraries and techniques, and should NOT be used to scrape any ProBoards forum or website.
Neither the author(s) nor this repository have any affiliation or association with ProBoards.
Per the license included in this repository, this software is provided "as is" without warranty of any kind and is not guaranteed to work. Neither the author(s) nor the software shall be held liable for any consequences resulting from its use.
The purpose of this package is, as the disclaimer above states, to
demonstrate the use of various Python modules/packages and various
web-scraping techniques. It is designed to crawl a forum in a top-down
manner and store user profiles, categories, boards, threads, polls, posts,
shoutbox posts, post smileys, user avatars, and the site background/banner
images primarily in a SQLite database. Scraping is achieved via a
combination of BeautifulSoup and Selenium, and sqlalchemy is used to
interface with the SQLite database. Because the majority of this task
involves HTTP requests and network I/O, the forum is scraped asynchronously
using asyncio, aiohttp, and aiofiles.
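The benefit of asynchronous scraping for I/O-bound work can be sketched with the standard library alone. The example below is not the package's code: `fetch_page`, `fetch_all`, and `max_concurrent` are hypothetical names, and `asyncio.sleep` stands in for a real HTTP request (where the package would use aiohttp). It only illustrates the general pattern of running many requests concurrently while capping how many are in flight.

```python
from __future__ import annotations

import asyncio


async def fetch_page(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an HTTP GET; sleeps instead of hitting the network."""
    await asyncio.sleep(delay)  # simulates network latency
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str], max_concurrent: int = 5) -> list[str]:
    """Fetch all URLs concurrently, capping the number of in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def run_demo() -> list[str]:
    urls = [f"https://yoursite.proboards.com/thread/{i}" for i in range(10)]
    return asyncio.run(fetch_all(urls))
```

Because the simulated requests overlap while waiting, ten pages at five concurrent requests take roughly two delay periods instead of ten.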
The figure below illustrates the scraper architecture and data flow at a high level.
The proboards_scraper.scraper module contains functions that are called to
scrape the entire site or parts of the site. For example, scrape_forum()
scrapes the forum homepage (including shoutbox posts) and calls scrape_users()
and scrape_board() to get user profiles and all boards, respectively.
scrape_board() recursively scrapes all sub-boards as well as all threads (and
their posts) contained in a given board. Any one of these functions can be
called on its own to scrape a particular section of a site, e.g., only user
profiles or only a specific board.
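The top-down, recursive traversal described above can be illustrated with a small in-memory model. This is a sketch under assumptions, not the package's implementation: the `Board` dataclass and `crawl_board` function are hypothetical, no network access is involved, and the visiting order (board, then its threads, then its sub-boards) is chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Board:
    """Hypothetical in-memory stand-in for a forum board."""
    name: str
    threads: list[str] = field(default_factory=list)
    sub_boards: list[Board] = field(default_factory=list)


def crawl_board(board: Board, visited: list[str]) -> None:
    """Record the board, then its threads, then recurse into each sub-board."""
    visited.append(f"board:{board.name}")
    for thread in board.threads:
        visited.append(f"thread:{thread}")
    for sub_board in board.sub_boards:
        crawl_board(sub_board, visited)
```

Calling `crawl_board` on any node of the tree visits only that subtree, which mirrors how a single board can be scraped on its own.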
Regardless of which one is called, all of the aforementioned functions take a
ScraperManager class instance as one of their parameters. The ScraperManager
handles tasks related to performing HTTP requests/downloads, and also serves
as an interface with a Database class instance which, in turn, serves as an
interface for querying the SQLite database or inserting items into it.
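The layering described above (scraping code talks to a manager, which talks to a database wrapper) might look roughly like the following. Everything here is an assumption made for illustration: the `Manager` class, the method names, and the `users` table are hypothetical, and the standard-library `sqlite3` module is used in place of SQLAlchemy to keep the sketch self-contained.

```python
import sqlite3


class Database:
    """Hypothetical thin wrapper around a SQLite connection."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def insert_user(self, user_id: int, name: str) -> None:
        # INSERT OR REPLACE makes re-inserting the same user id idempotent.
        self.conn.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (user_id, name),
        )
        self.conn.commit()

    def query_user(self, user_id: int):
        row = self.conn.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return None if row is None else {"id": row[0], "name": row[1]}


class Manager:
    """Hypothetical manager: scraping code calls this instead of the database directly."""

    def __init__(self, db: Database) -> None:
        self.db = db

    def add_user(self, user: dict) -> None:
        self.db.insert_user(user["id"], user["name"])
```

Keeping the database behind a single object means the scraping functions never need to know how items are stored.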
For more information, refer to the complete documentation.
git clone git@github.com:nrsyed/proboards-scraper.git
cd proboards-scraper
pip install .
Scraping is performed via the pbs (ProBoards Scraper) command line tool. Login
is not required to scrape a site. If authentication credentials, i.e.,
username and password, are not provided, the program will proceed without
logging in and any password-protected areas of the site will not be scraped.
The following examples demonstrate basic use of the pbs command.
# Scrape the entire forum.
pbs https://yoursite.proboards.com -u user -p pass
# Scrape all user profiles.
pbs https://yoursite.proboards.com/members -u user -p pass
# Scrape a specific user's profile.
pbs https://yoursite.proboards.com/user/4 -u user -p pass
# Scrape a specific board (including all its threads and sub-boards).
pbs https://yoursite.proboards.com/board/2/boardname -u user -p pass
# Scrape a specific thread.
pbs https://yoursite.proboards.com/thread/123/thread-title -u user -p pass
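The examples above suggest that the kind of page to scrape is determined by the URL path. One way such a dispatch could be written is sketched below; `classify_url` is a hypothetical helper based only on the URL shapes shown in these examples, not a function from the package.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> tuple:
    """Return (kind, id) for a forum URL; id is None when the URL carries no numeric id."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return ("forum", None)  # bare homepage: scrape everything
    if segments[0] == "members":
        return ("users", None)  # all user profiles
    if segments[0] in ("user", "board", "thread") and len(segments) > 1:
        # The second path segment is the numeric id, e.g. /board/2/boardname.
        return (segments[0], int(segments[1]))
    raise ValueError(f"Unrecognized forum URL: {url}")
```

For instance, `classify_url("https://yoursite.proboards.com/board/2/boardname")` yields `("board", 2)`.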
By default, the command stores files in ./site, with the database file named
forum.db and all downloaded images stored in ./site/images:
site
├── forum.db
└── images
    ├── 0109df55a94edf945e04bfa1ac494133.png
    ├── 44af035a39a673cce28d10d2c7a7ef0.gif
    ├── 791ec775aa570e88734cf9e83c4105966.ico
    └── fd0d26b36a29dc621b7aebd1a4d5a0d7.jpg
The output directory can be changed from ./site with the -o/--output option:
pbs https://yoursite.proboards.com -o /path/to/directory
The package includes a simple command line utility, pbd (ProBoards Scraper
Database tool), for querying the database.
# Print all boards.
pbd -b
# Print detailed information (sub-boards, thread list, etc.) for a
# specific board id.
pbd -b 2
# Print a list of all threads.
pbd -t
# Print detailed information (poll, posts) for a specific thread id.
pbd -t 500
# Print a list of all users, including guests.
pbd -u
# Print detailed information about a specific user id.
pbd -u 23
The tool assumes the database is located at ./site/forum.db by default, but
a different database file can be specified with the -d/--database option:
pbd -d /path/to/database.db -u 23
The ProBoards Scraper command line tool pbs can be used to scrape part or all
of a ProBoards forum.
usage: pbs [-h] [-u USERNAME] [-p PASSWORD] [-o <path>] [-D] [-U]
           [-v {0,1,2,3,4,5}] url

positional arguments:
  url                   URL for either the main page, a board, a thread, or
                        a user

optional arguments:
  -h, --help            show this help message and exit
  -o <path>, --output <path>
                        Path to output directory containing database and
                        site files (default ./site)
  -D, --no-delay        Do not rate limit requests
  -U, --no-users        Do not grab user profiles (only use this option if
                        a database exists and users have already been added
                        to it)
  -v {0,1,2,3,4,5}, --verbosity {0,1,2,3,4,5}
                        Verbosity level from 0 (silent) to 5 (full debug);
                        default 2

Login arguments:
  -u USERNAME, --username USERNAME
                        Login username
  -p PASSWORD, --password PASSWORD
                        Login password
The ProBoards Scraper Database command line tool pbd can be used to perform
simple queries on the database generated from scraping the site via the pbs
command line tool.
usage: pbd [-h] [-d <path>]
           (--board [board_id] | --user [user_id] | --thread [thread_id])

optional arguments:
  -h, --help            show this help message and exit
  -d <path>, --database <path>
                        Path to database file; default ./site/forum.db
  --board [board_id], -b [board_id]
                        Board id; if omitted, list all boards
  --user [user_id], -u [user_id]
                        User id; if omitted, list all users
  --thread [thread_id], -t [thread_id]
                        Thread id; if omitted, list all threads
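The usage string above (exactly one of --board, --user, or --thread, each with an optional id) is the shape argparse produces for a required mutually exclusive group whose options use nargs="?". The sketch below reconstructs a parser with that interface; it is an approximation built from the help text, not the package's source, and the `ALL` sentinel and `build_parser` name are hypothetical.

```python
import argparse

# Hypothetical sentinel meaning "flag given without an id, so list everything".
# It is a plain object rather than a string because argparse passes string
# const values through the type= converter.
ALL = object()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pbd")
    parser.add_argument(
        "-d", "--database", metavar="<path>", default="./site/forum.db",
        help="Path to database file; default ./site/forum.db",
    )
    # required=True plus mutual exclusion means exactly one flag must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        "--board", "-b", nargs="?", type=int, const=ALL, metavar="board_id",
        help="Board id; if omitted, list all boards",
    )
    group.add_argument(
        "--user", "-u", nargs="?", type=int, const=ALL, metavar="user_id",
        help="User id; if omitted, list all users",
    )
    group.add_argument(
        "--thread", "-t", nargs="?", type=int, const=ALL, metavar="thread_id",
        help="Thread id; if omitted, list all threads",
    )
    return parser
```

With this parser, `pbd -b` leaves `args.board` set to the sentinel, while `pbd -b 2` sets it to the integer 2; the flags that were not given remain `None`.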