This tool combines several open source tools to give insight into accessibility and performance metrics for a list of URLs. It is made up of several parts, described below.
To get started, follow the installation instructions below. Once complete, run:
start app.py
or
python app.py
NOTE: At the moment, no database is used because the initial focus was on CSV data only. The system creates one folder for each set of results, as follows (under /REPORTS/your_report_name):
/SPIDER (used to store crawl data)
At this point, a database would make more sense, along with a function to "Export to CSV", etc.
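For illustration, a report folder could end up looking like this (only /SPIDER and /logs are mentioned in this document; any other folders depend on the tests selected):
/REPORTS/your_report_name/
    /SPIDER    (crawl CSVs, e.g. *_html.csv)
    /logs      (e.g. _gdrive_logs.txt)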
As mentioned, simply provide a CSV with a list of URLs (column header = "Address") and select the tests to run through the web form.
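For example, a minimal input CSV (the URLs shown are placeholders) could look like this:
Address
https://www.example.com/
https://www.example.com/about
https://www.example.com/contact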
The application is configured through environment variables. On startup, the application will also read environment variables from a .env file.
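For illustration only, a .env might set the port the app listens on; the variable name below is an assumption, so check app.py and globals.py for the keys actually read:
# NOTE: key name is illustrative, not confirmed by this README
PORT=8888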
To get all tests running, the following steps are required:
sudo apt update
sudo apt install git
sudo apt-get install python3-pip
sudo apt-get install python3-venv
sudo apt-get update
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get install python3.6
git clone https://github.com/soliagha-oc/perception.git
cd perception
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py
Browse to http://127.0.0.1:8888/ (or alternatively to port 5000 if you didn't set 8888 in the .env file)
Install the following CLI tools for your operating system:
Download and install the chromedriver that matches your installed Chrome/Chromium version.
Download the latest version from the official website and unzip it (here, for instance, version 2.29 to ~/Downloads):
wget https://chromedriver.storage.googleapis.com/2.29/chromedriver_linux64.zip
Move to /usr/local/share (or any folder) and make it executable
sudo mv -f ~/Downloads/chromedriver /usr/local/share/
sudo chmod +x /usr/local/share/chromedriver
Create symbolic links
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
OR
export PATH=$PATH:/path-to-extracted-file/
OR add the export line above to your .bashrc
Go to the geckodriver releases page (https://github.com/mozilla/geckodriver/releases), find the latest version of the driver for your platform, and download it. For example:
wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
Extract the file with:
tar -xvzf geckodriver*
Make it executable:
chmod +x geckodriver
Add the driver to your PATH so other tools can find it:
export PATH=$PATH:/path-to-extracted-file/
OR add the export line above to your .bashrc
Install node
https://nodejs.org/en/download/
curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
sudo apt-get install -y nodejs
Install npm:
npm install npm@latest -g
or, if permissions require it:
sudo npm install npm@latest -g
Install lighthouse:
npm install -g lighthouse
or:
sudo npm install -g lighthouse
https://www.xpdfreader.com/download.html
To install this binary package:
Copy the executables (pdfimages, xpdf, pdftotext, etc.) to /usr/local/bin.
Copy the man pages (.1 and .5) to /usr/local/man/man1 and /usr/local/man/man5.
Copy the sample-xpdfrc file to /usr/local/etc/xpdfrc. You'll probably want to edit its contents (as distributed, everything is commented out) -- see xpdfrc(5) for details.
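A rough sketch of those steps on Linux, assuming the shell is in the extracted xpdf directory and that the binaries, man pages, and sample-xpdfrc sit at its top level (the real package layout may differ):
sudo mkdir -p /usr/local/man/man1 /usr/local/man/man5 /usr/local/etc
sudo cp pdfimages pdftotext pdfinfo /usr/local/bin/
sudo cp *.1 /usr/local/man/man1/
sudo cp *.5 /usr/local/man/man5/
sudo cp sample-xpdfrc /usr/local/etc/xpdfrc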
See this "Quick Start" guide to enable the Drive API: https://developers.google.com/drive/api/v3/quickstart/python
Complete the steps described on that page to create a simple Python command-line application that makes requests to the Drive API.
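A minimal sketch of what that quickstart sets up, assuming the google-api-python-client and google-auth-oauthlib packages are installed and credentials.json sits in the working directory:
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive"]

# Run the OAuth flow against credentials.json; a browser window opens for consent
flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
creds = flow.run_local_server(port=0)

# Build a Drive v3 client and list a few files to confirm access
service = build("drive", "v3", credentials=creds)
results = service.files().list(pageSize=10, fields="files(id, name)").execute()
for f in results.get("files", []):
    print(f["name"], f["id"])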
See: https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#commandlineoptions
The Screaming Frog SEO Spider CLI provides the following data sets (required files listed in bold): - crawl_overview.csv (used to create the report DASHBOARD). An example CLI invocation is sketched after the notes below.
Note: There are spider config files located in the /conf folder. You will require a licence to alter the configurations.
Note: If a licence is not available, simply provide a CSV where at least one column has the header "Address". See the DRUPAL example.
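For example, a licensed Linux install could run a headless crawl along these lines (exact flags depend on the installed version; see the command line options page linked above, and treat the URL and output path as placeholders):
screamingfrogseospider --crawl https://www.example.com/ --headless --output-folder /path/to/REPORTS/your_report_name/SPIDER --export-tabs "Internal:All"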
Installed via pip install -r requirements.txt
See: https://pypi.org/project/axe-selenium-python/ and https://github.com/dequelabs/axe-core
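A minimal sketch of how axe-selenium-python drives an audit (assuming Firefox and geckodriver are installed as above; the URL is a placeholder):
from selenium import webdriver
from axe_selenium_python import Axe

driver = webdriver.Firefox()   # requires geckodriver on PATH
driver.get("https://www.example.com/")

axe = Axe(driver)
axe.inject()                    # inject the axe-core JavaScript into the page
results = axe.run()             # run the accessibility checks
axe.write_results(results, "axe_results.json")
driver.quit()

print(len(results["violations"]), "violations found")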
Lighthouse is an open-source, automated tool for improving the performance, quality, and correctness of your web apps.
When auditing a page, Lighthouse runs a barrage of tests against the page, and then generates a report on how well the page did. From here you can use the failing tests as indicators of what you can do to improve your app.
Quick-start guide on using Lighthouse: https://developers.google.com/web/tools/lighthouse/
View and share reports online: https://googlechrome.github.io/lighthouse/viewer/
Github source and details: https://github.com/GoogleChrome/lighthouse
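Once installed, a single page can be audited from the command line, for example (URL and output path are placeholders):
lighthouse https://www.example.com/ --output json --output-path ./lighthouse-report.json --chrome-flags="--headless"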
While there is a /reports/ dashboard, the system can also write to a Google Sheet. To do this, set up credentials for Google API authentication at https://console.developers.google.com/apis/credentials to obtain a valid "credentials.json" file.
To facilitate branding and other report metrics, a "non-coder/sheet formula template" is used. Here is a sample template. When a report is run from the /reports/ route, the template is loaded (the template report and folder IDs are found in globals.py and need to be set up/updated once), and the Google Sheet is either created or updated (a unique report ID is auto-generated and can be found in /REPORTS/your_report_name/logs/_gdrive_logs.txt).
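Illustrative only, not necessarily how the application implements it: copying a template spreadsheet into a Drive folder and writing values into the copy might look like the sketch below, where TEMPLATE_ID and FOLDER_ID are placeholders for the IDs configured in globals.py:
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive", "https://www.googleapis.com/auth/spreadsheets"]
TEMPLATE_ID = "your-template-spreadsheet-id"   # placeholder; see globals.py
FOLDER_ID = "your-reports-folder-id"           # placeholder; see globals.py

creds = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES).run_local_server(port=0)
drive = build("drive", "v3", credentials=creds)
sheets = build("sheets", "v4", credentials=creds)

# Copy the formula/branding template into the reports folder
copy = drive.files().copy(
    fileId=TEMPLATE_ID,
    body={"name": "your_report_name", "parents": [FOLDER_ID]},
).execute()

# Write an illustrative header row into the new sheet
sheets.spreadsheets().values().update(
    spreadsheetId=copy["id"],
    range="A1",
    valueInputOption="RAW",
    body={"values": [["Address", "Result"]]},
).execute()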
If you have a Screaming Frog SEO Spider licence, be sure to add the Screaming Frog CLI to your PATH. Even if Screaming Frog SEO Spider is not installed, a CSV can be provided to guide the report tools. Once installed, try running the sample CSV. To do this:
NOTE: This would exclude PDFs, which require a list of exclusively PDF URLs.
Running a sample can be accomplished in two ways: using the samples provided in the "/REPORTS/DRUPAL/" folder, or by downloading and installing Screaming Frog SEO Spider and running a free crawl (500 URL limit and no configuration/CLI tool access). Once the crawl is completed or the file created, create/save the following CSVs:
If another method is used to crawl a base URL, be sure to include the results in a CSV file where at least one header (first row) reads "Address", provide one or more web or PDF URLs, and ensure that the filename(s) match those listed above and are placed in the "/REPORTS/your_report_name/SPIDER/" folder. At least one *_html.csv file is required and must be in the appropriate folder.
It is possible when crawling and scanning sites to encounter various security risks. Please be sure to have a virus scanner enabled to protect against JavaScript and other attacks or disable JavaScript in the configuration.