petermatts / CarStats

This project aims to compile a list of key statistics across all common car models and brands, for ease of comparison for a user.




It does this by web scraping the reliable

Note: there is no front end for this yet.

Note: for the time being, the Data folder is gitignored because it is quite large.


Using Python's selenium module, this project can scrape all brands and their models.

The process of extracting the information from each car's specs page is currently under construction.

A method to iteratively scrape all models (with variants over all years that a model is available) for each brand is planned.

Ideally, once all data is collected, it should be displayed properly, most likely in a table.
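As a rough illustration of the brand and model scraping described above, the sketch below collects link targets from a page with selenium. It is not the project's actual code: the function names, the CSS selector, and the start URL are all hypothetical placeholders, and Chrome with a matching driver is assumed to be installed.

```python
from typing import List


def collect_links(driver, css_selector: str) -> List[str]:
    """Return the unique href values of all elements matching css_selector.

    `driver` is expected to be a Selenium WebDriver. The string
    "css selector" is the value of selenium's By.CSS_SELECTOR strategy.
    """
    links: List[str] = []
    for element in driver.find_elements("css selector", css_selector):
        href = element.get_attribute("href")
        # Skip elements without an href and keep only the first occurrence.
        if href and href not in links:
            links.append(href)
    return links


def scrape_links(start_url: str, css_selector: str) -> List[str]:
    """Open start_url in a browser and collect the matching links."""
    # Imported lazily so collect_links can be used without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(start_url)
        return collect_links(driver, css_selector)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Keeping the link extraction separate from the browser setup makes the extraction logic easy to check against a stand-in driver before pointing it at a real site.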


A graphic of all specs being taken into consideration is coming soon. In the meantime, see Base.txt or Base.csv in the Docs folder.

Future Designs and Plans

See Todo.txt

Running the project (web scraping the data)

  1. cd src

  2. Obtain links to scrape by running within the

    This writes AllBrandsAndModels.json

  3. Next run --summary

    This generates the Links directory. Each brand has its links in a txt file there, i.e. Links/${BRAND}.txt

  4. Run --check to go through the links in the Links directory and log all invalid links.

    Invalid links are logged to Log/ErrorLinks.csv.

  5. To correct the invalid links from step 4, copy Log/ErrorLinks.csv to Log/ErrorLinks-Fix.csv and add the corrected link to the model page (main or specs, either works), if any, in the 3rd column. You may use the 4th column for notes. Unfortunately this is a manual process; a coded solution would not be much better.

    For example:

    • future (car is not yet released thus no specs)
    • no specs page (car exists but does not have a specs page provided)
    • And more

    Then correct these links in the Links directory by running --fix. This overwrites each bad link with the corrected link you found.

  6. (Optional) you may create Docs/AllLinks.txt by running --all

  7. Go up a directory into the main directory: cd ..

  8. Run the data scraper

    • Required: pass the brand you would like to scrape as first argument
    • Optional: pass specific model within brand you would like to scrape
    • Optional: pass in year or --latest to scrape a specific year or the latest year available

    This writes data to Data/YAML/*

  9. (Optional, but recommended) run in the src/data directory to search for and remove duplicates from the YAML data.

  10. Run to find and fix files with bad names, i.e. files that have a "-{year}" in the name.

    Run with --detect option to display all problematic files and --fix to fix these files.

    Note: it may be a good idea to back up the Data folder to another place before doing this. It should be fine, but just in case.

  11. Run --yaml-json or --yaml2json to convert the YAML data into JSON data.

  12. Run --json-csv or --json2csv to convert the JSON data into the final formatted CSV data

    Note that this requires the creation of Base.txt and Base.csv, which can be done by running with the --txt and --csv flags respectively. See the file for more details.

    Note: both steps 11 and 12 utilize the correction functionalities generated by

    Running this file generates the Corrections directory and the files and In the Corrections directory are autogenerated template files for implementing data corrections/formats.

    WARNING: running this file again will overwrite manual corrections made to these template files. Use files like to update the files if keynames need to be changed.

    The file serves as a template/superclass for the correction files. The file drives the correction files and connects them to the functionality in

    Note: The file logs the status of implementation of specification corrections by brand to an excel (.xlsx) sheet.

  13. Compiling the data into one source. Run This generates a CSV file for each brand named ${BRAND}.csv in each brand's csv data directory. It also creates the file Data/CSV/AllData.csv, which contains all the data obtained.
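
To make the final compiling step more concrete, here is a minimal sketch of how several per-brand CSV files could be merged into a single file such as Data/CSV/AllData.csv. This is an assumption about the approach, not the project's own script: the function name `combine_csv_files` is hypothetical, and the sketch assumes each per-brand file has a header row, with columns that may differ between brands.

```python
import csv
from pathlib import Path


def combine_csv_files(brand_files, output_file):
    """Concatenate CSV files into one, using the union of their columns.

    Returns the number of data rows written.
    """
    rows, fieldnames = [], []
    for path in brand_files:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Extend the combined header with any column not seen so far.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    # Create the output directory if it does not exist yet.
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        # restval fills columns a given brand file did not provide.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call might look like `combine_csv_files(sorted(Path("Data/CSV").glob("*/*.csv")), "Data/CSV/AllData.csv")`, where the glob pattern is only a guess at the directory layout. Taking the union of columns keeps brands with extra specs from breaking the merge, at the cost of empty cells for brands that lack them.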