Scraping:
- Multithreaded; uses a number of threads equal to the available system cores.
- Saves the HTML into the correct cache directory (/cache/capes_storage).
- Similar to the course scraper, but without the nested directories.
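A minimal sketch of what the threaded download looks like, assuming requests for HTTP. The endpoint, department list, and file naming here are placeholders, not the scraper's actual values:

```python
import os
import requests
from concurrent.futures import ThreadPoolExecutor

CACHE_DIR = "/cache/capes_storage"

def fetch_capes_page(dept: str) -> None:
    # Hypothetical endpoint; the real scraper's URL and parameters may differ.
    url = f"https://cape.ucsd.edu/responses/Results.aspx?CourseNumber={dept}"
    html = requests.get(url, timeout=30).text
    # One flat directory of files -- no nested directories, unlike the course scraper.
    with open(os.path.join(CACHE_DIR, f"{dept}.html"), "w") as f:
        f.write(html)

departments = ["CSE", "MATH", "PHYS"]  # placeholder department list

# One worker thread per available core, as described above.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    pool.map(fetch_capes_page, departments)
```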
Parsing:
- Iterates through each cached CAPES HTML file and breaks it down from there.
- Each row in the main CAPES HTML table is stored as a row in the SQL table.
- Data is cleaned: letter grades are removed (only the GPA is kept, since the letters can be recreated from it), course numbers are isolated, percentages are removed, etc.
- Saves each row into the SQLite CAPES_DATA table.
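A sketch of the parsing loop under some assumptions: BeautifulSoup as the HTML parser, a made-up column order, and a trimmed-down schema (the real CAPES_DATA table has more columns):

```python
import re
import sqlite3
from pathlib import Path
from bs4 import BeautifulSoup  # assuming BeautifulSoup is the parser in use

conn = sqlite3.connect("capes.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS CAPES_DATA "
    "(instructor TEXT, course TEXT, gpa_received REAL)"
)

for path in Path("/cache/capes_storage").glob("*.html"):
    soup = BeautifulSoup(path.read_text(), "html.parser")
    for row in soup.select("table tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 3:
            continue
        instructor, course, grade = cells[0], cells[1], cells[2]  # assumed column order
        # Letter grades are dropped; only the numeric GPA is kept,
        # e.g. "B+ (3.33)" -> 3.33. The letter can be recreated later.
        match = re.search(r"\(([\d.]+)\)", grade)
        gpa = float(match.group(1)) if match else None
        # Isolate the course number, e.g. "CSE 100 - Advanced Data Structures" -> "CSE 100".
        course = course.split(" - ")[0]
        conn.execute(
            "INSERT INTO CAPES_DATA VALUES (?, ?, ?)", (instructor, course, gpa)
        )

conn.commit()
```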
Exporting:
- Transfers all rows in the SQLite CAPES_DATA table to the equivalent MySQL CAPES_DATA table.
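The transfer itself is a read-from-one-connection, write-to-the-other loop. A sketch assuming PyMySQL, with placeholder hostname and credentials:

```python
import sqlite3
import pymysql  # assuming PyMySQL; any MySQL driver looks much the same

sqlite_conn = sqlite3.connect("capes.db")  # hypothetical database file
mysql_conn = pymysql.connect(
    host="db",          # assumed container hostname
    user="root",
    password="...",     # the correct password, as below
    database="classes",
)

rows = sqlite_conn.execute(
    "SELECT instructor, course, gpa_received FROM CAPES_DATA"
).fetchall()

with mysql_conn.cursor() as cur:
    # Batch-insert every SQLite row into the equivalent MySQL table.
    cur.executemany(
        "INSERT INTO CAPES_DATA (instructor, course, gpa_received) VALUES (%s, %s, %s)",
        rows,
    )
mysql_conn.commit()
```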
To see the CAPES_DATA table, just SSH into the MySQL database container, run mysql -u root -p classes, and enter the correct password. Then run SELECT * FROM CAPES_DATA ... everything should be there.
I've also cleaned up some of the other scrapers and settings by renaming some important environment variables. Additionally, there was a lot of os.chdir-ing going on because relative file names were previously being used (NOT GOOD). I've removed all of these unnecessary stateful calls, since they can only screw with logic down the line.
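For illustration, the kind of change this means in practice (paths here are placeholders):

```python
from pathlib import Path

# Before (fragile): os.chdir("/cache/capes_storage") followed by open("CSE.html")
# mutates process-wide state, so every later relative path silently depends on it.

# After: build absolute paths explicitly; no hidden global state.
CACHE_DIR = Path("/cache/capes_storage")
html = (CACHE_DIR / "CSE.html").read_text()
```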
It's finally done.