jyn514 opened this issue 6 years ago (status: Open)
Parsing is done in https://github.com/jyn514/GradeForge/commit/89da8f1ad6470989688308caeda1141472b3a014; it just needs to be added to follow_links to be included in the dump. SQL for this has not yet been started.
Will need to follow links for every section individually. Assigning @charlesdaniels as our massively parallel expert to be in charge of this. The relevant functions are get_bookstore_selenium and parse_bookstore, in download.py and parse.py respectively. The CLI is gradeforge download bookstore <department> <code> <section>, if that helps.
I need to make the parsing less slow. We can start by having a single instance of Chrome instead of a different one for each section; the method supports this, it just needs to be used by the calling code. After that we can try using a callback to see when the page is loaded instead of guessing. Done in https://github.com/jyn514/GradeForge/commit/6093ad666e89dcde67dcfe70974f7b6b599b08dd.
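As a rough illustration of the single-Chrome idea, here is a minimal sketch. The names download_bookstore and FakeDriver are made up for this example (the fake driver stands in for selenium's webdriver so the pattern runs without a browser); this is not the actual code in download.py:

```python
class FakeDriver:
    """Stand-in for selenium's webdriver; counts how many times it is started."""
    instances = 0

    def __init__(self):
        FakeDriver.instances += 1

    def get(self, section):
        # Real code would drive the browser; here we just fake a page.
        return "<html>bookstore page for %s</html>" % section

    def quit(self):
        pass


def download_bookstore(sections, driver=None):
    """Download every section, starting the browser at most once.

    Mirrors the idea that the downloading function should accept an
    existing driver instead of launching a new Chrome per call.
    """
    owns_driver = driver is None
    if owns_driver:
        driver = FakeDriver()  # real code: selenium.webdriver.Chrome()
    try:
        return [driver.get(s) for s in sections]
    finally:
        if owns_driver:
            driver.quit()  # only tear down a driver we created
```

The caller can create one driver up front and pass it to every call, paying the startup cost once.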
This is fully done on the parse side in https://github.com/jyn514/GradeForge/commit/340f041aa369ecac92c573ed5bcfdb938eedd290. It's not yet concurrent, so you can work on that if you like.
Just a thought: the limiting factor for concurrency is not actually downloads but the driver. If there were some way to set up a client/server model for the driver, not everything would have to be done in the same process. This would also allow multiple drivers (maybe 4 max to avoid DoSing the server) to run at once.
The reason I'm not considering a different driver for each process is that it takes about 5 seconds to start the driver, compared to ~0.5 seconds to actually download and parse a section.
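A pooled variant of that client/server idea might look like the sketch below, capped at 4 drivers as suggested. DriverPool and DummyDriver are hypothetical names, and the dummy replaces webdriver.Chrome so the example runs without any browsers:

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class DriverPool:
    """A fixed set of long-lived drivers shared by many worker threads,
    so the ~5 s startup cost is paid only `size` times."""

    def __init__(self, factory, size=4):  # 4 max to avoid DoSing the server
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    def fetch(self, section):
        driver = self._pool.get()       # blocks until a driver is free
        try:
            return driver.get(section)
        finally:
            self._pool.put(driver)      # hand the driver back for reuse


class DummyDriver:
    """Stand-in for webdriver.Chrome; counts how many drivers get created."""
    created = 0

    def __init__(self):
        DummyDriver.created += 1

    def get(self, section):
        return "page for " + section


pool = DriverPool(DummyDriver, size=4)
sections = ["S%03d" % i for i in range(20)]
with ThreadPoolExecutor(max_workers=8) as ex:
    pages = list(ex.map(pool.fetch, sections))
```

Twenty sections get fetched by eight threads, but only four drivers ever exist; queue.Queue handles the locking.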
What you want is called Selenium Grid.
However, since it sounds like you are currently choking on setup time, you may be better off setting up a single global driver shared across the entire application. I have had good success with this approach in the past.
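A minimal sketch of that single-global-driver pattern, assuming a hypothetical _make_driver factory in place of webdriver.Chrome():

```python
import threading

_driver = None
_lock = threading.Lock()


def _make_driver():
    # Hypothetical factory; real code: selenium.webdriver.Chrome()
    return object()


def get_driver():
    """Return the shared driver, creating it lazily on first call."""
    global _driver
    with _lock:  # avoid two threads racing to create it
        if _driver is None:
            _driver = _make_driver()
    return _driver
```

Every caller then uses get_driver() instead of constructing its own, so the 5-second startup happens exactly once per process.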
Currently we store a link to a link to the bookstore's webpage. This is not ideal. The reason we don't link directly is that the page requires a POST to access. The reason we don't just parse it once and be done is that it's loaded dynamically (on the client side) with some extremely malicious/obfuscated JavaScript.
I would eventually like to list (at a minimum) the ISBN numbers of all required textbooks. Unfortunately, this will be impossible until there is a way to parse the site. @charlesdaniels mentioned selenium, but I've had a hard time getting it working.
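If and when the rendered page can be captured, pulling ISBN candidates out of the text could start with a simple regex. This is only a sketch: find_isbns is a made-up helper, the pattern is deliberately loose, and it does not validate check digits:

```python
import re

# Matches ISBN-13 (978/979 prefix) and ISBN-10 forms, with optional
# hyphens or spaces between groups. Illustrative only; the real
# bookstore markup may need a stricter or looser pattern.
ISBN_RE = re.compile(
    r"\b(?:97[89][- ]?)?\d{1,5}[- ]?\d{1,7}[- ]?\d{1,7}[- ]?[\dX]\b"
)


def find_isbns(text):
    """Return candidate ISBNs with separators stripped, keeping only
    matches whose normalized length is 10 or 13 characters."""
    out = []
    for m in ISBN_RE.finditer(text):
        normalized = m.group().replace("-", "").replace(" ", "")
        if len(normalized) in (10, 13):
            out.append(normalized)
    return out
```

This would run over the page text selenium hands back; anything fancier (check-digit validation, deduplication) can come later.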