jyn514 / GradeForge

Courses available from my.sc.edu
GNU General Public License v3.0

Integrate with bookstore #13

Open jyn514 opened 6 years ago

jyn514 commented 6 years ago

Currently we store a link to a link to the bookstore's webpage. This is not ideal. We don't link directly because the page requires a POST to access, and we don't just parse it and be done because it's loaded dynamically (on the client side) with some extremely malicious/obfuscated JavaScript.

I would eventually like to list (at a minimum) the ISBNs of all required textbooks. Unfortunately, this will be impossible until there is a way to parse the site. @charlesdaniels mentioned Selenium, but I've had a hard time getting it working.

jyn514 commented 6 years ago

Parsing is done in https://github.com/jyn514/GradeForge/commit/89da8f1ad6470989688308caeda1141472b3a014; it just needs to be added to follow_links to be included in the dump. SQL for this has not yet been started.

jyn514 commented 6 years ago

Will need to follow links for every section individually. Assigning @charlesdaniels, as our massively parallel expert, to be in charge of this. The relevant functions are get_bookstore_selenium in download.py and parse_bookstore in parse.py. The CLI is gradeforge download bookstore <department> <code> <section>, if that helps.

jyn514 commented 6 years ago

I need to make the parsing less slow. We can start by having a single instance of Chrome instead of a different one for each section. The method supports this; it just needs to be used by the calling code. After that we can try to use a callback to see when the page is loaded instead of guessing. Done in https://github.com/jyn514/GradeForge/commit/6093ad666e89dcde67dcfe70974f7b6b599b08dd.
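For anyone reading along, here is a rough sketch of the two ideas above: reuse one driver for every section, and wait on a condition rather than a fixed sleep. This is not the code from the commit. The names download_all, navigate and wait_for_books are made up for illustration, and the `.book-list` selector is a guess, since the real page structure isn't described here.

```python
from typing import Callable, Iterable, List, Tuple

# (department, code, section), mirroring the CLI arguments.
Section = Tuple[str, str, str]


def wait_for_books(driver, timeout: float = 10.0) -> None:
    """Block until the page has rendered, instead of sleeping and guessing.

    Assumption: the rendered page contains an element matching
    `.book-list`. That selector is hypothetical.
    """
    # Imported lazily so the rest of the sketch runs without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".book-list"))
    )


def download_all(
    driver,
    sections: Iterable[Section],
    navigate: Callable[[object, Section], None],
    wait: Callable[[object], None] = wait_for_books,
) -> List[str]:
    """Fetch every section's page with one shared driver.

    `navigate` is a hypothetical callback that points the driver at the
    bookstore page for one section, however that has to be done.
    """
    pages = []
    for section in sections:
        navigate(driver, section)
        wait(driver)  # returns as soon as the page is ready
        pages.append(driver.page_source)
    return pages
```

The caller creates the driver once and passes it in, so the startup cost is paid a single time no matter how many sections are fetched.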

jyn514 commented 6 years ago

This is fully done on the parse side in https://github.com/jyn514/GradeForge/commit/340f041aa369ecac92c573ed5bcfdb938eedd290. It's not yet concurrent, so you can work on that if you like.

jyn514 commented 6 years ago

Just a thought - the limiting factor for concurrency is not actually downloads but the driver. If there were some way to set up a client/server model for the driver, not everything would have to be done in the same process. This would also allow multiple drivers (maybe 4 max, to avoid DoSing the server) to run at once.

The reason I'm not considering a different driver for each process is that it takes about 5 seconds to start the driver, compared to ~0.5 seconds for actually downloading and parsing a section.
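One possible shape for this, short of a real client/server setup, is a small pool of long-lived drivers shared by worker threads in a single process. The sketch below assumes two hypothetical callables: make_driver, which starts a driver, and fetch, which downloads and parses one section with a given driver. Neither is an existing GradeForge function.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

MAX_DRIVERS = 4  # cap from the comment above, to avoid hammering the server


def download_concurrently(sections, make_driver, fetch, max_drivers=MAX_DRIVERS):
    """Run fetch(driver, section) for every section using a fixed driver pool.

    Each driver is started once and reused, so the ~5 second startup is
    paid at most `max_drivers` times rather than once per section.
    """
    sections = list(sections)
    if not sections:
        return []

    count = min(max_drivers, len(sections))
    drivers = [make_driver() for _ in range(count)]
    idle = queue.Queue()
    for driver in drivers:
        idle.put(driver)

    def work(section):
        driver = idle.get()  # blocks until some driver is free
        try:
            return fetch(driver, section)
        finally:
            idle.put(driver)  # hand the driver back for the next section

    try:
        with ThreadPoolExecutor(max_workers=count) as executor:
            # map() yields results in the same order as `sections`.
            return list(executor.map(work, sections))
    finally:
        for driver in drivers:
            driver.quit()
```

The drivers here are started one after another; starting them in parallel would shave off a few seconds, at the cost of a little more code.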

charlesdaniels commented 6 years ago

What you want is called Selenium Grid.

However, since it sounds like you are currently choking on setup time, you may be better off setting up a single global driver shared across the entire application. I have had good success with this method in the past.
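A minimal sketch of the single-global-driver approach, assuming a lazily created module-level instance. The names get_driver, start_chrome and start_grid_session are illustrative only. The Grid variant assumes a hub on the default port 4444 and the Selenium 4 style of passing options; older releases took a capabilities dictionary instead.

```python
import atexit
import threading

_driver = None
_driver_lock = threading.Lock()


def start_chrome():
    """Start a local Chrome driver (imported lazily, needs Selenium)."""
    from selenium import webdriver

    return webdriver.Chrome()


def start_grid_session(hub_url="http://localhost:4444/wd/hub"):
    """Alternative factory: ask a Selenium Grid hub for a Chrome session.

    Assumption: a hub is already running at `hub_url`.
    """
    from selenium import webdriver

    return webdriver.Remote(
        command_executor=hub_url, options=webdriver.ChromeOptions()
    )


def get_driver(factory=start_chrome):
    """Return the one shared driver, creating it on first use."""
    global _driver
    with _driver_lock:
        if _driver is None:
            _driver = factory()
            # Close the browser when the interpreter exits.
            atexit.register(_driver.quit)
        return _driver
```

Every caller then uses get_driver() instead of constructing its own driver. A single browser session should only be driven by one thread at a time, so this fits sequential code best; for real concurrency, something like the Grid factory with several sessions is probably the better match.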