rosshamish / classtime

university schedule generation and course data API

Try to multithread schedule generation #36

Closed rosshamish closed 9 years ago

ahoskins commented 9 years ago

Big Boi.

rosshamish commented 9 years ago

Gotta see how fast this bb can go

rosshamish commented 9 years ago

Oh boy, check this out. Big improvement with the threading. This benchmark assesses total time taken to process 5 schedule generation requests:

Synchronous

time: 3.754s

(classtime)ross(master)*$ nosetests tests/test_schedule_generator.py --nologcapture --nocapture --pdb
INFO:   Making schedules for courses ['002896', '001341']       11:08:47  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['001343', '004093', '004096', '006768', '009019']     11:08:47  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['010344', '105014', '105006', '105471', '006708', '010812']       11:08:48  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['006973', '006790', '006974', '098325', '001607', '004104']     11:08:49  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['001343', '004093', '004096', '006768', '009019'] 11:08:49  [classtime.scheduling.schedule_generator]
.
----------------------------------------------------------------------
Ran 5 tests in 3.754s

OK

Threaded

time: 1.279s

(classtime)ross(i36-multithread-schedule-generation)*$ nosetests tests/test_schedule_generator.py --nologcapture --nocapture --pdb
INFO:   Making schedules for courses ['002896', '001341']       11:10:44  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['001343', '004093', '004096', '006768', '009019']     11:10:44  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['010344', '105014', '105006', '105471', '006708', '010812']       11:10:44  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['006973', '006790', '006974', '098325', '001607', '004104']       11:10:44  [classtime.scheduling.schedule_generator]
.INFO:  Making schedules for courses ['001343', '004093', '004096', '006768', '009019']     11:10:45  [classtime.scheduling.schedule_generator]
.
----------------------------------------------------------------------
Ran 5 tests in 1.279s

OK

The threaded version distributes the schedule search among 10 worker threads. At each iteration, it breaks the work into 10 sections, hands one to each worker, then aggregates the results as they come back.
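Roughly, the pattern is this (a minimal sketch, not the actual classtime code - score, the section split, and the top-10 cutoff are all stand-ins):

```python
import concurrent.futures

NUM_WORKERS = 10

def score(schedule):
    # Stand-in for the real schedule-quality metric
    return 0

def search_section(section):
    # Each worker searches its section and returns the best schedules it finds
    return sorted(section, key=score, reverse=True)[:10]

def threaded_search(candidates):
    # Break the work into NUM_WORKERS sections, one per worker thread
    sections = [candidates[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    best = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        # Aggregate the results as they come back from each worker
        for result in pool.map(search_section, sections):
            best.extend(result)
    return sorted(best, key=score, reverse=True)
```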

ahoskins commented 9 years ago

Damn, that's fast. Is there any potential downside to threading? Like, is there a reason sync might be a good thing in the future? (just playing devil's advocate)

rosshamish commented 9 years ago

Haha, turns out it was a false positive. There was a bug which caused only one thread to ever spawn, and since each thread was assigned about 1/5 of the workload, only about 1/5 of the search space was being processed :-1: (this made it very fast though :laughing: )

I fixed the bug, and got real multithreading going, but the multithreaded schedule generation is actually slower - upwards of 6 or 7 seconds using the same 5-request benchmark, compared to less than 4s for the synchronous version.

After googling around, the culprit seems to be Python's Global Interpreter Lock (GIL): only one thread can execute Python bytecode at a time, so the threads aren't actually running concurrently. For IO-bound tasks like waiting for user input or redrawing a GUI, that's fine. For CPU-bound tasks like our schedule generation, it's bad: extra CPU time is spent managing threads, while available CPU time stays the same.
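A standalone toy benchmark shows the effect (this is just a demo of the GIL, not classtime code):

```python
import threading
import time

def spin(n=10**7):
    # Pure-Python CPU-bound work; the GIL is held while bytecode executes
    total = 0
    for i in range(n):
        total += i
    return total

# Sequential: run the work 4 times in a row
start = time.time()
for _ in range(4):
    spin()
print('sequential: %.2fs' % (time.time() - start))

# Threaded: 4 threads, but only one can execute Python bytecode at a time,
# so this takes about as long (or longer, due to thread-switching overhead)
start = time.time()
threads = [threading.Thread(target=spin) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('threaded: %.2fs' % (time.time() - start))
```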

I'm going to look into multiprocessing instead.

rosshamish commented 9 years ago

And to answer your question: multithreading/multiprocessing is inherently more complex - race conditions, shared resources, etc. are all problems that don't exist in a synchronous environment. However, Heroku supports up to 255 processes/threads on a single dyno, and PostgreSQL (which we'll use as a database in production) supports concurrency.

So yes, there are potential issues but we'll keep the synchronous version around (in git history) and we can revert back if problems become unmanageable.

rosshamish commented 9 years ago

So, the performance of the multiprocessing approach seems to depend on two factors: the working set size and the number of worker processes used. I did some good ol' research to look for an optimal combination of the two.

working set size: the number of "best" schedules which are kept around at any one time. A larger working set means less chance of mistakenly discarding a potentially great schedule.

For each working set size, I varied the number of workers used, ran the benchmark (tests/test_schedule_generator.py), and compared those times to the synchronous approach.

Working set of 50 schedules

Synchronous approach: 3.8 seconds with a working set of 50

[chart: pool50sched-gen-perf - benchmark time vs. number of worker processes]

Working set of 200 schedules

Synchronous approach: 12 seconds with a working set of 200

[chart: pool200sched-gen-perf - benchmark time vs. number of worker processes]

Interestingly, the second chart shows that it's possible to quadruple the working set size and still improve performance by about 25%. That requires at least 20 processes, and at least 32 for optimal performance. That's a lot, but we can potentially pull it off in production.
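For reference, the multiprocessing version has roughly this shape, with the two factors as the tunable constants (a sketch - score and the chunking are stand-ins for the real search):

```python
import heapq
from multiprocessing import Pool

NUM_WORKERS = 32        # worker processes in the pool
WORKING_SET_SIZE = 200  # number of "best" schedules kept at any one time

def score(schedule):
    # Stand-in for the real schedule-quality metric
    return 0

def best_in_chunk(chunk):
    # Each worker process keeps only the top schedules from its chunk
    return heapq.nlargest(WORKING_SET_SIZE, chunk, key=score)

def parallel_search(candidates):
    chunks = [candidates[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    pool = Pool(processes=NUM_WORKERS)
    try:
        # Aggregate each worker's best, then trim back down to the working set
        merged = []
        for best in pool.map(best_in_chunk, chunks):
            merged.extend(best)
        return heapq.nlargest(WORKING_SET_SIZE, merged, key=score)
    finally:
        pool.close()
        pool.join()
```

Unlike threads, each worker here is a separate OS process with its own interpreter (and its own GIL), so the chunks really do run in parallel - at the cost of pickling the data back and forth.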

ahoskins commented 9 years ago

Interesting. Love the graphs. So it seems really worth it to multi-process then.

The size of the working set seems very important, especially since the raw set is so large sometimes. There are many thousands (or millions?) of possible schedules for first year engineering? I forget the number you told me earlier.

I sat in the car from Cochrane and back today (and still am), so I was reading about Heroku and dynos. Turns out it's true that if you only have one web dyno, it will shut off after inactivity. So I wonder if using only one dyno in production is realistic...

ahoskins commented 9 years ago

Lol you closed the issue in the time I was typing my message on my iPhone.

rosshamish commented 9 years ago

Haha, comments still work though, all good. I wrote "Closes #36" in one of those commits, so when I pushed them GitHub closed it for me.

I think the number was something like 3 million first year engineering schedules? I could find out again, but it's probably not important - it's somewhere up there. For smaller / more restrictive requests, the number is a lot smaller.

I was also reading about that, and found some services that will ping your site every X minutes to prevent it from idling/sleeping. That's kind of cheating I think - Heroku offers the free tier as a "development" tier, not a production tier. However, we could probably do that initially and see what happens from there.

rosshamish commented 9 years ago

The nice thing though is that we'll be able to scale very easily if need be - our dynos won't depend on each other at all, and Heroku uses a load balancer to distribute requests evenly among them.

We'll need to switch over to a concurrency-safe database though - Heroku offers PostgreSQL for a monthly fee, I believe. SQLite is filesystem-based, which won't work across more than one process/thread/dyno.

ahoskins commented 9 years ago

Right now, is it SQLite behind SQLAlchemy? I knew SQLAlchemy was the ORM; I didn't know SQLite was also involved.

rosshamish commented 9 years ago

As far as I can tell, SQLAlchemy is a layer which abstracts the server's database implementation, similar to how jQuery abstracts the client's browser implementation.

Ya, it's SQLite right now (sqlite:////). The database URI is defined in angular_flask/settings.py. To migrate, we'd change the prefix to postgres:// and make a few other minor changes.
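Concretely, something along these lines in angular_flask/settings.py (a sketch - the SQLALCHEMY_DATABASE_URI name assumes the Flask-SQLAlchemy convention, and the paths/credentials are made up):

```python
# angular_flask/settings.py (sketch - exact contents may differ)

# Development: SQLite, a file on the local filesystem.
# Three slashes for a relative path, four for an absolute one.
SQLALCHEMY_DATABASE_URI = 'sqlite:///classtime.db'

# Production: PostgreSQL on Heroku - same ORM code, different URI.
# SQLALCHEMY_DATABASE_URI = 'postgres://user:password@host:5432/dbname'
```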