Threaded prof.write(), fixes #6

Hey, here's a thing. I wrote it after probably over-instrumenting my code: it cuts saving a capture of around 500k events from minutes down to around 2 seconds on my cheapo i5-3320M laptop, so it's a huge improvement.

The benefit on smaller captures, for example 50k records, is much smaller, but it doesn't seem to hurt performance. I'm leaving this as an opt-in feature because it requires some manual intervention to start the threads, but I don't see much reason to ever not use threaded writes.

Code reviews, opinions etc welcome~

This is an opt-in feature: to enable it, call prof.enableThreadedWrite() at the start of your program. Then, instead of saving each event on the main thread and doing all the serialization work there, events are handed to a pool of worker threads, which serialize them in chunks at the end of the program.
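To make the "serialize in chunks" idea a bit more concrete, here's a rough single-threaded sketch in plain Lua. It is not the code from this PR: splitIntoChunks and serializeChunk are made-up names, the string format is a stand-in for the real serialization, and in the actual feature each chunk would be processed on its own worker thread.

```lua
-- Hypothetical sketch, not jprof's actual implementation.
-- Split a list of events into contiguous chunks, one per worker.
local function splitIntoChunks(events, workerCount)
    local chunks = {}
    local chunkSize = math.ceil(#events / workerCount)
    for w = 1, workerCount do
        local first = (w - 1) * chunkSize + 1
        local last = math.min(w * chunkSize, #events)
        local chunk = {}
        for i = first, last do
            chunk[#chunk + 1] = events[i]
        end
        chunks[w] = chunk
    end
    return chunks
end

-- Stand-in for the work one worker does: turn its chunk into a string.
-- The "name:time" format is an assumption made for this example only.
local function serializeChunk(chunk)
    local parts = {}
    for i, event in ipairs(chunk) do
        parts[i] = tostring(event.name) .. ":" .. tostring(event.time)
    end
    return table.concat(parts, ";")
end
```

Because the chunks are contiguous, joining the serialized strings in worker order keeps the events in their original order.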

Potential improvements to this model could include:

- Writing serialized data to a byte buffer instead of a string. This would save the cost of copying the chunk string between VMs, at the price of added complexity: handling ownership of the buffer, and dealing with data that grows beyond the size of the buffer.
- Incremental serialization. Right now the worker threads wait until the end of the program to start processing events, but there's no reason they can't do that work ahead of time in the background, as long as it doesn't affect the runtime of the main program (it might?).
- Handling file I/O on a background thread, or possibly even in the worker threads themselves. I haven't thought about this one too much, because prof.write() is typically called at the end of the program, where there's not much else going on.
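For the incremental serialization idea, one possible shape is a small batcher that hands events off as soon as a batch fills up, instead of holding everything until the end. This is only a sketch of the idea and not part of this PR: newBatcher, batchSize and dispatch are hypothetical names, and dispatch stands in for whatever would actually send a batch to a worker thread.

```lua
-- Hypothetical sketch of incremental batching; not part of this PR.
-- `dispatch` is called with each full batch (e.g. to pass it to a worker).
local function newBatcher(batchSize, dispatch)
    local pending = {}
    local batcher = {}

    -- Queue one event and hand off the batch once it is full.
    function batcher.add(event)
        pending[#pending + 1] = event
        if #pending >= batchSize then
            batcher.flush()
        end
    end

    -- Hand off whatever is queued; call this once at the end of the
    -- program so a partially filled batch is not lost.
    function batcher.flush()
        if #pending > 0 then
            local batch = pending
            pending = {}
            dispatch(batch)
        end
    end

    return batcher
end
```

Whether this helps would depend on how much the hand-off itself costs on the main thread, which is the open question mentioned above.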
