Closed by ghost 7 years ago
I'll pull this in after I've had some more time to test.
Nice, a lot of solid changes here.
Regarding output to a CSV: I propose we drop Python's built-in CSV writer and simply write the values directly to a file. There is a lot of overhead in the CSV handler, since its aim is to be flexible and quick to use. That's not our use case: we know what the data will look like and the overall format in advance. You can find a few Stack Overflow posts that discuss this performance, but I wanted to compare EmotivWriter against the alternatives to see just how big an impact that wrapper has. Below are the results of running some performance test code I wrote on my machine:
Data shape is (38400, 20)
--original_emotiv
Avg time: 0.2251s (stddev=0.0029s)
Samples per sec: 170591.1190
--python_csv
Avg time: 0.1829s (stddev=0.0013s)
Samples per sec: 209962.6777
--direct_to_fs
Avg time: 0.0374s (stddev=0.0003s)
Samples per sec: 1026835.9575
--direct_to_fs_chunked
Avg time: 0.0297s (stddev=0.0004s)
Samples per sec: 1295064.5118
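For context, here's a minimal sketch of what the three non-EmotivWriter strategies might look like (illustrative only, not the actual test code; `python_csv` uses the stdlib writer, the other two format rows by hand):

```python
import csv

def python_csv(path, rows):
    # Baseline: the stdlib csv module, one writerows() call.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def direct_to_fs(path, rows):
    # Bypass the csv module: format each row ourselves, one write() per row.
    with open(path, "w") as f:
        for row in rows:
            f.write(",".join(map(str, row)) + "\n")

def direct_to_fs_chunked(path, rows, chunk_size=1024):
    # Same formatting, but batch rows into one string per chunk so we
    # issue far fewer write() calls.
    with open(path, "w") as f:
        for start in range(0, len(rows), chunk_size):
            chunk = rows[start:start + chunk_size]
            f.write("\n".join(",".join(map(str, r)) for r in chunk) + "\n")
```

All three produce identical output for simple numeric data; the difference is purely how much Python-level work happens per sample.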
It's pretty clear that we have some significant overhead here, but it's important to understand that multithreading this writer will probably not remove all of it from the main thread, due to Python's GIL. The roughly 6-7x overhead demonstrated above is primarily time spent executing Python code, not writing to the file. While one thread is executing Python code, no other thread can (see: GIL), so multithreading will help, but we still need the writer to minimize its need for user-space CPU time.
I had planned on implementing and submitting a PR for this, hence the performance test.
Edit: Removed a few absolutes :P
Edit 2: Placing quotes around every field is also not insignificant; it adds quite a bit more data to each write. If someone needs full quoting, it makes more sense to just post-process the CSV after recording the data.
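To illustrate the quoting point, the stdlib writer's `quoting` argument controls this directly, and `QUOTE_ALL` inflates every field:

```python
import csv
import io

row = [1, 2.5, "abc"]

# Default behavior: only quote fields that actually need it.
minimal = io.StringIO()
csv.writer(minimal, quoting=csv.QUOTE_MINIMAL).writerow(row)

# Full quoting: wrap every field, numeric or not, in quote characters.
full = io.StringIO()
csv.writer(full, quoting=csv.QUOTE_ALL).writerow(row)

print(minimal.getvalue())  # 1,2.5,abc
print(full.getvalue())     # "1","2.5","abc"
```

For a (38400, 20) recording, that's two extra bytes per field on every single write.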
If we still have issues with performance writing to a CSV, we can launch the writer in a separate process and queue/pipe packets over.
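If it comes to that, the process-based approach might be sketched like this (names are illustrative), piping packets to a writer process over a `multiprocessing.Queue` so the formatting and disk I/O happen in a separate interpreter, outside the main process's GIL:

```python
import multiprocessing as mp

def writer_process(queue, path):
    # Runs in a separate process: drain packets from the queue and write
    # them out, keeping per-sample formatting cost off the main process.
    with open(path, "w") as f:
        while True:
            packet = queue.get()
            if packet is None:  # sentinel: recording finished
                break
            f.write(",".join(map(str, packet)) + "\n")

def record(path, packets):
    queue = mp.Queue()
    proc = mp.Process(target=writer_process, args=(queue, path))
    proc.start()
    for packet in packets:
        queue.put(packet)
    queue.put(None)  # signal the writer to flush and exit
    proc.join()
```

The queue adds serialization overhead per packet, so this only pays off if the writer's formatting cost really is the bottleneck.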
Sounds good to me.
Yeah, I thought about that as well; however, I want to stream the data out of a TCP/UDP server at some point. The threading seems to keep up fine. One of the things that was really slowing down the export was writing to the console in the main thread, which was also skewing the sample rate.
Of course, I'm doing all this on a MacBook Pro, so it might not run as well on other platforms.
On Python 2 it's very consistent (on my system); I think with your suggestion Python 3 will work well too.
Of course, keeping the data in a buffer until the end of the session might work if we catch KeyboardInterrupt and set a flag for the writer to flush it to disk when exiting.
I'm not sure that buffering everything in memory will buy us much, given the memory pressure it would induce and the unease of keeping the data in volatile storage. By chunking the data, we can effectively compromise between the two extremes of write-every-record and buffer-every-record.
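That compromise could look something like this (a sketch with hypothetical names, not the actual writer):

```python
class ChunkedWriter:
    """Buffer records in memory and flush to disk every `chunk_size` rows:
    a middle ground between write-every-record and buffer-every-record."""

    def __init__(self, path, chunk_size=256):
        self._f = open(path, "w")
        self._chunk_size = chunk_size
        self._buffer = []

    def write(self, record):
        self._buffer.append(",".join(map(str, record)))
        if len(self._buffer) >= self._chunk_size:
            self.flush()

    def flush(self):
        # Also called on KeyboardInterrupt / shutdown, so at most one
        # partial chunk of data is ever at risk.
        if self._buffer:
            self._f.write("\n".join(self._buffer) + "\n")
            self._buffer = []

    def close(self):
        self.flush()
        self._f.close()
```

With `chunk_size=256`, a crash loses at most a fraction of a second of samples, while the write-call count drops by two orders of magnitude.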
So the Python 3 performance issue did seem to stem from the GIL, like you were saying; I fixed the sleep times and it performs the same, if not better, now. Also, it writes directly to the file now.
Yeah, chunking would be good too, I suppose.
Lots of stuff updated here.
Might resolve #211
Might resolve #209
Might resolve #207
Might resolve #205
Might resolve #204