Multi Processing Indexer

felixhummel commented 11 years ago

Hi,

I hacked around a bit to make shiva-indexer run on multiple CPUs. "Rough around the edges" would be a compliment, because it breaks lastfm and took some heavy refactoring, but I am happy with the results.

I'm running the following command, with my DB set to /dev/shm/shiva.db to remove some I/O from the timings:

python setup.py develop && rm -f /dev/shm/shiva.db && shiva-indexer > /dev/shm/indexer.log && tail /dev/shm/indexer.log

On master:

Run in 23 seconds. Avg 0.006s/track.
Found 4088 tracks. Skipped: 0. Indexed: 4088.
flac: 14 tracks
mp3: 32 tracks
ogg: 4042 tracks

On my multi-processing branch:

Run in 7 seconds. Avg 0.002s/track.
Found 4088 tracks. Skipped: 0. Indexed: 4088.

Yes, I also removed the counters for now.

The problem is that instance methods do not work with Pool.map without heavy workarounds, because bound methods cannot be pickled in Python 2.
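
A minimal sketch of one common workaround, not the code from the branch: move the call into a module-level function, which Pool.map can pickle by name. The names `Indexer`, `index_track`, `index_track_worker` and `index_all` are hypothetical and only stand in for the real indexer.

```python
from multiprocessing import Pool


class Indexer(object):
    """Hypothetical stand-in for the real indexer class."""

    def __init__(self, prefix):
        self.prefix = prefix

    def index_track(self, path):
        # Placeholder for the real per-file work (reading tags, saving to DB).
        return self.prefix + path


def index_track_worker(args):
    """Module-level wrapper: picklable by name, unlike a Python 2 bound method."""
    prefix, path = args
    # Rebuild the object inside the worker instead of shipping a bound method.
    return Indexer(prefix).index_track(path)


def index_all(prefix, paths, processes=4):
    """Map the wrapper over all paths using a pool of worker processes."""
    pool = Pool(processes)
    try:
        return pool.map(index_track_worker, [(prefix, p) for p in paths])
    finally:
        pool.close()
        pool.join()
```

Calling `index_all("indexed:", ["a.ogg", "b.mp3"])` from under an `if __name__ == "__main__":` guard would run the wrapper in the pool. The cost of this design is that each worker call rebuilds its own `Indexer`, so any shared state (such as counters) has to be collected from the return values.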

Question: Should I run further in this direction? I think another evening and we are back on track. I also began writing some tests for MediaDirs, because I needed a flat file list for map to work.
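
As a rough illustration of what "a flat file list" could look like, here is a sketch that walks a set of directories and returns every audio file as one list. It does not use the real MediaDirs API; the function name `flat_file_list` and the extension filter are assumptions made for this example.

```python
import os


def flat_file_list(media_dirs, extensions=(".flac", ".mp3", ".ogg")):
    """Return a sorted flat list of audio file paths under the given directories."""
    extensions = tuple(extensions)  # str.endswith needs a tuple, not a list
    found = []
    for root_dir in media_dirs:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                # Case-insensitive match on the file extension.
                if name.lower().endswith(extensions):
                    found.append(os.path.join(dirpath, name))
    return sorted(found)
```

A plain list like this can be passed directly to Pool.map, and it is easy to test against a temporary directory tree without shipping music files.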

Cheers,

Felix

tooxie commented 11 years ago

Man, I'm impressed :open_mouth: Great work!

I've been checking out your code. It looks like the indexer has to be rewritten, so we should be careful with that. To begin with, I've set up Travis to test with Python 2.6 and 2.7. It looks good; once that's merged we'll have Travis testing every PR.

Of course, for that to work we need tests first :sweat_smile: I'll add some unit tests, but it's still not clear to me how to test the indexer, and I don't like the idea of including test music files in the project.

Anyway, going back to your multi-processing branch, have you thought of using a non-blocking network I/O framework, like Tornado? It has neat async features that may simplify things quite a bit.

Cool initiative! :+1:

felixhummel commented 11 years ago

Thanks! I'm looking forward to having Travis.

About testing: See #93.

I do not see the point of having non-blocking I/O. Four cores --> four long-running processes via multiprocessing from the stdlib. I find that simple enough for the indexing process.
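
The "one long-running process per core" idea can be sketched with the stdlib alone. This is an illustration under assumptions, not the branch's code: `split_evenly`, `index_chunk` and `run_indexers` are hypothetical names, and the per-chunk work is a placeholder that only counts files.

```python
import multiprocessing


def split_evenly(items, n):
    """Split items into n round-robin chunks of near-equal size."""
    return [items[i::n] for i in range(n)]


def index_chunk(paths, results):
    # Placeholder for the real per-track indexing work.
    results.put(len(paths))


def run_indexers(paths, workers=None):
    """Start one process per chunk and return the total number of files handled."""
    workers = workers or multiprocessing.cpu_count()
    results = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=index_chunk, args=(chunk, results))
        for chunk in split_evenly(paths, workers)
    ]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so workers never block on a full pipe.
    total = sum(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    return total
```

As with any multiprocessing entry point, `run_indexers` should be called from under an `if __name__ == "__main__":` guard. Round-robin chunking keeps the chunks within one item of each other in size, which is enough when tracks take roughly the same time to index.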

Another story would be incremental indexing using inotify or the like.

felixhummel commented 11 years ago

Please have a look at https://github.com/felixhummel/shiva-server/blob/multi-processing/thoughts_about_the_indexer.rst and let me know what you think.
