openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

Python speedfix 2 #68

Closed MichaelCurrie closed 8 years ago

MichaelCurrie commented 8 years ago

Some additional speed fixes relevant to issue #55, made now that I am testing @Ichoran's test file. Basically, I hadn't tested the use case of multiple worms tracked in a single file, and it turned out that df_upsert was quite slow. It's now quite a bit faster: loading the 75 MB @Ichoran file went from ~3.5 minutes down to ~27 seconds.

Ichoran commented 8 years ago

Do you need to rebase off master to get the pep8 fixes that will let the CI pass?

MichaelCurrie commented 8 years ago

It was another pep8 fix I needed... and now it's passing for Python 2.7 and 3.5 but not 3.4. Gah. Now I have to set up a machine that has Python 3.4.

MichaelCurrie commented 8 years ago

The 75 MB Kerr test file with dozens of worms was causing a huge memory footprint, because each segment was getting appended to a master DataFrame and I guess copies were being made. Also, pandas is slow when you successively append like that. So I created a special case: when all data segments are from distinct worms, the "sub data frames" are combined into one DataFrame in a single merge step via reduce. That speeds things up.

The memory footprint was over 3 GB, causing a "Killed" message on Travis CI, although it was fine on my machine with 8 GB. But now htop and testing lead me to believe it stays under 3 GB.
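
For illustration, here is a minimal sketch of the distinct-worms special case described above. It is not the code from this PR: the function name `combine_distinct_worms`, the column names, and the sample data are all hypothetical. It assumes each sub-frame is indexed by time and that no two sub-frames share a column, which is what "distinct worms" would imply.

```python
from functools import reduce

import pandas as pd


def combine_distinct_worms(sub_frames):
    """Fold a list of per-worm frames into one frame with a single reduce.

    Assumes the frames have disjoint columns (one worm per frame);
    DataFrame.join raises ValueError if column names overlap.
    """
    if not sub_frames:
        return pd.DataFrame()
    # Outer join keeps every time point seen by any worm; gaps become NaN.
    return reduce(lambda left, right: left.join(right, how="outer"), sub_frames)


# Hypothetical data: two worms sampled at partly different times.
worm_1 = pd.DataFrame({"w1_x": [1.0, 2.0], "w1_y": [5.0, 6.0]}, index=[0.0, 0.5])
worm_2 = pd.DataFrame({"w2_x": [3.0, 4.0], "w2_y": [7.0, 8.0]}, index=[0.5, 1.0])

combined = combine_distinct_worms([worm_1, worm_2])
```

The point of the sketch is that the list of sub-frames is built first and combined once at the end, instead of appending each segment to a growing master frame. `pd.concat(sub_frames, axis=1)` would be another way to do the final combination.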

MichaelCurrie commented 8 years ago

With f20ac8a,

Switched internally to an ordered dictionary representation of the worms, with the worm ids as keys, rather than a sparse DataFrame. The sparse DataFrame is still available, but it is lazily computed only when requested.
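
A rough sketch of that layout, assuming per-worm frames indexed by time. The class name `WormStore`, its attributes, and the sample data are hypothetical and are not taken from commit f20ac8a; `build_count` exists only to show when the wide frame is actually built.

```python
from collections import OrderedDict

import pandas as pd


class WormStore:
    """Hypothetical container: per-worm frames in an OrderedDict,
    with the wide (sparse) DataFrame built only on first request."""

    def __init__(self):
        self._worms = OrderedDict()  # worm id -> DataFrame indexed by time
        self._wide = None            # cached combined frame
        self.build_count = 0         # illustration only: counts rebuilds

    def add(self, worm_id, frame):
        self._worms[worm_id] = frame
        self._wide = None  # any cached wide frame is now stale

    @property
    def worm_ids(self):
        return list(self._worms)

    @property
    def wide(self):
        # Lazily build one column group per worm; gaps become NaN.
        if self._wide is None:
            if self._worms:
                self._wide = pd.concat(self._worms, axis=1)
            else:
                self._wide = pd.DataFrame()
            self.build_count += 1
        return self._wide


store = WormStore()
store.add("w1", pd.DataFrame({"x": [1.0, 2.0]}, index=[0.0, 0.5]))
store.add("w2", pd.DataFrame({"x": [3.0, 4.0]}, index=[0.5, 1.0]))
builds_before_access = store.build_count  # nothing built yet
wide = store.wide                         # first access triggers the build
wide_again = store.wide                   # served from the cache
```

Loading and saving only touch the dictionary, so the cost of the NaN-padded wide frame is paid only by callers that ask for it.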

Speed improvements:

For Rex's 90-worm file:

8 seconds to load, 27 seconds to save

(fixed from 180+ seconds to load)

Memory footprint is now ~1 GB, down from 4 GB+ with the sparse DataFrame representation. This memory improvement means the Travis CI tests pass, because those machines have a maximum of 3 GB RAM.

MichaelCurrie commented 8 years ago

(closing this in order to rebase this branch to master and then create a new PR)