Closed MichaelCurrie closed 8 years ago
Do you need to rebase off master to get the pep8 fixes that will let the CI pass?
It was another pep8 fix I needed... but now it's passing for Python 2.7 and 3.5 but not 3.4. Gah. Now I have to make a machine that has Python 3.4
The 75 MB Kerr test file with dozens of worms was causing a huge memory footprint, because each segment was getting appended to a master dataframe and I guess copies were being made. Also, pandas is slow when you successively append like that. So I created a special case: when all data segments are from distinct worms, the "sub data frames" are combined into one dataframe in one merge step via `reduce`. That speeds things up.
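A minimal sketch of the distinct-worms special case described above, not the actual code from this PR. The helper names (`make_segment`, `combine_distinct`) and the one-column-per-worm layout are assumptions made for illustration only; the real segments carry more fields.

```python
from functools import reduce

import pandas as pd


def make_segment(worm_id, times, xs):
    """Hypothetical per-worm segment: rows are time points, one column per worm."""
    return pd.DataFrame({worm_id: xs}, index=pd.Index(times, name="t"))


def combine_distinct(segments):
    """Combine segments in one pass, assuming every segment is a distinct worm.

    Because no two segments share a column, each step is a plain outer join
    on the time index; no existing cells ever need to be overwritten.
    """
    ids = [col for seg in segments for col in seg.columns]
    if len(ids) != len(set(ids)):
        raise ValueError("segments share worm ids; use the general upsert path")
    return reduce(lambda left, right: left.join(right, how="outer"), segments)


segments = [
    make_segment("w1", [0, 1], [1.0, 2.0]),
    make_segment("w2", [1, 2], [3.0, 4.0]),
]
combined = combine_distinct(segments)
```

Time points that a worm was not tracked at come out as NaN. For this layout, `pd.concat(segments, axis=1)` would be another way to do the same combination in a single call.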
The memory footprint was over 3 GB, causing a "Killed" message on Travis CI, although it was fine on my machine with 8 GB. But now `htop` and testing lead me to believe it stays under 3 GB.
With f20ac8a, switched internally to an ordered dictionary representation of the worms, with the worm ids as the keys, rather than a sparse DataFrame. The sparse DataFrame is still available but is lazily computed only when requested.
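Roughly, the idea looks like the sketch below. This is an illustration under assumptions, not the code in f20ac8a: the class name `WormStore`, its `add` method, and the single-series-per-worm layout are all hypothetical.

```python
from collections import OrderedDict

import pandas as pd


class WormStore:
    """Hypothetical container: worms keyed by id, DataFrame built on demand."""

    def __init__(self):
        self._worms = OrderedDict()  # worm id -> per-worm time series
        self._df = None              # cached combined DataFrame, built lazily

    def add(self, worm_id, times, xs):
        """Store one worm's data and invalidate any cached DataFrame."""
        index = pd.Index(times, name="t")
        self._worms[worm_id] = pd.Series(xs, index=index, name=worm_id)
        self._df = None

    @property
    def dataframe(self):
        """Build the combined (mostly NaN) DataFrame only when first requested."""
        if self._df is None:
            self._df = pd.concat(list(self._worms.values()), axis=1)
        return self._df


store = WormStore()
store.add("w1", [0, 1], [1.0, 2.0])
store.add("w2", [1, 2], [3.0, 4.0])
df = store.dataframe  # the combined frame is computed here, not during add()
```

Loading and saving only touch the dictionary, so the cost of the wide combined frame is paid only by callers that actually ask for it.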
Speed improvements:
For Rex's 90-worm file:
8 seconds to load (fixed from 180+ seconds to load)
27 seconds to save
Memory footprint is now ~1 GB, down from 4 GB+ with the sparse DataFrame representation. This memory improvement means the Travis CI tests pass, because those machines have a maximum of 3 GB RAM.
(closing this in order to rebase this branch to master and then create a new PR)
Some additional speed fixes relevant for issue #55, made now that I am testing @Ichoran's test file. Basically, I hadn't tested the use case of multiple worms tracked in a single file. It turned out that `df_upsert` was quite slow. It's now quite a bit faster (from ~3.5 minutes to load down to ~27 seconds for the 75 MB @Ichoran file).