tjvr / kurt

Python library for reading/writing MIT's Scratch file format.
https://kurt.tjvr.org
GNU General Public License v3.0

Pickle support #19

Closed bboe closed 10 years ago

bboe commented 10 years ago

Added testing to verify simple projects can be pickled.

Note: At this time only pickle.HIGHEST_PROTOCOL is supported. To support older protocols, all classes that define `__getattr__` or `__slots__` need to be updated to include `__getstate__` and `__setstate__`. This requires changes or monkeypatches to construct.
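For readers unfamiliar with the issue, here is a minimal sketch of what supporting every pickle protocol involves for a class that defines both `__slots__` and `__getattr__`. The `Record` class is a hypothetical stand-in, not a class from kurt or construct, and the code is written for Python 3 even though the thread predates it.

```python
import pickle


class Record(object):
    """Hypothetical stand-in for a class using __slots__ and __getattr__."""

    __slots__ = ("name", "fields")

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields

    def __getattr__(self, key):
        # Only called when normal lookup fails. Refuse dunder and slot names
        # so that lookups on a half-built instance (as during unpickling)
        # raise AttributeError instead of recursing through self.fields.
        if key.startswith("__") or key in Record.__slots__:
            raise AttributeError(key)
        try:
            return self.fields[key]
        except KeyError:
            raise AttributeError(key)

    def __getstate__(self):
        # Explicit state, needed because there is no instance __dict__.
        return {"name": self.name, "fields": self.fields}

    def __setstate__(self, state):
        self.name = state["name"]
        self.fields = state["fields"]


def roundtrip(obj, protocol):
    """Pickle and unpickle obj with the given protocol number."""
    return pickle.loads(pickle.dumps(obj, protocol))


for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    copy = roundtrip(Record("sprite", {"x": 10}), protocol)
    assert copy.name == "sprite" and copy.x == 10
```

The guard in `__getattr__` matters as much as the two state hooks: without it, any attribute lookup that falls through to `__getattr__` before `fields` has been restored would recurse.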

tjvr commented 10 years ago

Interesting! What are you pickling objects for?

bboe commented 10 years ago

I want to cache parsed Scratch files to disk, to cut processing time when working over the same files repeatedly. This PR allows pickling projects, but it may not allow loading the pickled files (I neglected to test that).

bboe commented 10 years ago

Yeah, it doesn't unpickle properly. Hold off on this PR until I get a fix for that.

bboe commented 10 years ago

PR updated. Here's the reason why I want this support:

(hb)bboe@lappy2:Downloads$ time python -c 'import kelp.octopi; import kurt; kurt.Project.load("tmp.oct")'

real    0m10.232s
user    0m10.084s
sys     0m0.122s
(hb)bboe@lappy2:Downloads$ time python -c "import cPickle; cPickle.load(open('/tmp/hairball_cache.pkl'))"

real    0m1.242s
user    0m1.159s
sys     0m0.078s

Ideally the speed improvements should be within kurt itself, but I haven't the time to work on that.
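The caching approach measured above can be sketched roughly as follows. This is not the hairball implementation; `cached_load`, its parameters, and the content-hash cache key are assumptions made for illustration, and `parse` stands in for a slow loader such as `kurt.Project.load`.

```python
import hashlib
import os
import pickle


def cached_load(path, parse, cache_dir):
    """Return parse(path), reusing a pickled result if the file is unchanged.

    The cache key is a hash of the file contents, so an edited file is
    parsed again rather than served from a stale entry.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    cache_path = os.path.join(cache_dir, digest + ".pkl")

    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    result = parse(path)  # the slow step, run once per distinct file
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result
```

Because unpickling can execute arbitrary code, a cache like this should only ever read files that the same user wrote.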

bboe commented 10 years ago

Just as a heads up, I've done a bit of testing now with this pickling support. The speed-up is tremendous.

https://github.com/ucsb-cs-education/hairball/compare/cache

I'm going through and pre-loading all of my data with kurt, which will take about three hours for the 1200 files. With the pickled cache I have added to my library (which depends on this version of kurt), I can process the already-cached portion of the dataset in only seconds.

tjvr commented 10 years ago

> The speed-up is tremendous.

Yes! Kurt's parser isn't terribly efficient.

I need to have a proper look at your pickling patch, and make sure it doesn't break stuff.

By the way, do your projects have large images? I think the 1.4 image-parsing code is possibly the bottleneck, and so needs rewriting anyway. If you could profile it to check, that'd be great.
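A profiling run of the kind requested here could look like the sketch below, using the standard library's `cProfile` and `pstats`. The `profile_call` helper and the `slow_parse` function are hypothetical; in practice the profiled callable would be the project loader.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report lists the ten entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_parse(n):
    # Hypothetical stand-in for an expensive parsing routine.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_parse, 10000)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, including everything they call, at the top of the report, which is the view needed to confirm or rule out a suspected bottleneck.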

bboe commented 10 years ago

> By the way, do your projects have large images? I think the 1.4 image-parsing code is possibly the bottleneck, and so needs rewriting anyway. If you could profile it to check, that'd be great.

I don't think the images are particularly large. Regardless, I now have the band-aid fix I need to do my current work more efficiently, so I am respectfully going to decline your request for profiling.

tjvr commented 10 years ago

> Thus I am respectfully going to decline your request for profiling.

No worries! I should've clarified: I'll certainly merge your PR, once I've tested it. :)

tjvr commented 10 years ago

Sorry for taking so long to merge this. (I got busy...)

Out of interest, does removing line 143 of scratch14/__init__.py solve the problems with requiring pickle.HIGHEST_PROTOCOL? It's only necessary for debugging.

bboe commented 10 years ago

> Out of interest, does removing line 143 of scratch14/__init__.py solve the problems with requiring pickle.HIGHEST_PROTOCOL? It's only necessary for debugging.

I doubt it. Just out of curiosity, is there any other way to access the save history information other than through that attribute?

tjvr commented 10 years ago

> is there any other way to access the save history information

No, but feel free to raise an issue!