jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0
531 stars 117 forks source link

Free cached NDArrays while processing multi-tree models #83

Closed danarmak closed 6 years ago

danarmak commented 6 years ago

When converting a multi-tree model, free each tree's cached NDArrays when we finish processing it. This greatly reduces the intermediate memory use of the PMMLPipeline when the underlying PKL model uses multiple files.

This is a followup of a discussion on the mailing list.

I'm not sure if this needs similar changes anywhere else in the code, perhaps for other model types.

I'm going to be on vacation next week, so if this PR requires nontrivial changes and you don't want to do them yourself, I won't be able to make them until the week of October 1st, but I will reply here and on the mailing list to any messages.

vruusmann commented 6 years ago

I would probably throw in a marker interface of some sort. Perhaps org.jpmml.sklearn.Loadable, which would provide methods #load() and #unload(). Or maybe Loadable is too specific, and in future versions it could be generalized to LoadableResource or similar.

Also, I really like having utility methods. So whatever the marker interface name is going to be, there would be a corresponding *Util class, which would take care of traversing object trees (eg. subclasses of org.jpmml.sklearn.(Py)ClassDict) and performing "deep" resource free operations.

TLDR: I assume that this code change solves your problem to a material degree. So, I shall try to adhere to its ideas, but will probably do some refactorings/generalizations around it.

danarmak commented 6 years ago

You're certainly welcome to expand and refactor this however you like, as long as the basic behavior of releasing the data in each tree in sequence remains. That's all I need.

Do I understand you correctly that you're willing to do this yourself? If so, thanks! As I said, I won't have time to work on this further until October in any event.

vruusmann commented 6 years ago

Do I understand you correctly that you're willing to do this yourself?

Your report/feedback suggests that resource "unloading" solves an important (usability-) problem, so I'd really like to take a few days to think about the design (shallow vs. deep unloading, possibly with a related configuration option), and then do my own implementation.

Maybe the Google Guava library provides some lazy-loading, eager-unloading API.

danarmak commented 6 years ago

Great, thank you!