tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

TextAnalyzer not Serializable #77

Open robgratz29 opened 6 months ago

robgratz29 commented 6 months ago

Tim, a question: is there a reason we can't make TextAnalyzer Serializable? I'm asking because I'm trying to hook FTA into Spark as a custom aggregator. The problem is that the "accumulator" is the TextAnalyzer, and it has to be serializable. I've gotten around it by storing the marshalled JSON representation of the analyzer, but performance grinds to a halt when I have to marshal and unmarshal for every row processed.

So I guess my request is to make the TextAnalyzer class Serializable.

Thanks, Rob

tsegall commented 5 months ago

Rob,

Not quite sure whether to call this an enhancement or a bug :-).

First, though, TextAnalyzer already has serialize() and deserialize() methods. You should be able to use these rather than rolling your own. Have a look at the new test exerciseSerialization() in TestMerge.java.

That said, serialize() and, in particular, deserialize() are slooow. The good news is that with the latest release, 15.5.2, serialize() performance has improved about 15x, from 662μs to 46μs. I also improved deserialize() from 2562μs to 1729μs, a much more modest improvement, so you are still looking at ~2ms per deserialize() call.

Historically all the focus has been on making train() as fast as possible. I am not sure how fast you need it to be?

Inherently, deserialize() is going to be significantly slower; OTOH, it seems to me I should be able to make it somewhat faster.

Do you have a sense of what would be required for it to be 'acceptable'?

Regards, Tim.

robgratz29 commented 5 months ago

I have a workaround that gets around the problem. I wrapped the TextAnalyzer in a class that implements Externalizable and make the serialize/deserialize calls there. This way I only make those calls when the object is actually serialized or deserialized, rather than holding onto the JSON representation, which required a serialize/deserialize every time it was used. You can close out this bug/enhancement if you like.
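
For anyone who lands here later, a minimal sketch of that wrapper idea follows. It is not the actual code from my Spark job. StubAnalyzer is a hypothetical stand-in for TextAnalyzer so the example compiles without FTA on the classpath; the only thing it assumes about the real class is a serialize() that returns a String and a static deserialize(String). AnalyzerHolder and HolderDemo are illustrative names as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical stand-in for TextAnalyzer. Assumption: the real class offers
// a serialize() returning a String and a static deserialize(String).
class StubAnalyzer {
    private long samples;

    void train(String value) { samples++; }

    long getSampleCount() { return samples; }

    String serialize() { return Long.toString(samples); }

    static StubAnalyzer deserialize(String state) {
        StubAnalyzer analyzer = new StubAnalyzer();
        analyzer.samples = Long.parseLong(state);
        return analyzer;
    }
}

// Keeps the live analyzer in memory and converts to or from its serialized
// form only when Java serialization actually runs.
class AnalyzerHolder implements Externalizable {
    private StubAnalyzer analyzer;

    // Externalizable requires a public no-arg constructor.
    public AnalyzerHolder() { this(new StubAnalyzer()); }

    AnalyzerHolder(StubAnalyzer analyzer) { this.analyzer = analyzer; }

    StubAnalyzer get() { return analyzer; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // writeObject rather than writeUTF, which is limited to 65535 bytes.
        out.writeObject(analyzer.serialize());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        analyzer = StubAnalyzer.deserialize((String) in.readObject());
    }
}

class HolderDemo {
    // Pushes the holder through Java serialization and back.
    static AnalyzerHolder roundTrip(AnalyzerHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (AnalyzerHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AnalyzerHolder holder = new AnalyzerHolder();
        for (String value : new String[] {"red", "green", "blue"}) {
            holder.get().train(value); // no serialization cost per row
        }
        System.out.println(roundTrip(holder).get().getSampleCount()); // prints 3
    }
}
```

The point of the design is that train() works on the in-memory analyzer, so the serialize/deserialize cost is paid once per shuffle or checkpoint of the holder rather than once per row.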