openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.04k stars 418 forks source link

Make libpostal thread safe #34

Open thedrow opened 8 years ago

thedrow commented 8 years ago

I'm following up on https://github.com/openvenues/pypostal/issues/3 Making libpostal threadsafe will allow libraries for interpreted languages such as Python and Ruby to release the GIL in order to enable parallelism. It will also allow the usage of libpostal in threaded C/C++ applications.

What isn't threadsafe in libpostal and what's required in order to ensure thread safety?

brianmacy commented 7 years ago

I have actually just seen an issue where I'm getting inconsistent/whacked parses in a multi-threaded process. If I use single threaded or put a static mutex around the libpostal address_parse* call sequence I get consistent parses.

albarrentine commented 7 years ago

@brianmacy libpostal is not thread-safe and is not advertised as such, so like any other C function that is not thread-safe, a mutex or other synchronization is required so only one thread can call the function at a time.

brianmacy commented 7 years ago

No problem. Just seem to be a question. Do we know what specific uses/code is not threadsafe?

albarrentine commented 7 years ago

address_parser_module_setup is definitely not thread-safe, needs to be called once from one thread at the beginning of your program.

There are global variables in libpostal but they're read-only, context and response structs are allocated as part of the address_parser_parse call, so not sure what would be causing the inconsistencies off the top of my head. Feel free to submit pull requests but since the vast majority of libpostal's users are calling from a single-threaded environment, multithreading is not really a priority.

veremenko-y commented 6 years ago

@albarrentine can I suggest to add to README.md that library is not thread safe? I was struggling with heap corruption yesterday until I realized that.

brianmacy commented 2 years ago

Just going to follow up on this as single-threaded address parsing ended up being the main performance bottleneck for me. Due to the size of the model loaded into memory, the fewer processes the better. This is a preliminary patch that seems to resolve it for me with the CRF model (should I be using the other???) by creating thread local versions of the problematic global context. I'm likely not done with this but wanted to share for feedback.

And yes, I know it leaks memory for each thread but that isn't a major concern for my use case as threads only shut down on exit. The benefit right now is that I didn't need to add thread init/destroy calls. This isn't a PR request cause it isn't really ready.

diff --git a/src/address_parser.c b/src/address_parser.c
index 3157055..d9e0cc9 100644
--- a/src/address_parser.c
+++ b/src/address_parser.c
@@ -1637,7 +1637,15 @@ bool address_parser_predict(address_parser_t *self, address_parser_context_t *co
     if (self->model_type == ADDRESS_PARSER_TYPE_GREEDY_AVERAGED_PERCEPTRON) {
         return averaged_perceptron_tagger_predict(self->model.ap, self, context, context->features, context->prev_tag_features, context->prev2_tag_features, token_labels, feature_function, tokenized_str, self->options.print_features);
     } else if (self->model_type == ADDRESS_PARSER_TYPE_CRF) {
-        return crf_tagger_predict(self->model.crf, self, context, context->features, context->prev_tag_features, token_labels, feature_function, tokenized_str, self->options.print_features);
+        static __thread crf_t *_crf_model_thread = NULL;
+       if (_crf_model_thread == NULL) {
+            _crf_model_thread = calloc(1, sizeof(crf_t));
+           memcpy(_crf_model_thread,self->model.crf,sizeof(crf_t));
+           _crf_model_thread->viterbi = uint32_array_new();
+           _crf_model_thread->context =  crf_context_new(CRF_CONTEXT_VITERBI | CRF_CONTEXT_MARGINALS, self->model.crf->num_classes, CRF_CONTEXT_DEFAULT_NUM_ITEMS);;
+       }
+
+        return crf_tagger_predict(_crf_model_thread, self, context, context->features, context->prev_tag_features, token_labels, feature_function, tokenized_str, self->options.print_features);
     } else {
         log_error("Parser has unknown model type\n");
     }
@@ -1658,7 +1666,11 @@ libpostal_address_parser_response_t *address_parser_parse(char *address, char *l
         return NULL;
     }

-    address_parser_context_t *context = parser->context;
+    /*address_parser_context_t *context = parser->context;*/
+    static __thread address_parser_context_t *context = NULL;
+    if (context==NULL)
+        context = address_parser_context_new();
+

     char *normalized = address_parser_normalize_string(address);
     bool is_normalized = normalized != NULL;