openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.03k stars 416 forks source link

SIGSEGV fault in trie_get_node when running the parser multi-threaded #102

Closed jamiehutton closed 7 years ago

jamiehutton commented 8 years ago

Hi there,

We are currently using libpostal + jpostal with spark to perform address parsing. We have hit an issue where libpostal SIGSEV errors when we have more than one executor.

We are really hoping someone might be able to look at this - currently we have a 12 hour address parsing process because we are only able to use one thread - and address parsing is something that is really well suited to multi-threading.

We have written a simple test case in Scala which shows the issue. From looking at the stack trace it shows the error is occurring in the C code (i.e. libpostal not jpostal) and it looks like it is within trie_get_node:

Stack: [0x00007f62a3eff000,0x00007f62a4000000],  sp=0x00007f62a3ffe438,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libpostal.so.0+0x20404]  trie_get_node+0x14

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 329  com.mapzen.jpostal.AddressParser.libpostalParse(Ljava/lang/String;Lcom/mapzen/jpostal/ParserOptions;)[Lcom/mapzen/jpostal/ParsedComponent; (0 bytes) @ 0x00007f62b51ab9c0 [0x00007f62b51ab960+0x60]
J 361 C2 com.quantexa.etl.AddressParserTester$.parse(Ljava/lang/String;)[Lcom/mapzen/jpostal/ParsedComponent; (39 bytes) @ 0x00007f62b51bfea0 [0x00007f62b51bfc60+0x240]
J 366% C1 com.quantexa.etl.AddressParserTester$$anon$1.run()V (44 bytes) @ 0x00007f62b51c5ca4 [0x00007f62b51c5280+0xa24]
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

Our test case is below. This is written in Scala but if hopefully it should be very obvious what is going on - the equivalent java code would be almost the same. If you need us to write this in java we should be able to. It is worth noting that the code below runs for a little while (20-30 seconds) before failing).

import com.mapzen.jpostal.AddressExpander
import com.mapzen.jpostal.AddressParser

object AddressParserTester {

  def main(args: Array[String]): Unit = {

    val address = "123 high street, london, sw11 6ln"

    for (i <- 0 until 8) {
      new Thread(threadParse(address, i)).start()
    }
  }

  def parse(fullAddressText: String) = {
    val e = AddressExpander.getInstance()
    val p = AddressParser.getInstance()
    val exp = e.expandAddress(fullAddressText)
    p.parseAddress(exp.head)
  }

  def threadParse(address: String, id: Int) = new Runnable() {
    def run(): Unit = {
      while (true) {
        println(id+": "+parse(address))
      }
    }
  }

}

Below is the full error log:

hs_err_pid8331.txt

albarrentine commented 8 years ago

Hi Jamie,

So firstly, the libpostal C library is definitely not thread-safe, and will likely will not be any time soon (the "social contract" with C is usually: if a library doesn't explicitly say it's thread-safe, don't use it from multiple threads).

In jpostal, all of the C calls are declared synchronized, which, as far as my understanding of Java's concurrency model goes, serializes calls to the JNI/C methods and prevents multiple threads from accessing them at the same time. So I'm a bit surprised to see threading bugs pop up on the Java/Scala side given that it's effectively single-threaded code. The idea there was simply to make jpostal safe to run in applications that are otherwise multithreaded, like a web server, etc. I can look into why that's happening, but generally if your goal is to parallelize libpostal calls, multithreading is not going to produce any speedup.

The first question to ask is: is libpostal single-threaded actually too slow? These are just from my back-of-the-napkin throughput tests and YMMV, but single-threaded libpostal should be able to do, worst case (long addresses in non-Latin scripts like Cyrillic or Han), about 5k addresses/sec on average with calls to both expand_address and parse_address. For Latin/ASCII addresses like the one above, the expand calls are about 3x faster than that (parsing is about the same). If it's taking 12 hours and libpostal is the only contributing factor in that, the data set would have to be at least 216M addresses per run. OpenAddresses globally is about that large, so it's not unheard of, but if the data set is smaller than that and there's more going on in the pipeline than just calls to libpostal, then it's probably a good idea to profile the entire single-threaded pipeline to see if there's anything else that can be improved.

The second question is: why not use multiple processes? There's often a bit of religion around this in the Java/JVM community that I don't particularly understand, not in the batch processing context anyway. I can certainly see a case for multithreading in libpostal, namely that there would only need to be one copy of the models in memory. Still, if the memory requirements are kept reasonably low (currently ~2GB which I'm trying to reduce further) such that multiple copies can be run on beefier single-purpose machines, I'm not sure the appeal of threading justifies the development/productivity cost.

If running multiple shared-nothing processes is not an option due to JVM weight, etc. the options may be a little more bleak i.e. continuing to run single-threaded or considering a different language like Python, which will also run with Spark but for which the multicore concurrency model is processes not threads.

./al

jamiehutton commented 8 years ago

Hi Al,

Thanks for getting back to me. I thought it would be useful to give you a bit more background on what we are doing as this provides a bit more context around the issue...

The data: We currently are processing a UK data set which has 65 million address records which need parsing. However we are about to get global data which we have been told contains 1bn records (it is not clear yet how many of these records will contain addresses but we can assume high hundreds of millions at least).

Our experience of Libpostal has been fantastic - we agree with you it is very fast and we love the way it works with so many different countries. The challenge for us is that the address parsing/cleansing is only one part of a much more complex set of data cleansing we are doing (in Scala on Apache Spark) - overall this is what takes the 12 hours on the 65m UK records. We are concerned that when we get the global data this process will end up becoming weeks.

However, one of the great things about data cleansing is that it is really easily parallelised. In spark there are the concept of executors and cores. These align with what you describe above as processes and threads. From your description above, it sounds like we should be able to parallelise using multiple executors (as these are different processes). However only being able to give each executor one core to work on is not the typical recommended way of running spark.

It was interesting that you expected Java to serialise calls to the JNI/C methods and prevent multiple threads from accessing them at the same time. Given the speed of libpostal it is fine that the call are serialised but there does seem to be some issues in the native code when we ramp up the threads. I suspect that we are hitting this issue simply because of the very high throughput of spark (and our little test rig above). We suspect that if you had a web server application with very high throughput you would see the same issue. If you were able to look into it that would be amazing but in the meantime we will try the multi-process approach.

When we do get the global data we may want to experiment with re-training the parser too as the data is given to us with some structure (city/state/postcode/country are all pre-parsed), but we can probably pick that up in another thread/offline!

Jamie

albarrentine commented 8 years ago

Ok, so it seems like that can be written as two different jobs with two different SparkConfs. The first job has num-executors=1, so effectively each machine is a single-threaded process (can still parallelize at the machine level), and just runs the addresses through libpostal and outputs them to a textFile. The second job does everything downstream from that and can use threads.

Re: the threading issue, two things came up while looking over the Java implementations again.

  1. The Singleton pattern used was not thread-safe. It now uses the lazy-initialization holder pattern which is both performant and thread-safe.
  2. Inspecting the bytecode from the compiled classes, it looks like monitorenter/monitorexit opcodes, which do the actual locking in the JVM, weren't getting produced by simply declaring the native methods synchronized. The two ways to do that seem to either be in a synchronized block on the Java side or with (*env)->MonitorEnter(obj) and (*env)->MonitorExit(obj) on the JNI/C side. I implemented the former as it's more standard and easier for people familiar with Java to read.

Try pulling the latest commit and let me know if that fixes the multithreaded test.

Re: re-training, I'm strongly recommending against training on private data sets, even large ones. Libpostal's designed to be an application-level library, not a generalized NLP utility for training custom parsers. There's nothing in the license that prohibits doing that, but I obviously don't want to be put in a position of solving everybody's data cleanup problems for free. That said, for researchers and others that want to dig in, this is one of the first open source NLP projects where a large quantify of training data is freely available and every step in the preprocessing pipeline is open-source, if fairly resource-intensive and sparsely-documented.

It should be fine to use the existing parser (there's also a much-improved version in development in the parser-data branch) on whichever portion of the address is not already pre-parsed in the 1B data set.

albarrentine commented 7 years ago

Assuming this fixed the threading bug?

There's a new singleton implementation in the Java binding which is also thread-safe (double-lock checking with volatile), but may want to check that this doesn't happen again. If not, I'll close the issue.

albarrentine commented 7 years ago

Closing

jamiehutton commented 7 years ago

Hi Al. Yes this all working perfectly. Thanks for your help Jamie