perseids-project / perseids_docs

Project Documentation For Perseids

performance analysis - large XML documents #160

Open balmas opened 10 years ago

balmas commented 10 years ago

This is going to be an ongoing issue to document findings and solutions to performance problems with large XML files.

First tests used treebank documents ranging in size from 22K to 1MB and the arethusa perseidsPerf test module, which times the results of issuing multiple simultaneous requests for different documents from the same browser. The POST from arethusa updated the entire file each time.

Response times ranged from 0.6s for the smallest document to 6s for the largest when POSTed without any other requests competing for resources. Response times more than doubled with 10 requests issued simultaneously (tripling for the larger documents). Network latency and browser bandwidth account for some of this (discrepancies between server-reported and browser-reported response times varied by 1-3 seconds). Server CPU utilization spiked at >150%.

Based on prior similar problems, I originally hypothesized the performance degradation was related to the Grit module's system calls to git, but testing with the Rails3 branch and jGit didn't show any change in behavior. For this particular test of the Treebank files, I don't think git is the bottleneck.

Next test was to see if throwing hardware at it would help -- the sosol app server is underpowered CPU-wise, running a memory-optimized server with 16GB RAM but only 2 CPUs. I tested with various AWS instance configurations (ranging from 4-8 CPUs and 16-30GB RAM). The best performing was a CPU-optimized instance with 8 CPUs/16GB RAM. This made sosol performance generally very snappy but didn't help significantly with the POSTs of multiple simultaneous large files.

I then benchmarked each step of a commit of a Treebank file. This was illuminating. Whenever a file is updated we currently parse it as a REXML document and then pretty-print it for consistent formatting. This consumed nearly half of the response time, and a quarter or more of the remainder came from the valid-XML check, which runs the file through a transform to renumber it if necessary and validates it against the schema.
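Roughly, the per-step timing was gathered along these lines (a minimal sketch -- the file name is a placeholder and the steps stand in for the actual identifier code; the renumbering transform and schema validation were timed the same way):

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical sketch: time the two steps that dominated the commit --
# REXML parsing and pretty-printing of the updated document.
xml_string = File.read('treebank_large.xml')

Benchmark.bm(15) do |bm|
  doc = nil
  bm.report('parse (REXML)') do
    doc = REXML::Document.new(xml_string)
  end
  bm.report('pretty-print') do
    formatter = REXML::Formatters::Pretty.new
    formatter.compact = true
    out = ''
    formatter.write(doc, out)
  end
end
```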

Although it's arguable whether we need to parse the XML of the document when the entire document is being sent, we will still need to parse for partial updates, so getting rid of this step isn't entirely viable. Testing with Nokogiri instead of REXML indicates that Nokogiri performs much better (down to 40ms).
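For comparison, the Nokogiri equivalent of the parse-and-reformat step looks roughly like this (a sketch, not the actual identifier code; the file name is a placeholder):

```ruby
require 'nokogiri'

# Hypothetical sketch: parse and re-serialize with Nokogiri instead of REXML.
xml_string = File.read('treebank_large.xml')

# Stripping blank text nodes at parse time lets Nokogiri re-indent on
# serialization, giving consistently formatted output without REXML's
# expensive Formatters::Pretty pass.
doc = Nokogiri::XML(xml_string) { |config| config.noblanks }
formatted = doc.to_xml(indent: 2)
```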

It's clear that the whole XML parsing code of the treebank identifier (and other cite identifiers, including alignment and oac) needs to be optimized.

However, this might just push the problem further down the chain. With the bottleneck on parsing removed, a nearly equivalent delay seems to come from elsewhere between the time the document is updated and the time it is rendered for the response. (This requires further evaluation on an isolated system.)

LFDM commented 10 years ago

The atrocious performance of REXML is no surprise -- it has a history of performing badly, and it gets worse and worse with larger documents.

Making the switch to Nokogiri (which is probably the fastest XML parsing implementation available for jruby -- although faster solutions exist for specific tasks, Nokogiri's performance is the most consistent) is generally a very good idea. Its API is also a lot nicer and can help with optimizing things further down the road.

I'd start by refactoring all REXML statements, hiding the calls behind a utility class of our own that handles all XML operations in SoSOL. Once this is done, we can replace the XML parsing unit in this class with Nokogiri.
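Something along these lines (the class and method names are just illustrative, not existing SoSOL code):

```ruby
require 'rexml/document'

# Hypothetical sketch of a single entry point for XML handling in SoSOL.
class XmlHelper
  # Parse an XML string. Today this returns a REXML document; once all
  # callers go through this class, the body can be swapped for Nokogiri
  # without touching the identifier classes.
  def self.parse(xml_string)
    REXML::Document.new(xml_string)
  end

  # Pretty-print a parsed document for consistent formatting on commit.
  def self.pretty_print(doc)
    out = ''
    formatter = REXML::Formatters::Pretty.new
    formatter.compact = true
    formatter.write(doc, out)
    out
  end
end
```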

Depending on the task at hand, different strategies should probably be used to optimize performance further. E.g. saving the document: if the goal is only reformatting the XML string, it's probably best to avoid Nokogiri or a similar XML parser altogether, because allocating the complete DOM tree adds overhead that isn't needed in such a case -- the tree is immediately thrown away without any real work having been done on it. Forking another process and calling a fast C parser directly might perform even better for such a task -- something like xmllint --format, which is available from the libxml2-utils package. I don't know whether that will be considerably faster, but it's probably worth trying; a rough sketch follows below.
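A rough sketch of what that could look like (assumes xmllint from libxml2-utils is on the PATH; the helper name is made up):

```ruby
require 'open3'

# Hypothetical sketch: reformat an XML string by piping it through
# xmllint --format instead of building a DOM tree in Ruby.
def reformat_xml(xml_string)
  formatted, status = Open3.capture2('xmllint', '--format', '-',
                                     stdin_data: xml_string)
  raise "xmllint failed: #{status.exitstatus}" unless status.success?
  formatted
end
```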

Other tasks, like pulling specific metadata out of the XML (such as the annotator name), will perform better if we retrieve the info with a SAX parsing style -- Nokogiri's API for this is extremely simple and a joy to use; see the sketch below.
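For example, something along these lines (a sketch -- the element name and file are placeholders for whatever the treebank format actually uses):

```ruby
require 'nokogiri'

# Hypothetical sketch: collect annotator names from <annotator> elements
# via SAX, without building the whole DOM tree.
class AnnotatorScanner < Nokogiri::XML::SAX::Document
  attr_reader :annotators

  def initialize
    @annotators = []
    @in_annotator = false
  end

  def start_element(name, _attrs = [])
    @in_annotator = true if name == 'annotator'
  end

  def characters(text)
    @annotators << text.strip if @in_annotator && !text.strip.empty?
  end

  def end_element(name)
    @in_annotator = false if name == 'annotator'
  end
end

scanner = AnnotatorScanner.new
Nokogiri::XML::SAX::Parser.new(scanner).parse(File.read('treebank_large.xml'))
puts scanner.annotators
```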

Abstracting all the XML code is probably the first step in any case -- serious performance testing will be much easier afterwards.

As for the last paragraph -- I'd be hesitant to put the blame on SoSOL here; it could still be that processing of large responses on the client side adds some overhead that distorts the results a bit. The impact of this was largely cut down by some modifications I made to some angular code, but it won't run without any overhead at all. I'd need to look at this in more detail again to see to what degree this is really an issue.

balmas commented 10 years ago

I agree completely in principle with abstracting all XML parsing throughout SoSOL. In reality it will take quite a while to implement and will require many more unit tests than we have right now.

So, since treebanking is the priority for the fall, I'm starting by refactoring the XML handling in the treebank cite identifier class. I'll move this into a general XML parsing class for sosol as soon as the fall semester priorities are tackled.