lanto03 / couchdb-python

Automatically exported from code.google.com/p/couchdb-python

Bad performance creating documents, due to the Nagle algorithm. Solution: TCP_NODELAY #193

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create CouchDB documents sequentially using the POST method, without delay, 
in one thread of execution.

What is the expected output? What do you see instead?
With curl we are able to do around 100 docs/s, but with couchdb-python the 
limit is around 12 docs/s. The problem is the Nagle algorithm (see the links below).

It should be possible to pass the option TCP_NODELAY to the underlying socket 
connecting to the couchdb database.
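A minimal sketch of what that option looks like on a plain httplib connection 
(the host and port are assumptions for illustration):

    import socket
    import httplib

    # Sketch: enable TCP_NODELAY on the connection's socket right after
    # connecting, so small request packets are sent immediately instead
    # of being held back by Nagle's algorithm.
    conn = httplib.HTTPConnection('localhost', 5984)
    conn.connect()
    conn.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)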

What version of the product are you using? On what operating system?

CouchDB 0.8 and 0.9dev
httplib2 0.7.1 and 0.6.0

Please provide any additional information below.

http://code.google.com/p/httplib2/issues/detail?id=28
http://code.google.com/p/httplib2/issues/detail?id=91
http://www.cmlenz.net/archives/2008/03/python-httplib-performance-problems (see the comment by Evan Jones)

Original issue reported on code.google.com by daniel.g...@wavilon.com on 4 Aug 2011 at 10:06

GoogleCodeExporter commented 8 years ago
The operating system is Ubuntu 10.04.3 LTS, Linux 2.6.32-33-generic #70-Ubuntu 
SMP Thu Jul 7 21:09:46 UTC 2011 i686 GNU/Linux

Original comment by daniel.g...@wavilon.com on 4 Aug 2011 at 10:08

GoogleCodeExporter commented 8 years ago
Correction: I realized that couchdb-python is not using httplib2, but httplib. 
httplib is part of the Python standard library; on my system I have Python 2.6.5.

Original comment by daniel.g...@wavilon.com on 4 Aug 2011 at 11:18

GoogleCodeExporter commented 8 years ago
I have tried couchdb-python-curl (1.0.14p2, 
http://code.google.com/p/couchdb-python-curl) and it solves my problem: with 
it I get 150 documents/s.

couchdb-python-curl is a fork of couchdb-python (somewhat buggy, I had to 
correct a couple of small errors), but using pycurl instead of httplib seems to 
increase performance by at least an order of magnitude.

Original comment by daniel.g...@wavilon.com on 4 Aug 2011 at 11:23

GoogleCodeExporter commented 8 years ago
Confirmed on Gentoo Linux 3.0.0 against CouchDB 1.1.0 release using 
couchdb-python from tip.

Python interpreters:
python-2.4.5 (simplejson with C ext)
python-2.7.2
pypy-1.5 (jit) 

All of them showed the same result: 12 docs per second when saving 1K simple 
documents.
The test script, cProfile stats and report are attached.

It looks like we're being held up somewhere within the _socket module. PyCURL 
is based on a different C library that is optimized for the HTTP protocol, 
which is why its results could differ.

But instead of doing such benchmarks, it is better to use the dedicated API 
for modifying many documents at once:
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API#Modify_Multiple_Documents_With_a_Single_Request
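For illustration, a bulk save with couchdb-python's Database.update() might 
look like this (the server URL, database name and document contents are 
assumptions):

    import couchdb

    # Sketch: save 1000 small documents with a single _bulk_docs request
    # instead of 1000 individual POSTs.
    server = couchdb.Server('http://localhost:5984/')
    db = server['benchmark']

    docs = [{'seq': i, 'payload': 'x' * 100} for i in range(1000)]
    for success, doc_id, rev_or_exc in db.update(docs):
        if not success:
            print 'failed to save %s: %s' % (doc_id, rev_or_exc)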

Original comment by kxepal on 5 Aug 2011 at 5:39

Attachments:

GoogleCodeExporter commented 8 years ago
Regarding bulk inserts: my application is creating documents on the fly, based 
on external events. I have no control over how fast those events happen. They 
could be coming at a rate of 100 events/s, or at a rate of 1 event/minute.

Using bulk inserts, I would gather these documents in a list, and when a 
certain threshold is reached, I would send them as a Bulk request to couchdb.

The easy solution is to implement the threshold based on a number of documents. 
The problem is that, with this approach, slow events will pile up in my list 
and be sent to couchdb much later. The latency of my application will be very 
high; indeed, the latency is unbounded, since nobody can guarantee that the 
event which finally pushes the list over the threshold will ever be generated.

The best solution would be to implement a combination of a time-based and a 
quantity-based threshold, say every 1s, maximum 100 documents. I am certain that 
this approach would solve my problems, and probably even increase the maximum 
throughput over the 150 docs/s that I am reaching with couchdb-python-curl.
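A minimal sketch of such a combined threshold (the class name and parameters 
are illustrative, not part of couchdb-python; assumes Python 2.6+ and a 
couchdb-python Database object):

    import threading

    class BufferedWriter(object):
        # Buffers documents and flushes them via one bulk update when
        # either max_docs documents have accumulated or max_delay seconds
        # have passed since the first buffered document.
        def __init__(self, db, max_docs=100, max_delay=1.0):
            self.db = db
            self.max_docs = max_docs
            self.max_delay = max_delay
            self.buffer = []
            self.lock = threading.Lock()
            self.timer = None

        def create(self, doc):
            with self.lock:
                self.buffer.append(doc)
                if len(self.buffer) >= self.max_docs:
                    self._flush_locked()
                elif self.timer is None:
                    # First document in an empty buffer: arm the latency timer.
                    self.timer = threading.Timer(self.max_delay, self.flush)
                    self.timer.start()

        def flush(self):
            with self.lock:
                self._flush_locked()

        def _flush_locked(self):
            if self.timer is not None:
                self.timer.cancel()
                self.timer = None
            if self.buffer:
                self.db.update(self.buffer)  # one _bulk_docs request
                self.buffer = []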

But suddenly a very simple application has become much more complicated: I 
have to fire timers and implement a somewhat tricky threshold algorithm. It is 
probably the right way to go, but it makes simple applications suffer 
unnecessarily from very low throughput, compared to other couchdb libraries 
out there.

Original comment by daniel.g...@wavilon.com on 5 Aug 2011 at 7:35

GoogleCodeExporter commented 8 years ago
I have implemented a bulk insert using a threshold based on a number of 
documents (100), and these are my new metrics:

100000 entries, 63.926536 seconds, 1564.295614 entries/s

That is over two orders of magnitude of improvement compared to my original 
implementation.

I still have to solve the latency problem with timers, but the improvement is 
impressive!

Original comment by daniel.g...@wavilon.com on 5 Aug 2011 at 8:04

GoogleCodeExporter commented 8 years ago
Yes, I see your problem. 12 docs/s is far too low, so I'd like to investigate 
why it happens. As a starting point, I've found a Python issue about the same 
thing: http://bugs.python.org/issue3766

Original comment by kxepal on 5 Aug 2011 at 8:07

GoogleCodeExporter commented 8 years ago
OK, adding the following after the conn.connect() line:
>>> conn.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
improved the speed to 23 docs per second for me. Twice as fast, but not 
exactly overwhelming(:

Original comment by kxepal on 5 Aug 2011 at 9:01

GoogleCodeExporter commented 8 years ago
I have implemented the solution with the combined timer / document-count 
threshold, and it is working fine. I am getting up to 5000 docs/s, depending 
on how many documents are buffered. In my experience the best throughput is 
reached when buffering around 1000 documents; further buffering just flattens 
the curve.

Your mileage may vary, probably depending on the size of the documents that 
you are using.

I have set the timeout at around 0.5s, so that I have reasonably low latency. 
I think I can live with that, especially with such great throughput.

Original comment by daniel.g...@wavilon.com on 5 Aug 2011 at 10:57

GoogleCodeExporter commented 8 years ago
@daniel
Are you talking about the curl solution? Just to make things clear.

Original comment by kxepal on 5 Aug 2011 at 11:02

GoogleCodeExporter commented 8 years ago
No, I have reverted to using couchdb-python. Now my "create" routine buffers 
the documents until the threshold is reached or the timer expires. Find 
attached the implementation:

Original comment by daniel.g...@wavilon.com on 5 Aug 2011 at 11:07

Attachments:

GoogleCodeExporter commented 8 years ago
Hmmm, great results!
Using pure sockets I got a rate of 530 docs per second, but as I add the 
HTTP-specific functions and checks (which I would have to), that rate will 
get lower and lower. In the end, this experiment would produce yet another 
httplib, with very uncertain prospects.

I suppose all that can be done on the couchdb-python side is to add the 
socket.TCP_NODELAY option to improve performance somewhat, provided it doesn't 
create any problems on other OSes.

Original comment by kxepal on 5 Aug 2011 at 11:37

GoogleCodeExporter commented 8 years ago
@kxepal: it is probably not advisable to release the changes unless you are 
very comfortable with the implementation. As I understand from your previous 
comments, you are hacking a bit on the lower layers of couchdb-python. Maybe 
it would be easier to use another library instead of httplib? What about 
PyCurl, as couchdb-python-curl is doing? I am no longer using it, since it 
seems to be an inactive project, but in my tests it improved my metrics. I 
must say that it was quite buggy, with obvious errors all around the code.
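For context, a rough sketch of a single-document POST through pycurl/libcurl, 
along the lines of what couchdb-python-curl does instead of going through 
httplib (the server URL and database name are assumptions):

    import json
    import pycurl
    from StringIO import StringIO

    # Sketch: POST one JSON document to CouchDB via libcurl.
    response = StringIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, 'http://localhost:5984/benchmark')
    curl.setopt(pycurl.POST, 1)
    curl.setopt(pycurl.POSTFIELDS, json.dumps({'type': 'event'}))
    curl.setopt(pycurl.HTTPHEADER, ['Content-Type: application/json'])
    curl.setopt(pycurl.WRITEFUNCTION, response.write)
    curl.perform()
    curl.close()
    print response.getvalue()  # e.g. {"ok":true,"id":"...","rev":"..."}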

As you mentioned before, the real solution is to do bulk updates, and the very 
low performance of single inserts will force any user to walk the one and only 
path. :)

It is nevertheless very frustrating for novices to see such abysmal 
performance for single inserts, especially compared with other tools and other 
libraries. I got my early performance numbers from ab (apache benchmark) and 
curl (the binary, in a loop), and both are much more performant than 
couchdb-python. I do not know about the internal implementation of ab, but 
curl is certainly doing single inserts, since I am spawning a new curl process 
for each insert - and even with that overhead, it was beating couchdb-python 
easily.

Original comment by daniel.g...@wavilon.com on 5 Aug 2011 at 2:09

GoogleCodeExporter commented 8 years ago
> probably it is not advisable to release the changes, unless you are very 
comfortable with the implementation.
But it could produce a nice line in the change log:
- Documents now save twice as fast!
(;

> What about PyCurl, as couchdb-python-curl is doing? 
That's an interesting solution, and I at least may use it in a very high-load 
project, after first refactoring the couchdb.http.Session.request method.

Personally, I've never suffered from this issue, since I use bulk updates for 
large amounts of documents and/or a task queue with a pool of worker processes 
to handle many data sources. It was just interesting to find out why things 
work the way they do and what could be done to change the situation.

Anyway, the final decision on what to do is up to Matt and Dirkjan(:

Thank you for sharing your experience, @daniel!

Original comment by kxepal on 5 Aug 2011 at 4:23

GoogleCodeExporter commented 8 years ago
Related discussion on the CouchDB user mailing list:
http://thread.gmane.org/gmane.comp.db.couchdb.user/14921/focus=14921

Original comment by kxepal on 22 Aug 2011 at 8:00

GoogleCodeExporter commented 8 years ago
I ran into this same issue. The problem is indeed the Nagle algorithm, but the 
correct fix is not to disable it with setsockopt(), as this may have other 
consequences for the network and is slightly unportable.

The correct approach is to send HTTP headers and body in a single packet when 
possible. This can be achieved with the following patch:

diff --git a/couchdb/http.py b/couchdb/http.py
--- a/couchdb/http.py
+++ b/couchdb/http.py
@@ -261,22 +261,34 @@
                     time.sleep(delay)
                     conn.close()

+        def _send_headers_and_body(body):
+            # Send the headers and body in a single packet to avoid
+            # slowdown caused by delayed ACK and the Nagle algorithm.
+            # See issue #193.
+            if sys.version_info < (2, 7):
+                conn.endheaders()
+                conn.send(body)
+            else:
+                conn.endheaders(body)
+
         def _try_request():
             try:
                 conn.putrequest(method, path_query, skip_accept_encoding=True)
                 for header in headers:
                     conn.putheader(header, headers[header])
-                conn.endheaders()
                 if body is not None:
                     if isinstance(body, str):
-                        conn.send(body)
+                        _send_headers_and_body(body)
                     else: # assume a file-like object and send in chunks
+                        conn.endheaders()
                         while 1:
                             chunk = body.read(CHUNK_SIZE)
                             if not chunk:
                                 break
                             conn.send(('%x\r\n' % len(chunk)) + chunk + '\r\n')
                         conn.send('0\r\n\r\n')
+                else:
+                    conn.endheaders()
                 return conn.getresponse()
             except BadStatusLine, e:
                 # httplib raises a BadStatusLine when it cannot read the status

The message_body argument to HTTPConnection.endheaders() is undocumented, but I 
believe it appeared in Python 2.7. I'll make sure it is added to httplib's 
documentation.

Original comment by akhern on 30 Sep 2011 at 7:57

GoogleCodeExporter commented 8 years ago
Forgot to say: This approach boosted the performance on my machine by a factor 
of 8, from 20 docs/sec to 160 docs/sec.

Original comment by akhern on 30 Sep 2011 at 8:00

GoogleCodeExporter commented 8 years ago
@akhern, nice find!

Results for me (Gentoo Linux 3.0.4, CouchDB 1.1.0) using the test script [1] 
for 10000 docs:
Python 2.7:
 default options: ~22 dps
 default options + patch: ~45 dps
 patch + server nodelay: ~230 dps
 patch + server nodelay + client nodelay: ~200 dps

Results for Python 2.4:

 default options: ~22 dps
 default options + patch: ~22 dps
 patch + server nodelay: still ~22 dps
 patch + server nodelay + client nodelay: ~220 dps (sic!)

PyPy shares Python 2.7 results.

[1] - http://code.google.com/p/couchdb-python/issues/attachmentText?id=193&aid=1930004000&name=test.py&token=30199306482030f1894eecc4e5d831d9

Original comment by kxepal on 30 Sep 2011 at 9:01

GoogleCodeExporter commented 8 years ago
Just for the record, the message_body argument of endheaders() is now properly 
documented:

http://docs.python.org/library/httplib.html

Original comment by akhern on 6 Oct 2011 at 3:04

GoogleCodeExporter commented 8 years ago
@akhern thanks for the patch and the nice find; I had no idea endheaders() 
took an optional argument in 2.7+. I've applied a slightly modified version of 
the patch. Unfortunately, I forgot to attribute the commit to you - really 
sorry about that!

Note: I also added a really simple performance testing script, perftest.py, to 
help spot any regressions, or just to get a quick overview of performance 
across different platforms & versions.

Original comment by matt.goo...@gmail.com on 9 Oct 2011 at 11:52

GoogleCodeExporter commented 8 years ago
@Matt,

Thanks for the test script; I'll try to run tests against various situations 
now. However, it doesn't work with Python 2.4, because there a try statement 
cannot combine except and finally clauses - the finally must stand alone. 
Patch attached.
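For anyone unfamiliar with that limitation, a small illustration (not the 
actual patch): Python 2.5+ accepts a single try/except/finally statement, 
while Python 2.4 requires the nested form:

    # Works on Python 2.4: the except and finally parts are nested.
    f = open('perftest.log', 'w')
    try:
        try:
            f.write('running tests\n')
        except IOError:
            pass
    finally:
        f.close()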

Original comment by kxepal on 9 Oct 2011 at 12:09

Attachments:

GoogleCodeExporter commented 8 years ago
Using a "slightly" improved perftest.py to add nodelay patch I've got next 
results:

C:\Documents and Settings\ash\projects\couchdb-python>python perftest.py -c 10000
sys.version : '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
sys.platform : 'win32'
server.version : u'1.1.0'
* [create_bulk_docs_nodelay] Create lots of docs, lots at a time ... 1862.34s (5.37s rps)
* [create_doc] Create lots of docs, one at a time ... 55.31s (180.79s rps)
* [create_doc_nodelay] Create lots of docs, one at a time with setup nodelay ... 57.88s (172.79s rps)
* [create_bulk_docs] Create lots of docs, lots at a time ... 1666.92s (6.00s rps)

kxepal@ashdarh ~/projects/couchdb-python/default $ python2.4 perftest.py -c 10000
sys.version : '2.4.6 (#1, May 26 2011, 00:41:47) \n[GCC 4.4.5]'
sys.platform : 'linux2'
CouchDB : '1.1.0'
* [create_bulk_docs_nodelay] Create lots of docs, lots at a time ... 1404.21s (7.12s dps)
* [create_doc] Create lots of docs, one at a time ... 445.13s (22.47s dps)
* [create_doc_nodelay] Create lots of docs, one at a time with setup nodelay ... 44.08s (226.86s dps)
* [create_bulk_docs] Create lots of docs, lots at a time ... 1385.36s (7.22s rps)

kxepal@ashdarh ~/projects/couchdb-python/default $ pypy-c1.5 perftest.py -c 10000
sys.version : '2.7.1 (?, Aug 03 2011, 16:22:48)\n[PyPy 1.5.0-alpha0 with GCC 4.4.5]'
sys.platform : 'linux2'
server.version : u'1.1.0'
* [create_bulk_docs_nodelay] Create lots of docs, lots at a time ... 1546.17s (6.47s rps)
* [create_doc] Create lots of docs, one at a time ... 36.96s (270.60s rps)
* [create_doc_nodelay] Create lots of docs, one at a time with setup nodelay ... 41.03s (243.72s rps)
* [create_bulk_docs] Create lots of docs, lots at a time ... 1546.21s (6.47s rps)

kxepal@marifarai ~/couchdb-python $ python2.7 perftest.py -c 10000
sys.version : '2.7.2 (default, Sep 25 2011, 18:21:53) \n[GCC 4.5.3]'
sys.platform : 'linux2'
server.version : '1.1.0'
* [create_bulk_docs_nodelay] Create lots of docs, lots at a time ... 771.20s (12.97s rps)
* [create_doc] Create lots of docs, one at a time ... 24.80s (403.18s rps)
* [create_doc_nodelay] Create lots of docs, one at a time with setup nodelay ... 30.85s (324.12s rps)
* [create_bulk_docs] Create lots of docs, lots at a time ... 750.10s (13.33s rps)

kxepal@marifarai ~/couchdb-python $ python2.6 perftest.py -c 10000
sys.version : '2.6.7 (r267:88850, Sep 25 2011, 23:07:39) \n[GCC 4.5.3]'
sys.platform : 'linux3'
server.version : u'1.1.0'
* [create_bulk_docs_nodelay] Create lots of docs, lots at a time ... 1426.12s (7.01s rps)
* [create_doc] Create lots of docs, one at a time ... 427.35s (23.40s rps)
* [create_doc_nodelay] Create lots of docs, one at a time with setup nodelay ... 27.23s (367.28s rps)
* [create_bulk_docs] Create lots of docs, lots at a time ... 1550.66s (6.45s rps)

WARNING: this perftest.py assumes that there are no socket_options defined in 
the CouchDB config. Test results may differ if there are.

Original comment by kxepal on 9 Oct 2011 at 5:35

Attachments:

GoogleCodeExporter commented 8 years ago
Never mind the Python 2.4 test result labels; I just forgot to fix the output: 
I ran the tests first and changed the output strings later.

Original comment by kxepal on 9 Oct 2011 at 5:40

GoogleCodeExporter commented 8 years ago
@kxepal thanks for the Python 2.4 fix. Committed.

Original comment by matt.goo...@gmail.com on 11 Oct 2011 at 9:48

GoogleCodeExporter commented 8 years ago
Just committed a workaround that avoids Nagle's algorithm for supported Pythons 
<2.7. It also removes the need for the previous 2.7-specific fix, simplifying 
the real code a little.
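For reference, a minimal sketch of how such a workaround can look on older 
Pythons (an illustration, not necessarily the committed code): capture the 
header bytes that endheaders() would write, then transmit headers and body in 
a single send() call:

    import httplib

    def send_headers_and_body(conn, body):
        # Emulate Python 2.7's endheaders(message_body) on older Pythons:
        # intercept the header write that endheaders() performs, then send
        # headers and body together so Nagle's algorithm cannot delay the
        # small body packet. The function name is illustrative.
        captured = []
        original_send = conn.send
        conn.send = captured.append      # collect instead of transmitting
        try:
            conn.endheaders()            # keeps httplib's state machine happy
        finally:
            conn.send = original_send
        conn.send(''.join(captured) + body)

    conn = httplib.HTTPConnection('localhost', 5984)  # illustrative host/port
    conn.putrequest('POST', '/benchmark')
    conn.putheader('Content-Type', 'application/json')
    conn.putheader('Content-Length', '2')
    send_headers_and_body(conn, '{}')
    print conn.getresponse().status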

Original comment by matt.goo...@gmail.com on 20 Oct 2011 at 1:31

GoogleCodeExporter commented 8 years ago
Just in case anyone comes across this issue again ... you still need to set the 
nodelay option for the CouchDB server to get good performance, e.g.

  [httpd]
  socket_options = [{nodelay, true}]

Original comment by matt.goo...@gmail.com on 20 Oct 2011 at 1:38