mickorz / rtmplite

Automatically exported from code.google.com/p/rtmplite

CPU charge question #31

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
- siprtmp takes about 2% of CPU without calls and about 10% with one call on a 
Nocona 2x Xeon.
This is a little weird since there is no audio/video encoding.
Is this the effect of the RTMP-to-RTP loop?

Original issue reported on code.google.com by Boophone...@gmail.com on 8 Feb 2011 at 11:08

GoogleCodeExporter commented 9 years ago
I tried to track down this issue too. The CPU load is unnaturally high. I did 
profiling on siprtmp. The load is located in the multitasked socket reading. In 
my humble opinion the problem is located in the p2p-sip package, in file 
rfc3550.py, class Network, function receiveRTP, at the line:

data, remote = yield multitask.recvfrom(sock, self.maxsize)

The RTP reading there is unbuffered, and the multitasking layer is firing reads 
much too often. I thought that writing a buffer for RTP, as is done in the RTMP 
part, would fix the problem, but it seems impossible to make this work well in 
the current architecture, due to these problems:

- in multitasking, reading more than one DATAGRAM packet per task, when packets 
arrive about one per 20 ms, would block the task machine, causing other RTP 
read tasks to hang, which would lead to bad audio quality in the live stream, 
building up quickly with the number of connections.

- if someone somehow solved the multitask problem, a new issue appears. Python 
sockets lack a single-packet read function (I found a plugin that can work 
around this problem). So it is possible to read a fragmented packet while 
buffering (given some odd audio quality issues I had, I suspect this happens 
even now without buffering, but I haven't had time to prove it). An option is 
to write a payload parser, but it would depend on the stream payload type and 
would be complex.

- one could think this can be fixed by refusing to fire a read task more often 
than every X milliseconds, where X is lower than 20, by checking the delay 
since the last read. But this strategy will fail due to the fragmented-packet 
problems above, which should be fixed first.
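For reference, the fragmentation concern can be tested in isolation. A minimal standalone sketch (plain stdlib, not siprtmp code) shows that recvfrom() on a SOCK_DGRAM socket returns exactly one datagram per call, never a concatenation of several packets:

```python
import socket

# Standalone demo (not from siprtmp): UDP preserves message boundaries,
# so each recvfrom() call returns at most one whole datagram.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))  # let the OS pick a free port
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
addr = rx.getsockname()

tx.sendto(b"packet-1", addr)
tx.sendto(b"packet-2", addr)

first, _ = rx.recvfrom(65536)   # returns only the first datagram
second, _ = rx.recvfrom(65536)  # the second needs its own call
print(first, second)            # b'packet-1' b'packet-2'
```

A datagram can, however, be silently truncated if the buffer passed to recvfrom() is smaller than the packet, which may be what the quality issues above point at.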

Original comment by Lukasz.P...@gmail.com on 16 Mar 2011 at 12:18

GoogleCodeExporter commented 9 years ago
it seems hard to resolve, since if you buffer audio, it will automatically 
create latency, and in VoIP latency is very problematic.
Maybe converting the Python to C or C++ would reduce the CPU 
(http://shed-skin.blogspot.com can be an alternative, but it lacks important 
functions). Maybe Kundan will find a way to resolve this issue.

Original comment by pratn...@gmail.com on 16 Mar 2011 at 4:32

GoogleCodeExporter commented 9 years ago
also, there is maybe something more interesting:
http://cython.org/
it helps use C extensions in Python

Original comment by pratn...@gmail.com on 16 Mar 2011 at 5:03

GoogleCodeExporter commented 9 years ago
yes, buffering is a pain due to latency. In my tests (a two-way Speex-16kHz 
VideoPhone conversation) switching off RTP reading reduced the load 3 times. 
So RTP reading is probably responsible for about 66% of the load. It would be 
nice if somebody could criticize my hypothesis about the source of the CPU load.

Original comment by Lukasz.P...@gmail.com on 17 Mar 2011 at 12:21

GoogleCodeExporter commented 9 years ago
I read my notes from profiling a moment ago. During the profiling I noted that 
select.select on RTP sockets in the multitasking layer (multitask.py) is 
responsible for the CPU load when reading RTP.

Original comment by Lukasz.P...@gmail.com on 17 Mar 2011 at 1:02

GoogleCodeExporter commented 9 years ago
nice, so did you find an alternative to select.select?
I'm not a Python expert, sorry...

Original comment by pratn...@gmail.com on 17 Mar 2011 at 6:45

GoogleCodeExporter commented 9 years ago
It seems that too many calls to select.select due to improper multitasking can 
lead to a similar CPU overload too. Either multitasking behaves differently 
with datagrams, or RTP reading triggers a bug in multitasking, like poll-style 
select.select behavior when the timeout is 0.0.
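The timeout-0.0 behavior mentioned here is easy to demonstrate in isolation (a standalone sketch, not siprtmp code): with a zero timeout select() returns immediately, so a tight loop around it spins the CPU, whereas a positive timeout blocks in the kernel:

```python
import select
import socket
import time

# A socket with no pending data, to observe select() timeout behavior.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))

t0 = time.monotonic()
r, w, x = select.select([sock], [], [], 0.0)   # non-blocking poll
poll_elapsed = time.monotonic() - t0            # returns ~immediately

t0 = time.monotonic()
select.select([sock], [], [], 0.05)             # blocks up to 50 ms
wait_elapsed = time.monotonic() - t0

print(r, poll_elapsed, wait_elapsed)
```

If a scheduler bug computes a 0.0 timeout whenever any task is runnable, every iteration becomes the first variant and the process busy-polls.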

Original comment by Lukasz.P...@gmail.com on 18 Mar 2011 at 9:39

GoogleCodeExporter commented 9 years ago
I am not a Python expert, but maybe together we will boil down some solution in 
this discussion. All my attempts to reduce the load have failed. The only thing 
I can recommend for now is to reduce RTP reading as much as possible.

Original comment by Lukasz.P...@gmail.com on 18 Mar 2011 at 9:45

GoogleCodeExporter commented 9 years ago
ok, I think you're right, since you have some proof of what can cause the 
load.
I will try to google a select.select alternative. Hopefully Kundan, the 
author of rtmplite, will also find something.
I will post here if I find anything interesting.
good luck

Original comment by pratn...@gmail.com on 18 Mar 2011 at 5:19

GoogleCodeExporter commented 9 years ago
found this blog with a Python chat application:
http://code.activestate.com/recipes/531824-chat-server-client-using-selectselect/
in the comments someone complains of 60% CPU load for this simple Python chat.
I think we have enough proof to say that select.select is the problem

Original comment by pratn...@gmail.com on 18 Mar 2011 at 5:28

GoogleCodeExporter commented 9 years ago
this link is also interesting:
http://twistedmatrix.com/documents/current/core/howto/choosing-reactor.html#auto1

Original comment by pratn...@gmail.com on 18 Mar 2011 at 5:47

GoogleCodeExporter commented 9 years ago
maybe the solution is here:
http://www.artima.com/weblogs/viewpost.jsp?thread=230001

Original comment by pratn...@gmail.com on 18 Mar 2011 at 6:11

GoogleCodeExporter commented 9 years ago
it seems that the Twisted modules are the solution for
asynchronous I/O tasks in Python:

http://wiki.python.org/moin/Twisted-Examples

only 5 lines of code to create a proxy server, impressive!

Original comment by pratn...@gmail.com on 18 Mar 2011 at 6:34

GoogleCodeExporter commented 9 years ago
I feel one good solution would be to move the media transport processing to a 
separate C/C++ module and load it from Python for high-performance servers. It 
can then use other techniques (epoll, poll, etc.) and keep the media path in 
C/C++. This would also give an opportunity to implement multi-threaded media 
transport processing, which is currently not easily possible with multitask.py.

If I get support for working on this, I can attempt it. Otherwise, this will 
have to wait until I find someone (e.g. a student) to work on it. A friend 
suggested CPython.

Original comment by kundan10 on 20 Mar 2011 at 2:41

GoogleCodeExporter commented 9 years ago
After reading your post above I checked the poll/epoll docs. poll/epoll is more 
efficient than select.select due to better scaling with the number of sockets. 
It seems possible to implement select.poll/select.epoll support in the 
multitask.py library in pure Python. This kind of approach is described in the 
Python documentation. Is it wise to implement it that way to gain some CPU load 
improvement? Should I try? What is your experience with 
select.poll/select.epoll in Python? Please advise.
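The poll-based approach under discussion can be sketched with the plain stdlib (an illustration only, not the actual multitask.py integration): descriptors are registered once, then poll() waits for readiness, scaling better than select() as the socket count grows:

```python
import select
import socket

# Standalone sketch: one UDP receiver watched via select.poll().
# Note: select.poll is Unix-only; select.epoll is Linux-only.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"rtp", rx.getsockname())

poller = select.poll()
poller.register(rx.fileno(), select.POLLIN)  # register once, reuse
events = poller.poll(1000)                   # timeout in milliseconds
data, _ = rx.recvfrom(2048)
print(events, data)
```

The registration is persistent across calls, which is exactly where the per-request register/unregister pattern discussed below could eat into the gain.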

Original comment by Lukasz.P...@gmail.com on 20 Mar 2011 at 9:17

GoogleCodeExporter commented 9 years ago
yes, poll/epoll are more efficient, but that means choosing Unix or Windows... ;)
epoll works only on Unix, but for me that's not a problem. I think for now 
changing select.select to epoll is the right instant solution... Kundan?

Original comment by pratn...@gmail.com on 20 Mar 2011 at 9:34

GoogleCodeExporter commented 9 years ago
Multitask registers/unregisters sockets per read/write request, so there would 
be a need to register/unregister the socket with poll on every read/write 
request, so it could be slower than expected. Can you estimate this additional 
load?

Original comment by Lukasz.P...@gmail.com on 20 Mar 2011 at 5:51

GoogleCodeExporter commented 9 years ago
if it makes the CPU load smaller, yes.
I never compared them.

Original comment by pratn...@gmail.com on 20 Mar 2011 at 6:04

GoogleCodeExporter commented 9 years ago
I finished my second look at the CPU load. Tests show that the select.select() 
strategy isn't the source of the problem. I was able to reduce the 
select.select() impact to 25% (from 66% previously), but it reduced the load by 
only 25%. I did it by altering multitask to check for IO no more than 100 times 
per second. The decrease was only 25% because the share of time.sleep rose to 
39% CPU (many short 0.01 s waits due to multitask.sleep(0.01) in rtmp.py). 
Moreover, I noticed that without the multitask modification, adding more waits 
changes neither load nor quality. It seems that the whole multitasking is 
unbalanced in siprtmp/multitask. I suspect that multitask does not work as 
expected, and that siprtmp was built on assumptions which are not met by 
multitask.

Original comment by Lukasz.P...@gmail.com on 24 Mar 2011 at 2:20

GoogleCodeExporter commented 9 years ago
interesting, so it means that select.select is one problem among others...
Anyway, the multitask library has not been updated since 2007, and the author 
(I wrote to him) has definitely abandoned maintenance of the project. So I 
think changing multitask to another async solution (Twisted?) would be a 
priority if siprtmp wants to evolve.

Original comment by pratn...@gmail.com on 24 Mar 2011 at 2:37

GoogleCodeExporter commented 9 years ago
I remember from my PHP experience with multithreaded sockets and CPU that a 
socket function worked very well, with less than 1% CPU load, when I used NULL 
as the timeout parameter; with any number instead, the CPU was at 99% all the 
time...
maybe it can help

Original comment by pratn...@gmail.com on 24 Mar 2011 at 2:41

GoogleCodeExporter commented 9 years ago
In my opinion, decreasing the elementary select.select execution time (as a 
poll implementation would) will not help, because the gained time will be 
consumed by other components, as in my tests.

Original comment by Lukasz.P...@gmail.com on 24 Mar 2011 at 5:03

GoogleCodeExporter commented 9 years ago
ok, how about NULL as the timeout? Will Python accept this value for the 
timeout?

Original comment by pratn...@gmail.com on 24 Mar 2011 at 5:08

GoogleCodeExporter commented 9 years ago
I did some fixes in the code in SVN r60 to reduce single call CPU by about 55%. 
The changes are tested only for siprtmp.py call, so may break other stuff.

Following are the changes and corresponding improvements:

0. Single call was taking 10-12% CPU. Even without a call it took 1.7-1.8%.

1. Changed rtmp.py to use multitask.Queue instead of waiting for 0.01 s in the 
write method. This was earlier marked as a TODO item. This saves idle-mode CPU, 
so when not in a call, it now takes 0.0%. In a call, it reduces to about 8.5%.

2. Changed multitask.py to handle all the tasks before going back to the 
io-waits, which do the select() call. This reduces the number of select calls. 
After this, a call takes 4.6-4.8% CPU. Now every 20 ms, select gets called 
about twice instead of 10-20 times, which is right.

If I ignore incoming RTP packets after receiving them, a call takes 2.5-2.6%, 
so about 2.1-2.3% is spent processing the RTP->RTMP media path. It doesn't look 
like we can do more optimization in Python code using multitask. Some 
alternatives are:
1. Move the media path into a C module. Given the way rtmp.py is written, this 
is tricky.
2. I found gevent to be pretty easy to use, so perhaps multitask can be 
replaced by gevent (after a lot of modification, of course).
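The idea behind change 1 can be sketched with stdlib primitives (an illustration under the assumption that multitask.Queue behaves like a blocking queue; the actual API differs): a blocking queue wakes the writer only when data actually arrives, instead of polling on a 0.01 s sleep:

```python
import queue
import threading
import time

# Standalone sketch of queue-driven wakeup vs. sleep-polling.
q = queue.Queue()

def producer():
    # Simulates an RTMP message becoming available after some delay.
    time.sleep(0.05)
    q.put(b"rtmp-message")

threading.Thread(target=producer).start()

# Blocking get(): zero CPU while idle and an immediate wakeup on data,
# unlike a `while True: time.sleep(0.01); check()` loop that burns CPU
# and adds up to 10 ms of latency per message.
item = q.get(timeout=1.0)
print(item)
```

This is why the idle-mode CPU dropped to 0.0%: with no traffic, nothing wakes up at all.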

Original comment by kundan10 on 28 Mar 2011 at 9:35

GoogleCodeExporter commented 9 years ago
Great, I reviewed it a moment ago. I will test it and give feedback here.

I suspect that change 2 leads to a freeze bug if you have non-IO generators, 
because there will always be some new task as a result of task-queue 
processing (I suspect that the 'for' was there to prevent this from 
happening). So I propose to fire handle_io_waits no less often than some 
threshold (e.g. 0.01 s) when the calculated timeout equals 0.0 (i.e. there are 
some tasks).

I suspect that in general the parseMessages generator in rtmp.py yields to the 
task manager too often. This leads to task overload in the task manager. In 
general, deeper profiling is tricky because time.sleep() and select.select() 
somehow cheat the cProfile profiler into reporting that waits consume CPU. RTP 
IO generates 0.01 s waits due to the lack of RTP buffering.

Original comment by Lukasz.P...@gmail.com on 28 Mar 2011 at 10:08

GoogleCodeExporter commented 9 years ago
excellent Kundan, I will try your changes today. As Lukasz said, do you think 
that creating a very small buffer (e.g. 20 ms) would decrease CPU processing?

Original comment by pratn...@gmail.com on 28 Mar 2011 at 3:32

GoogleCodeExporter commented 9 years ago
I updated siprtmp.py and rtmp.py, and tested siprtmp.py. No errors were found.

Original comment by Lukasz.P...@gmail.com on 28 Mar 2011 at 3:51

GoogleCodeExporter commented 9 years ago
updated and tested the CPU, excellent work Kundan! 0.0% without a call,
one call now represents 2% max, which is amazing. BUT, unfortunately I get
no audio on either side, even though I can see the RTP and RTMP flow in the 
debug output without errors,
and no error on the RTMP side either, no exception in siprtmp. I'm stuck :(

Original comment by pratn...@gmail.com on 29 Mar 2011 at 7:40

GoogleCodeExporter commented 9 years ago
I tested the changes in multitask.py too. The cumulative decrease of load in my 
test is only about 25%. The approach I proposed in comment 25 gives the same 
results. I tested with two bidirectional connections from VideoPhone.swf 
(Speex 16 kHz) on one PC to a FreeSWITCH conference on a remote server (network 
ping was about 1 ms; voice was provided and received via headphones with a 
mic). I did cProfile profiling; the results are attached to this comment. Maybe 
it can help here.

Original comment by Lukasz.P...@gmail.com on 29 Mar 2011 at 10:11

Attachments:

GoogleCodeExporter commented 9 years ago
In the test I used an RTP stream with one 20 ms frame per packet.

Original comment by Lukasz.P...@gmail.com on 29 Mar 2011 at 10:25

GoogleCodeExporter commented 9 years ago
I analyzed the load from the attachment. Another global load improvement would 
be to reduce yielding to a minimum. Every 'yield' generates a call into 
multitask, causing task-processing overhead. FYI: cProfile measures wall time 
spent in a function, which is not equal to CPU load.
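This profiler caveat can be demonstrated standalone (an illustration, not the attached profile): cProfile records wall-clock time, so a function that merely sleeps shows a large cumulative time even though it uses almost no CPU:

```python
import cProfile
import pstats
import time

# A function that consumes wall time but essentially no CPU.
def idle():
    time.sleep(0.2)

pr = cProfile.Profile()
pr.enable()
idle()
pr.disable()

# pstats.Stats keys are (file, line, funcname); value index 3 is cumtime.
stats = pstats.Stats(pr)
cum = {key[2]: value[3] for key, value in stats.stats.items()}
print(cum.get("idle"))  # ~0.2 s "spent" in idle(), despite the CPU being idle
```

So a profile showing time in select.select() or time.sleep() is mostly blocked time, which is why profiling this code needs the kind of careful interpretation described above.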

Original comment by Lukasz.P...@gmail.com on 29 Mar 2011 at 12:31

GoogleCodeExporter commented 9 years ago
apparently Kundan added new yields, especially for the call() RTMP-side 
functions.
BTW, the result is better, so I'm not sure we can reduce CPU further in Python.
On my side I'm still stuck making audio work with a server-side NetConnection 
stream.
no error message helps me..

Original comment by pratn...@gmail.com on 29 Mar 2011 at 4:17

GoogleCodeExporter commented 9 years ago
Hi pratn...@gmail.com, are there other changes in your version of 
rtmplite/p2p-sip? Could you send me the zip of your files to kundan10@gmail.com 
so that I can try it out too.

Hi Lukasz.P...@gmail.com, yes, I added more yields because after removing the 
sleep of 0.01 in rtmp.py and using multitask.Queue, a lot of other functions 
needed to be changed to generators.

Also, just to be sure: put rtmplite before p2p-sip in PYTHONPATH, otherwise it 
will pick up the wrong multitask.py. The rtmplite/multitask.py is the right 
one, but there is another multitask.py in p2p-sip.

Original comment by kundan10 on 29 Mar 2011 at 5:53

GoogleCodeExporter commented 9 years ago
Yes, the 'yield's introduced to the code with multitask.Queue must stay. As I 
said before, I suspect that the parseMessages generator (in rtmp.py) yields to 
the task manager too often. This leads to task overload in the task manager. If 
there is data in the RTMP buffer, then yield should not be invoked. I am 
testing a workaround for it and will send details tomorrow.

Unfortunately, the logs show that I am using rtmplite/multitask.py.

Original comment by Lukasz.P...@gmail.com on 29 Mar 2011 at 6:39

GoogleCodeExporter commented 9 years ago
Kundan, 
- does it mean that PYTHONPATH has to have the path of rtmplite AND p2p-sip?
- OK, I'll send you my hacks. I needed, in fact, to make siprtmp compatible 
with the server-side NetConnection object, which is slightly different from the 
client one. I don't know if it's useful to implement yet, since I get no audio 
on the last revision (r60); I need to figure it out.
Lukasz, it seems that you know the problem deeply, and your help in this 
project would be precious...

Original comment by pratn...@gmail.com on 29 Mar 2011 at 11:57

GoogleCodeExporter commented 9 years ago
ok, I completely messed up my own update; I had used an old revision.
After inserting the changes manually, it finally works again.
without calls siprtmp is at 0.0%; with one call, max CPU is at 5.6%, which, as 
Lukasz found, is about 25% better than the previous version.
I noticed more stable audio quality with fewer glitches.

Original comment by pratn...@gmail.com on 30 Mar 2011 at 8:43

GoogleCodeExporter commented 9 years ago
There was no time for the 'yield' issue today; I will get back to it soon.

Original comment by Lukasz.P...@gmail.com on 30 Mar 2011 at 6:27

GoogleCodeExporter commented 9 years ago
Ok, I wrote a patch for rtmp.py trunk, attached to this message, which reduces 
reactor usage when parsing the RTMP stream. It is only a quick fix. In general, 
the RTMP parsing approach should be reorganized to reduce reactor usage in a 
more elegant way.

Original comment by Lukasz.P...@gmail.com on 3 Apr 2011 at 3:44

Attachments:

GoogleCodeExporter commented 9 years ago
thanks Lukasz, I'm trying it now.

Original comment by in...@boophone.com on 3 Apr 2011 at 4:20

GoogleCodeExporter commented 9 years ago
I had limited resources for tests, so sorry for any typos :)

Original comment by Lukasz.P...@gmail.com on 3 Apr 2011 at 4:26

GoogleCodeExporter commented 9 years ago
My approach to multitask.py is a little different from the one in trunk; it is 
attached as a patch to this message. The 0.01 constant is exactly 0.01 and not 
more, because reading more than one RTP packet at once can cause RTP parsing to 
fail, since the read handler cannot handle more than one packet per read.

Original comment by Lukasz.P...@gmail.com on 3 Apr 2011 at 4:32

Attachments:

GoogleCodeExporter commented 9 years ago
great. did you see a performance difference between r68 and your patches?

Original comment by in...@boophone.com on 3 Apr 2011 at 5:04

GoogleCodeExporter commented 9 years ago
just tried r68 with and without Lukasz's patch.
the CPU for one call is around 10-11%, which is 2 times more than the previous 
version
(5.9-7%)
thanks

Original comment by in...@boophone.com on 3 Apr 2011 at 7:01

GoogleCodeExporter commented 9 years ago
sorry, I correct myself: r68 is 2 to 3% greater than r60

Original comment by in...@boophone.com on 3 Apr 2011 at 8:02

GoogleCodeExporter commented 9 years ago
OK. Is there any performance change for you comparing r68 and r68+my_patches? 
My patches for the 2 files are independent; you can apply one at a time and 
test. What are the results?

Original comment by Lukasz.P...@gmail.com on 4 Apr 2011 at 6:30

GoogleCodeExporter commented 9 years ago
it seems that the unpatched version consumes 1-2% less.
I will try one file at a time tomorrow
thanks

Original comment by in...@boophone.com on 4 Apr 2011 at 6:37

GoogleCodeExporter commented 9 years ago
Hi Lukasz,

results:
- the multitask.py patch increases CPU by about 1%, and there's a hangup 
problem (calls don't hang up)
- the rtmp.py patch gives almost the same CPU result as unpatched (maybe 
sometimes 0.3% less with your patch), and no hangup problem at all.

Original comment by in...@boophone.com on 4 Apr 2011 at 6:24