peerigon / phridge

A bridge between node and PhantomJS
The Unlicense
519 stars 50 forks source link

Accumulating slow down on sequential Page.run() #12

Closed frank-weindel closed 10 years ago

frank-weindel commented 10 years ago

I'm trying to create an application that renders PNG files of many webpages one at a time. I'm doing it with one instance of phantom using phridge. For each webpage I want to render I create a new page via phridge's phantom.createPage() and then call run() on it. The run() callback (which runs in phantomjs land) simply does a page.open() and renders the PNG before returning.

I'm noticing that with every subsequent run() call I do, it takes more and more time for the phantomjs function to start executing.

The severity of the accumulating delay is related to how much data we're pushing through via the run() call, which in my case is about 500kb. This causes about an additional 2 seconds for each call (i.e. the first one takes 0 seconds, then 2 seconds, then 4 seconds, and so on) However, even without all of that data there is still an accumulation albeit a lot smaller (i.e. 0 seconds, 0.1 seconds, 0.2 seconds, 0.3 seconds, etc)

I'm thinking it has to do with the plumbing in writing to stdin. As in its getting clogged up and hence slowed down. When I add about a 4 second delay between subsequent run calls, the accumulation goes away. So my thinking is that some sort of garbage collection is allowed to run in this time.

Any ideas?

jhnns commented 10 years ago

Mhmm interesting. There are two things I would like to know:

frank-weindel commented 10 years ago

Thanks for the quick reply. Commenting out the line made no difference, however, downgrading to 0.1.5 did fix it. It seems to be working well. Are there any major reasons to not run with 0.1.5? I'd also be curious about what is causing the slow down if you know and feel like explaining it. :)

jhnns commented 10 years ago

I switched from HTTP to stdin/stdout with 1.x which is more secure and - at least in my unit tests - way faster.

I think you're right by blaming the GC for this delay. Unfortunately PhantomJS only allows synchronous reading from stdin. I'm using setTimeout, 0 to read the next line of stdin to allow asynchronous callbacks to be executed. If I'd just use a while() loop, no callback would ever be executed.

Can you upgrade back to 1.x and gradually increase that timeout? Maybe 0 is to low and PhantomJS' GC can't catch up.

aju commented 10 years ago

Hi I added pull request that should fix this issues.

aju commented 10 years ago

I still got some issues with heartbeat. For example when I run async code in phantom but some operation on phantomjs is taking long time to run(blocking phantomjs event loop) - heartbeat messages are cluttering stdin.

Maybe it would be better to make it in form of ping-pong instead of heartbeat?

Other solution would be using something like this to communicate: https://github.com/ariya/phantomjs/issues/10980#issuecomment-21855869

What do you think?

jhnns commented 10 years ago

What exactly do you mean by "cluttering stdin"? Does it block phantomjs in any way? Or do you read something else from stdin?

aju commented 10 years ago

I mean that while phantom is busy with sync opperation node process is sending to it heartbeat messages. After finishing this sync operation phanton needs to read all of the sent heartbeat messages (it takes quite a lot of time depending on how long phantom was bussy and unable to read stdin messages)

frank-weindel commented 10 years ago

Just an FYI. The latest version w/ @aju's PR seems to eliminate the accumulation issue. Thanks a lot.

jhnns commented 10 years ago

@aju ping/pong would probably be better... but I'm surprised that it is a problem at all. That's just a matter of reading a few bytes...

aju commented 10 years ago

Its not about a size :). When phantomjs get busy node is still sending heartbeat messages. Those messages are accumulating (because phantom is busy with some operation and can't handle them). All of those messages needs to be read from stdin one by one in the loop. I checked and "setTimeout(lopp, 0);" is triggered every ~8-10ms. So when phantom was busy for lets say 100ms we got 10 messages to read - this will take around 80ms-100ms if I'm correct.

I hope you now get my point.

That's why I think we should replace it with ping-pong version. I'm already experimenting with this.

jhnns commented 10 years ago

Yep, you're right. Thanks for the explanation!