ruipgil / scraperjs

A complete and versatile web scraper.
MIT License
3.71k stars 188 forks source link

Added notes and first note to docs #46

Closed chmac closed 9 years ago

chmac commented 9 years ago

Spent quite a few hours pouring through the code to figure this out. Thought it might be helpful for other folks, so suggesting adding it to the docs.

ruipgil commented 9 years ago

I don't feel that the Readme needs that level of detail.

Besides, if someone wants to do something more advances with requests he/she should dive into the code, where it's pretty clear.

chmac commented 9 years ago

@ruipgil We spent maybe half a day or more trying to figure out why some stuff wasn't working. Eventually today we realised that if we load the URL directly into the phantomjs instance, then the content renders as in the browser, but if we inject the source of the setContent() it does not. One specific example we had issues with was a JSONP callback. The issue today was less clear, it was related to an XHR POST request not being triggered, but no idea why.

@ruipgil Obviously it's up to you if you want to merge the doc change or not. My only suggestion would be, is there any harm in being more explicit? :-)

ruipgil commented 9 years ago

@chmac setContent() should render as in the web browser. XHR POST's should also be triggered, but be careful, since phantomjs makes XHR POST's internally, i.e. without request. So user-agent headers will be provided by phantomjs, and the server may be dropping phantomjs requests.

You're not wrong about making it more explicit. But I feel that the README is already too long, and that is a advanced detail. I'll merge, but I might move the information around.

chmac commented 9 years ago

Great. Reorganising the readme could make sense.

You interested in a reproduction of the issue here seeing with setContent()?

ruipgil commented 9 years ago

Sure, it's always good to know problems and limitations.