napoler / ganon

Automatically exported from code.google.com/p/ganon

Some sites not being loaded by file_get_dom #48

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What will reproduce the problem?
Grabbing the DOM of some sites just doesn't seem to work.  Here's one that 
fails for me: http://www.hisradio.com

What is the expected output? What do you see instead?
I expect it to grab the DOM, like when I use http://www.google.com

Which version are you using?
Latest, using php 5.3

Please provide any additional information below.
I'm thinking there are server settings that disallow php access, possibly in a 
robots.txt file or something along those lines.  Am I missing something?

Original issue reported on code.google.com by apt94je...@gmail.com on 31 Aug 2013 at 3:22

GoogleCodeExporter commented 9 years ago
You need to pass a custom stream context or fetch the source via cURL. Some 
servers reject the connection when the User-Agent header is missing or looks 
wrong (some parser?), probably as a simple way to stop the site source being 
fetched with basic functions, or to guard against flooding/DDoS.

Try this:

$opts = array(
  'http' => array(
    //'method'   => "GET",
    // Forward the visitor's User-Agent (only set when run behind a web server)
    'user_agent' => $_SERVER['HTTP_USER_AGENT']
  )
);

// Parse the hisradio.com home page into a DOM
$html = file_get_dom('http://www.hisradio.com/', true, false,
  stream_context_create($opts));
echo $html;

This sends your own user agent in the request to hisradio.com. Of course, this 
assumes your PHP version is >= 5.0.0.
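
If you would rather go the cURL route mentioned above, something along these lines should work. This is only a rough sketch: fetch_html_with_user_agent() is a made-up helper name (it is not part of ganon), the user agent string and the timeout are arbitrary choices, and it assumes the cURL extension is enabled.

```php
<?php
// Hypothetical helper, not part of ganon: download a page while sending
// an explicit User-Agent header, and return the response body as a string.
function fetch_html_with_user_agent($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // the header the server checks for
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // arbitrary timeout, in seconds

    $body  = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . $error);
    }
    return $body;
}

// Usage (assumes ganon.php is on your include path):
// require_once 'ganon.php';
// $source = fetch_html_with_user_agent(
//     'http://www.hisradio.com/',
//     'Mozilla/5.0 (compatible; ExampleFetcher/1.0)' // assumed UA string
// );
// $html = str_get_dom($source);
// echo $html;
```

Since the markup is already downloaded, it goes through str_get_dom() instead of file_get_dom(). That also lets you check for a failed request before parsing.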

Original comment by bartek12...@gmail.com on 3 Feb 2014 at 11:42