A library for scraping free proxy lists, written in PHP
composer require vantoozz/proxy-scraper:~3 guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools
<?php declare(strict_types=1);
use function Vantoozz\ProxyScraper\proxyScraper;
require_once __DIR__ . '/vendor/autoload.php';
foreach (proxyScraper()->get() as $proxy) {
echo $proxy . "\n";
}
This is version 3 of the library. For version 2, please check the v2 branch; for version 1, please check the v1 branch.
The library requires a PSR-18 compatible HTTP client. To use the library, you have to install one of them, e.g.:
composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7
All available clients are listed on Packagist: https://packagist.org/providers/psr/http-client-implementation.
Then install proxy-scraper library itself:
composer require vantoozz/proxy-scraper:~3
The simplest way to start using the library is the proxyScraper() function, which instantiates and configures all the scrapers.
Please note that the auto-configuration function requires the hanneskod/classtools dependency in addition to guzzlehttp/guzzle:~7 and guzzlehttp/psr7:
composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools
<?php declare(strict_types=1);
use function Vantoozz\ProxyScraper\proxyScraper;
require_once __DIR__ . '/vendor/autoload.php';
foreach (proxyScraper()->get() as $proxy) {
echo $proxy . "\n";
}
If not using auto-configuration, you will need an HTTP client. The library provides the guzzleHttpClient() function, which creates and configures the client:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\guzzleHttpClient;
use function Vantoozz\ProxyScraper\proxyScraper;
require_once __DIR__ . '/vendor/autoload.php';
$httpClient = guzzleHttpClient();
$scraper = proxyScraper($httpClient);
try {
echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
echo $e->getMessage() . "\n";
}
You can create your own HTTP client by implementing HttpClientInterface:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\HttpClient\HttpClientInterface;
use function Vantoozz\ProxyScraper\proxyScraper;
require_once __DIR__ . '/vendor/autoload.php';
$httpClient = new class implements HttpClientInterface {
/**
* @param string $uri
* @return string
*/
public function get(string $uri): string
{
// A stub: return the raw response body for the given URI
return "some string";
}
};
$scraper = proxyScraper($httpClient);
try {
echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
echo $e->getMessage() . "\n";
}
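As a more realistic sketch than the stub above, a custom client might fetch the URI with PHP's built-in file_get_contents. The timeout value and the RuntimeException here are illustrative choices, not part of the library's API:

```php
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\HttpClient\HttpClientInterface;

require_once __DIR__ . '/vendor/autoload.php';

// A minimal stream-based client; any real implementation only has to
// return the response body for a given URI as a string.
final class StreamHttpClient implements HttpClientInterface
{
    public function get(string $uri): string
    {
        $context = stream_context_create(['http' => ['timeout' => 5]]);
        $body = @file_get_contents($uri, false, $context);
        if (false === $body) {
            throw new \RuntimeException('Request failed: ' . $uri);
        }
        return $body;
    }
}
```

Such a client can then be passed to proxyScraper() exactly like the anonymous class above.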
Of course, you may manually configure the scraper and underlying HTTP client:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;
require_once __DIR__ . '/vendor/autoload.php';
$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());
foreach ($scraper->get() as $proxy) {
echo $proxy . "\n";
}
You can easily get data from many scrapers at once:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;
require_once __DIR__ . '/vendor/autoload.php';
$httpClient = guzzleHttpClient();
$compositeScraper = new Scrapers\CompositeScraper;
$compositeScraper->addScraper(new Scrapers\FreeProxyListScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\CoolProxyScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\SocksProxyScraper($httpClient));
foreach ($compositeScraper->get() as $proxy) {
echo $proxy . "\n";
}
Sometimes things go wrong. This example shows how to handle errors while getting data from many scrapers:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
require_once __DIR__ . '/vendor/autoload.php';
$compositeScraper = new Scrapers\CompositeScraper;
// Set exception handler
$compositeScraper->handleScraperExceptionWith(function (ScraperException $e) {
echo 'An error occurred: ' . $e->getMessage() . "\n";
});
// Fake scraper throwing an exception
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
public function get(): Generator
{
throw new ScraperException('some error');
}
});
// Fake scraper with no exceptions
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
public function get(): Generator
{
yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
}
});
//Run composite scraper
foreach ($compositeScraper->get() as $proxy) {
echo $proxy . "\n";
}
Will output
An error occurred: some error
192.168.0.1:8888
In the same manner, you may configure exception handling for the scraper created with the proxyScraper() function, as it returns an instance of CompositeScraper:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\proxyScraper;
require_once __DIR__ . '/vendor/autoload.php';
$scraper = proxyScraper();
$scraper->handleScraperExceptionWith(function (ScraperException $e) {
echo 'An error occurred: ' . $e->getMessage() . "\n";
});
Validation steps may be added:
<?php declare(strict_types = 1);
use Vantoozz\ProxyScraper\Exceptions\ValidationException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use Vantoozz\ProxyScraper\Validators;
require_once __DIR__ . '/vendor/autoload.php';
$scraper = new class implements Scrapers\ScraperInterface
{
public function get(): \Generator
{
yield new Proxy(new Ipv4('104.202.117.106'), new Port(1234));
yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
}
};
$validator = new Validators\ValidatorPipeline;
$validator->addStep(new Validators\Ipv4RangeValidator);
foreach ($scraper->get() as $proxy) {
try {
$validator->validate($proxy);
echo '[OK] ' . $proxy . "\n";
} catch (ValidationException $e) {
echo '[Error] ' . $e->getMessage() . ': ' . $proxy . "\n";
}
}
Will output
[OK] 104.202.117.106:1234
[Error] IPv4 is in private range: 192.168.0.1:8888
A Proxy object may have metrics (metadata) associated with it.
By default, a Proxy object has a source metric:
<?php declare(strict_types=1);
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;
require_once __DIR__ . '/vendor/autoload.php';
$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());
/** @var Proxy $proxy */
$proxy = $scraper->get()->current();
foreach ($proxy->getMetrics() as $metric) {
echo $metric->getName() . ': ' . $metric->getValue() . "\n";
}
Will output
source: Vantoozz\ProxyScraper\Scrapers\UsProxyScraper
Note: the examples above use Guzzle as the HTTP client.
Run the unit tests:
./vendor/bin/phpunit --testsuite=unit
Run the integration tests:
./vendor/bin/phpunit --testsuite=integration
Run the system tests:
php ./tests/systemTests.php
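The testsuite names used above imply a phpunit.xml along these lines. This is only a sketch; the suite directories are assumptions, and the actual file in the repository may differ:

```xml
<!-- Hypothetical phpunit.xml defining the two named test suites -->
<phpunit bootstrap="vendor/autoload.php">
    <testsuites>
        <testsuite name="unit">
            <directory>tests/unit</directory>
        </testsuite>
        <testsuite name="integration">
            <directory>tests/integration</directory>
        </testsuite>
    </testsuites>
</phpunit>
```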
The biggest difference from version 2 is the HTTP client configuration.
Instead of
$httpClient = new \Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient(
new \Http\Adapter\Guzzle6\Client(new \GuzzleHttp\Client([
'connect_timeout' => 2,
'timeout' => 3,
])),
new \Http\Message\MessageFactory\GuzzleMessageFactory
);
the client should now be instantiated like this:
$httpClient = \Vantoozz\ProxyScraper\guzzleHttpClient();
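Putting it together, a complete version-3 setup reduces to the same calls shown in the examples above: create the client, pass it to proxyScraper(), and iterate:

```php
<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\guzzleHttpClient;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

// The preconfigured Guzzle-based client replaces the manual
// Psr18HttpClient wiring that version 2 required.
$scraper = proxyScraper(guzzleHttpClient());

foreach ($scraper->get() as $proxy) {
    echo $proxy . "\n";
}
```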