spekulatius / PHPScraper

A universal web-util for PHP.
https://phpscraper.de
GNU General Public License v3.0
527 stars 74 forks source link

Handle edge case: "invalid url in base tag" #179

Open delamotte-pierrick opened 1 year ago

delamotte-pierrick commented 1 year ago

Add test and bugfix in case the href content of 'base' tag is a relative url.

The bug have been encounter on the url : https://www.gla.ac.uk/myglasgow/digitalaccessibility/

the current base tag contain : <base href="/myglasgow/digitalaccessibility/" />

image

Stack Trace and error message:

string(193) "The URL of the element is relative, so you must define its base URI passing an absolute URL to the constructor of the "Symfony\Component\DomCrawler\AbstractUriElement" class ("://" was passed)."

#0 /srv/app/vendor/spekulatius/phpscraper/src/UsesContent.php(492): Symfony\Component\DomCrawler\AbstractUriElement->__construct(Object(DOMElement), '://')
#1 /srv/app/vendor/spekulatius/phpscraper/src/PHPScraper.php(138): Spekulatius\PHPScraper\Core->linksWithDetails()
#2 /srv/app/src/Services/ScraperService.php(107): Spekulatius\PHPScraper\PHPScraper->__call('linksWithDetail...', Array)
#3 /srv/app/src/Services/AnalyzerService.php(56): App\Services\ScraperService->extractData(Object(Spekulatius\PHPScraper\PHPScraper))
#4 /srv/app/src/Services/AnalyzerService.php(28): App\Services\AnalyzerService->extractDataFromUrl('https://www.gla...', 'en', 'uk')
#5 /srv/app/src/Controller/HomeController.php(80): App\Services\AnalyzerService->analyzeUrl('https://www.gla...', Object(App\Entity\Language), 'uk')
#6 /srv/app/vendor/symfony/http-kernel/HttpKernel.php(163): App\Controller\HomeController->index(Object(Symfony\Component\HttpFoundation\Request))
#7 /srv/app/vendor/symfony/http-kernel/HttpKernel.php(74): Symfony\Component\HttpKernel\HttpKernel->handleRaw(Object(Symfony\Component\HttpFoundation\Request), 1)
#8 /srv/app/vendor/symfony/http-kernel/Kernel.php(184): Symfony\Component\HttpKernel\HttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)
#9 /srv/app/vendor/symfony/runtime/Runner/Symfony/HttpKernelRunner.php(35): Symfony\Component\HttpKernel\Kernel->handle(Object(Symfony\Component\HttpFoundation\Request))
#10 /srv/app/vendor/autoload_runtime.php(29): Symfony\Component\Runtime\Runner\Symfony\HttpKernelRunner->run()
#11 /srv/app/public/index.php(5): require_once('/srv/app/vendor...')
#12 {main}"

Is that possible for you to generate an url with an invalid base tag, that i can use instead of those of Glasgow University, please ?

Regards, Pierrick

spekulatius commented 1 year ago

Hello Pierrick,

Thank you for your PR and an interesting discovery!

For tests, I usually 'mock up' a small HTML page and add it to this test pages repo. This way the hosted examples remain in control and won't change unexpectedly. Could you prepare such a test page and tweak the test?

Let me know if any questions come up :)

Peter

delamotte-pierrick commented 1 year ago

Hey Peter,

I have added the file and maded the pull request here.

I'm not sure if the domain https://test-pages.phpscraper.de is the good one for the test, so just tell me if I need to change them

Have a nice day ! Pierrick

spekulatius commented 1 year ago

Hello Pierrick,

Thank you for the PR!

The host should be okay, no?

According to Mozilla relative paths are valid. I was wondering if we should return an absolute URL. Should we rename the tests accordingly and make sure it returns the absolute PATH (even if relative is provided)? PHPScraper returns absolute URLs for anything, even if relative paths are provided.

Cheers, Peter