spekulatius / PHPScraper

A universal web-util for PHP.
https://phpscraper.de
GNU General Public License v3.0

Spanish web content not displayed correctly: '?' is shown instead of the correct character #189

Open ElliotFer2000 opened 12 months ago

ElliotFer2000 commented 12 months ago

Spanish words with accents are not displayed properly: accented characters are replaced with a "?" character.

Why is this happening? How can I tell the scraper that I'm dealing with the Spanish language?

code:

$web = new \Spekulatius\PHPScraper\PHPScraper;

$web->go("https://www.marca.com");

return $web->outlineWithParagraphs;

I return the outline to the client in JSON format. The result I'm getting looks something like this:

[
    {
        "tag": "h2",
        "content": "Joao F?lix: \"El Bar?a siempre ha sido mi primera opci?n\""
    }
]

I have already tried to solve the problem by putting this at the beginning of the script: setlocale(LC_ALL, 'es_AR')

F?lix and opci?n are not displayed properly in the response. They should be Félix and opción, but ? is shown instead of é and ó.

When I return the result of this function, the characters display correctly:

utf8_encode(file_get_contents("https://www.marca.com"))

I have also tried requesting the document with file_get_contents, encoding the result, and passing it to the $web->setContent function. This way I get the expected output.

$web = new PHPScraper;
$rawPageContent = utf8_encode(file_get_contents("https://www.marca.com"));
$web->setContent("https://www.marca.com", $rawPageContent);

spekulatius commented 11 months ago

Hello @ElliotFer2000

It looks like the fetching isn't using the correct encoding. I managed to confirm the issue. Have you checked how this could be resolved?

Peter
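
As a possible direction for the setContent workaround described earlier in the thread, here is a minimal sketch of converting the fetched bytes to UTF-8 before handing them to the scraper. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions for illustration: the thread only shows that utf8_encode (which treats its input as ISO-8859-1) produced correct output for this site, and utf8_encode itself is deprecated since PHP 8.2. The sketch needs the mbstring extension and is not part of PHPScraper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHPScraper).
 * Returns $raw unchanged when it is already valid UTF-8; otherwise
 * converts it from $fallbackCharset, which is an assumed source encoding.
 */
function toUtf8(string $raw, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    return mb_convert_encoding($raw, 'UTF-8', $fallbackCharset);
}

// "Félix" as ISO-8859-1 bytes: é is the single byte 0xE9.
$latin1 = "F\xE9lix";

// After conversion, é becomes the two-byte UTF-8 sequence 0xC3 0xA9.
echo bin2hex(toUtf8($latin1)), PHP_EOL; // prints 46c3a96c6978
```

In the workaround above, this would replace the utf8_encode call, for example $web->setContent($url, toUtf8(file_get_contents($url))). Checking for valid UTF-8 first avoids double-encoding pages that are already served as UTF-8, which a bare utf8_encode call would garble.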