studioespresso / craft-scout

Craft Scout provides a simple solution for adding full-text search to your entries. Scout will automatically keep your search indexes in sync with your entries.
MIT License

Error: json_encode error: Malformed UTF-8 characters, possibly incorrectly encoded #326

Closed lukew-cogapp closed 2 months ago

lukew-cogapp commented 2 months ago

Hello,

We're using Craft CMS 4.8.9 and Scout 3.3.3.

Some of our entries are not indexing to Algolia and we get: Error: json_encode error: Malformed UTF-8 characters, possibly incorrectly encoded

Craft is fine saving these entries to the DB. Here's one of the entries in question (frontend): https://www.nts.org.uk/stories/vital-support-for-our-national-nature-reserves

Our assumption is that there's something in the content that isn't being sanitised. Our initial thought was the different single quote style, but we have an earlier version in the index that was fine:

{
  "id": 717598,
  "title": "Vital support for our National Nature Reserves",
  "uri": "stories/vital-support-for-our-national-nature-reserves",
  "url": "https://www.nts.org.uk/stories/vital-support-for-our-national-nature-reserves",
  "absoluteUri": "/stories/vital-support-for-our-national-nature-reserves",
  "slug": "vital-support-for-our-national-nature-reserves",
  "enabled": true,
  "archived": false,
  "dateCreated": 1662652319,
  "dateUpdated": 1663156682,
  "postDate": 1663664400,
  "authorId": 1,
  "authorName": "x",
  "sectionId": 7,
  "typeId": 12,
  "summaryText": "Support from players of People’s Postcode Lottery has enabled us to carry out important conservation work at our National Nature Reserves.",
  "introText": "Support from players of People’s Postcode Lottery has enabled us to carry out important conservation work at our National Nature Reserves.",
  "image": [
    {
      "image": "https://ntswebstorage01.blob.core.windows.net/nts-web-assets-production/general/Ellie_Owen_Senior_Seabird_Officer_0922.jpg",
      "altText": "A woman wearing a red helmet and red floatation jacket stands on a rocky ledge beside a sea inlet. Large rock stacks stand just off shore behind her.",
      "accreditation": null
    }
  ],
  "entryAuthor": null,
  "showPublicationDate": true,
  "mainContent": " We look after eight National Nature Reserves (NNRs) – St Abb’s Head, St Kilda, Ben Lawers, Glencoe, Staffa, Corrieshalloch Gorge, Mar Lodge Estate and Beinn Eighe (Torridon) – and we are excited to be able to carry out more conservation work at these places, thanks to support from players of People’s Postcode Lottery. This work will include more species monitoring, habitat restoration and regeneration efforts, all part of our charity’s conservation and sustainability measures set out in our new ten-year strategy, Nature, Beauty & Heritage for Everyone.\nSince 2012, players of People’s Postcode Lottery have raised over £1.7 million (awarded by the Postcode Earth Trust) to help us carry out vital work to protect Scotland’s heritage. This year’s generous support has enabled the delivery of a new biosecurity project. Heading up the project is Dr Ellie Owen, who joins the Trust in the newly created role of Senior Seabird Officer. The role has also been supported by Tim and Kim Allan, members of the Trust’s Patrons’ Club.Ellie is a top seabird scientist who specialises in puffins, seabird tracking, citizen science and offshore windfarm impacts on seabirds. The biosecurity project is part of our ‘Love for Nature’ project that aims to safeguard Scotland’s natural heritage by preventing plants and animals that are not usually part of sensitive island ecosystems from reaching them. Ellie’s work will coordinate both existing and new efforts to protect our islands. She will also set up a rapid response team in case any issues arise. Ellie will play a vital role across our three coastal and island NNRs – Staffa, St Kilda and St Abb’s Head – which are home to hundreds of thousands of seabirds each summer. 
However, a host of other important sites will also benefit, including Fair Isle, Canna, Mingulay, Berneray & Pabbay, parts of Unst & Yell, Burg on Mull, and the Murray Isles in the Solway Firth.\nEllie said: ‘I am delighted to be taking on the role of Senior Seabird Officer for a charity that values conservation and nature. Across the country, there are people who love to visit our sites and want to see seabirds thrive in their natural environment. It’s our job to help monitor and conserve these seabirds, and the islands and coastlines they inhabit.’ Philip Long OBE, Chief Executive of the National Trust for Scotland, said: \n‘In our ten-year strategy, we set out bold ambitions in caring for and preserving not only Scotland’s built heritage, but also its vast natural landscapes. There is so much more to the Trust than many people may be aware of, with hundreds of thousands of seabird habitats in our care, almost every type of flora and fauna, and abundant and varied sea life; so much of this can be found within our National Nature Reserves. By focusing more of our conservation efforts in these special locations, we’re both improving habitats and biodiversity and taking further steps in our charity’s e",
  "storiesHeader": [
    {
      "file": "https://ntswebstorage01.blob.core.windows.net/nts-web-assets-production/general/Ellie_Owen_Senior_Seabird_Officer_0922.jpg",
      "altText": "A woman wearing a red helmet and red floatation jacket stands on a rocky ledge beside a sea inlet. Large rock stacks stand just off shore behind her.",
      "accreditation": "Dr Ellie Owen, our new Senior Seabird Officer, will lead a new biosecurity project.",
      "fullWidth": true
    }
  ],
  "excludeFromSearch": null,
  "type": "article",
  "section": "stories",
  "objectID": "717598"
}

janhenckens commented 2 months ago

Could you share the full stack trace of the error, @lukew-cogapp?

lukew-cogapp commented 2 months ago

Hi @janhenckens, I'm afraid not at the moment. Once we have it, I'll reply here.

lukew-cogapp commented 2 months ago

Hi @janhenckens

Here's the full trace:

Stack trace:
#0 /var/www/html/vendor/algolia/algoliasearch-client-php/src/RetryStrategy/ApiWrapper.php(152): Algolia\AlgoliaSearch\RetryStrategy\ApiWrapper->createRequest('POST', Object(Algolia\AlgoliaSearch\Http\Psr7\Uri), Array, false)
#1 /var/www/html/vendor/algolia/algoliasearch-client-php/src/RetryStrategy/ApiWrapper.php(100): Algolia\AlgoliaSearch\RetryStrategy\ApiWrapper->request('POST', '/1/indexes/site...', Object(Algolia\AlgoliaSearch\RequestOptions\RequestOptions), Array, 30, Array)
#2 /var/www/html/vendor/algolia/algoliasearch-client-php/src/SearchIndex.php(298): Algolia\AlgoliaSearch\RetryStrategy\ApiWrapper->write('POST', '/1/indexes/site...', Array, Object(Algolia\AlgoliaSearch\RequestOptions\RequestOptions))
#3 /var/www/html/vendor/algolia/algoliasearch-client-php/src/SearchIndex.php(338): Algolia\AlgoliaSearch\SearchIndex->rawBatch(Array, Array)
#4 /var/www/html/vendor/algolia/algoliasearch-client-php/src/SearchIndex.php(183): Algolia\AlgoliaSearch\SearchIndex->splitIntoBatches('updateObject', Array, Array)
#5 /var/www/html/vendor/studioespresso/craft-scout/src/engines/AlgoliaEngine.php(52): Algolia\AlgoliaSearch\SearchIndex->saveObjects(Array)
#6 /var/www/html/vendor/studioespresso/craft-scout/src/jobs/MakeSearchable.php(30): rias\scout\engines\AlgoliaEngine->update(Object(Illuminate\Support\Collection))
#7 /var/www/html/vendor/yiisoft/yii2-queue/src/Queue.php(243): rias\scout\jobs\MakeSearchable->execute(Object(craft\queue\Queue))
#8 /var/www/html/vendor/yiisoft/yii2-queue/src/cli/Queue.php(147): yii\queue\Queue->handleMessage(4625957, 'O:30:"rias\\scou...', 300, 1)
#9 /var/www/html/vendor/craftcms/cms/src/queue/Queue.php(191): yii\queue\cli\Queue->handleMessage(4625957, 'O:30:"rias\\scou...', 300, 1)
#10 /var/www/html/vendor/craftcms/cms/src/queue/Queue.php(166): craft\queue\Queue->executeJob()
#11 [internal function]: craft\queue\Queue->craft\queue\{closure}(Object(Closure))
#12 /var/www/html/vendor/yiisoft/yii2-queue/src/cli/Queue.php(114): call_user_func(Object(Closure), Object(Closure))
#13 /var/www/html/vendor/craftcms/cms/src/queue/Queue.php(164): yii\queue\cli\Queue->runWorker(Object(Closure))
#14 /var/www/html/vendor/craftcms/cms/src/controllers/QueueController.php(82): craft\queue\Queue->run()
#15 /var/www/html/vendor/craftcms/cms/src/controllers/QueueController.php(103): craft\controllers\QueueController->actionRun()
#16 [internal function]: craft\controllers\QueueController->actionRetry()
#17 /var/www/html/vendor/yiisoft/yii2/base/InlineAction.php(57): call_user_func_array(Array, Array)
#18 /var/www/html/vendor/yiisoft/yii2/base/Controller.php(178): yii\base\InlineAction->runWithParams(Array)
#19 /var/www/html/vendor/yiisoft/yii2/base/Module.php(552): yii\base\Controller->runAction('retry', Array)
#20 /var/www/html/vendor/craftcms/cms/src/web/Application.php(341): yii\base\Module->runAction('queue/retry', Array)
#21 /var/www/html/vendor/craftcms/cms/src/web/Application.php(642): craft\web\Application->runAction('queue/retry', Array)
#22 /var/www/html/vendor/craftcms/cms/src/web/Application.php(303): craft\web\Application->_processActionRequest(Object(craft\web\Request))
#23 /var/www/html/vendor/yiisoft/yii2/base/Application.php(384): craft\web\Application->handleRequest(Object(craft\web\Request))
#24 /var/www/html/web/index.php(26): yii\base\Application->run()
#25 {main} {"memory":14680520,"exception":"[object] (InvalidArgumentException(code: 0): json_encode error: Malformed UTF-8 characters, possibly incorrectly encoded at /var/www/html/vendor/algolia/algoliasearch-client-php/src/RetryStrategy/ApiWrapper.php:253)"} 

We have also updated to 4.1.1, and the same issue persists.

lukew-cogapp commented 2 months ago

After some dirty debugging of the Algolia library, it looks like it's struggling with unmatched smart quotes.

This text fails at the phrase "it’s also a good choice":

Ilex crenata ‘Convexa’ looks nothing like a traditional holly, but instead looks like a box (Buxus) plant. It can be used as an alternative to box, where box blight and box tree moth have become a problem, as the foliage is tightly arranged and it tolerates regular pruning. It’s slow-growing nature makes it an excellent candidate for anyone looking for structure without needing to carry out too much maintenance. If you have a balcony or patio, it’s also a good choice for containers.

But note that if I change to it's (normal single quote), everything works as expected. The odd thing is that ‘Convexa’ (which includes matched smart quotes) works just fine and causes no issues.

I appreciate this is happening in the Algolia library and not yours, but if there's a workaround you could add that means the client can still use smart quotes, that would be great.

lukew-cogapp commented 2 months ago

Never mind, we've found the real issue: code elsewhere was splitting strings after 3000 characters, and if the split fell on a special character (like a smart/curly quote), it would create a broken character. I'll close this off.
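
For anyone landing here with the same error, here is a minimal sketch of that failure mode. It is a language-neutral illustration in Python, not the project's actual code. It assumes the split counted bytes rather than characters (which is how PHP's `substr` behaves); the thread does not say which function did the splitting. The helper names `truncate_bytes_naive` and `truncate_utf8_safe` are made up for this example.

```python
import json


def truncate_bytes_naive(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding at a fixed byte offset.

    A curly quote is three bytes in UTF-8, so the cut can land inside it
    and leave a dangling partial character at the end.
    """
    return text.encode("utf-8")[:limit]


def truncate_utf8_safe(text: str, limit: int) -> str:
    """Cut to at most `limit` bytes, dropping any partial trailing character.

    The input is a valid string, so only the tail of the slice can be
    invalid; errors="ignore" discards just that incomplete sequence.
    """
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


sample = "it’s also a good choice"

# Bytes 0-1 are "it"; bytes 2-4 are the curly quote. A limit of 3 splits it.
broken = truncate_bytes_naive(sample, 3)
try:
    broken.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    # The same kind of malformed input that json_encode rejects.
    split_is_valid = False

safe = truncate_utf8_safe(sample, 3)
payload = json.dumps({"mainContent": safe}, ensure_ascii=False)
```

With a straight apostrophe every character is one byte, so no cut position can break a character. That explains why swapping the quote style appeared to fix it. In PHP, `mb_substr` (counts characters) or `mb_strcut` (counts bytes but cuts on character boundaries) should avoid the broken tail.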

janhenckens commented 2 months ago

Good to hear you found the issue, @lukew-cogapp!