hardkoded / puppeteer-sharp

Headless Chrome .NET API
https://www.puppeteersharp.com
MIT License

page.GetContentAsync throwing Cannot read incomplete UTF-16 JSON text as string with missing low surrogate #2775

Closed Tiggerito closed 1 month ago

Tiggerito commented 1 month ago

Description

Navigating to some pages causes the GetContentAsync method to throw an exception.

var options = new LaunchOptions { /*  */ };
var chromiumRevision = BrowserFetcher.DefaultRevision;
var browser = await Puppeteer.LaunchAsync(options, chromiumRevision);
var page = await browser.NewPageAsync();
await page.GoToAsync("https://domain.com/");
var content = await page.GetContentAsync(); // exception

In the URL above, replace domain with getglowingnowskincare.

Expected behavior:

The content is returned.

Actual behavior:

The following exception is thrown:

The JSON value could not be converted to System.String. Path: $ | LineNumber: 0 | BytePositionInLine: 751401. | Cannot read incomplete UTF-16 JSON text as string with missing low surrogate. 
at System.Text.Json.ThrowHelper.ReThrowWithPath(ReadStack& state, Utf8JsonReader& reader, Exception ex) 
at System.Text.Json.Serialization.JsonConverter`1.ReadCore(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state) 
at System.Text.Json.JsonSerializer.ReadFromSpan[TValue](ReadOnlySpan`1 utf8Json, JsonTypeInfo`1 jsonTypeInfo, Nullable`1 actualByteCount) 
at System.Text.Json.JsonSerializer.Deserialize[TValue](JsonElement element, JsonSerializerOptions options) 
at PuppeteerSharp.Helpers.Json.JsonHelper.ToObject[T](JsonElement element, JsonSerializerOptions options) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/Json/JsonHelper.cs:line 53 
at PuppeteerSharp.Helpers.RemoteObjectHelper.ValueFromType[T](JsonElement value, RemoteObjectType objectType, Boolean stringify) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/RemoteObjectHelper.cs:line 74 
at PuppeteerSharp.Helpers.RemoteObjectHelper.ValueFromRemoteObject[T](RemoteObject remoteObject, Boolean stringify) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/RemoteObjectHelper.cs:line 15 
at PuppeteerSharp.ExecutionContext.RemoteObjectTaskToObject[T](Task`1 remote) 
at PuppeteerSharp.IsolatedWorld.EvaluateFunctionAsync[T](String script, Object[] args) 

Versions

PuppeteerSharp 19.0.2, net8.0

Solution

I believe there was a recent change in which JSON parser is used, which may have introduced this issue.

The exception is caused by ill-formed UTF-16 in the page content: a high surrogate that is not followed by its low surrogate.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#utf-16_characters_unicode_code_points_and_grapheme_clusters

This can be fixed by calling toWellFormed() on the string before it is returned; it replaces each lone surrogate with U+FFFD.

I created my version of GetContentAsync with the following line changed, and the content was successfully returned:

content += document.documentElement.outerHTML.toWellFormed();

kblok commented 1 month ago

Do you have some HTML we can use as an example for a test?

Tiggerito commented 1 month ago

getglowingnowskincare(dot)com is an example.

I tried finding a way to make the JSON serializer more forgiving, but I have not found a solid solution yet.

kblok commented 1 month ago

What do you think about this?

Tiggerito commented 1 month ago

I like the idea of making it an option. That way, people can test for the issue.
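
One possible way to test a page for the issue is to scan its HTML for lone surrogates before deciding whether to sanitize it. The sketch below is only an illustration: findLoneSurrogates is a hypothetical helper, not part of PuppeteerSharp, and it uses plain charCodeAt so it does not depend on isWellFormed() being available.

```javascript
// Hypothetical helper (not part of PuppeteerSharp): returns the UTF-16
// code unit indexes of every lone surrogate in `text`.
function findLoneSurrogates(text) {
  const indexes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      // charCodeAt returns NaN past the end, which fails the range check.
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip the low surrogate
      } else {
        indexes.push(i); // high surrogate with no low surrogate after it
      }
    } else if (isLow) {
      indexes.push(i); // low surrogate with no high surrogate before it
    }
  }
  return indexes;
}

console.log(findLoneSurrogates("a\uD83Db"));      // [ 1 ]
console.log(findLoneSurrogates("\uD83D\uDE00"));  // []
```

In a page, the same function could be evaluated against document.documentElement.outerHTML; an empty result would mean the content has no lone surrogates to repair.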