serpapi / serpapi-javascript

Scrape and parse search engine results using SerpApi.
https://serpapi.com
MIT License

Multi-byte character corruption on chunk boundaries #22

Closed martin-serpapi closed 4 months ago

martin-serpapi commented 4 months ago

A customer reported that the response from our Google Maps Reviews API sometimes contains a question mark symbol in snippet or extracted_snippet.original. I was able to reproduce it using our serpapi package:

serpapi package code

However, the issue is not reproducible when sending requests to our Google Maps Reviews API directly with axios:

axios code

Intercom

LrDaniel commented 4 months ago

This can also happen in the name field.

Freaky commented 4 months ago

I suspect this is because we read the HTTP response in chunks and append each chunk directly to a string. If a chunk happens to end in the middle of a multibyte character, JS treats the partial byte sequence as corrupt and inserts a Unicode Replacement Character (U+FFFD). This would explain why they come in pairs.
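To illustrate the suspected failure mode outside of the package, here is a minimal sketch that splits a UTF-8 payload in the middle of a two-byte character and decodes it both ways. The sample string, the split point, and the helper names `concatNaively` and `concatWithDecoder` are made up for illustration and are not part of the serpapi code; only `Buffer` and `StringDecoder` are real Node APIs.

```typescript
import { Buffer } from "node:buffer";
import { StringDecoder } from "node:string_decoder";

// "é" is encoded as two bytes in UTF-8 (0xC3 0xA9).
// Split the payload between those two bytes to mimic an unlucky chunk boundary.
const payload = Buffer.from("café", "utf8");
const chunks: Buffer[] = [payload.subarray(0, 4), payload.subarray(4)];

// Hypothetical helper: decodes every chunk on its own, which is what
// `data += chunk` does implicitly when `chunk` is a Buffer.
function concatNaively(parts: Buffer[]): string {
  let data = "";
  for (const part of parts) {
    data += part.toString("utf8");
  }
  return data;
}

// Hypothetical helper: StringDecoder buffers an incomplete trailing sequence
// and completes it once the next chunk arrives.
function concatWithDecoder(parts: Buffer[]): string {
  const decoder = new StringDecoder("utf8");
  let data = "";
  for (const part of parts) {
    data += decoder.write(part);
  }
  // Flush any bytes still held back at the end of the stream.
  return data + decoder.end();
}

const naive = concatNaively(chunks);
const decoded = concatWithDecoder(chunks);

console.log(naive.includes("\uFFFD")); // replacement characters appear
console.log(decoded === "café"); // the original text survives
```

If this matches what happens in the package, any chunk boundary that falls inside a multibyte character would corrupt that character, while single-byte ASCII content would never be affected.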

I'm still waiting for Deno to build, but I suspect something like this is needed:

diff --git src/utils.ts src/utils.ts
index ef49a2a..99aecb6 100644
--- src/utils.ts
+++ src/utils.ts
@@ -1,6 +1,7 @@
 import { version } from "../version.ts";
 import https from "node:https";
 import qs from "node:querystring";
+import { StringDecoder } from "node:string_decoder";
 import { RequestTimeoutError } from "./errors.ts";

 /**
@@ -63,15 +64,17 @@ export function execute(
   return new Promise((resolve, reject) => {
     let timer: number;
     const req = https.get(url, (resp) => {
+      const decoder = new StringDecoder("utf8");
       let data = "";

       // A chunk of data has been recieved.
       resp.on("data", (chunk) => {
-        data += chunk;
+        data += decoder.write(chunk);
       });

       // The whole response has been received. Print out the result.
       resp.on("end", () => {
+        data += decoder.end();
         try {
           if (resp.statusCode == 200) {
             resolve(data);