ulixee / hero

The web browser built for scraping
MIT License

Unable to convert PDF files to arrayBuffer using Response.arrayBuffer(). #275

Open Cmeesh11 opened 2 weeks ago

Cmeesh11 commented 2 weeks ago

Using hero.fetch, I'm able to retrieve a PDF file from a site. I want to convert the response to a buffer so I can save it, but I get this error every time:

2024-06-14T17:21:25.757Z ERROR [hero-core/connections/ConnectionToHeroClient] ConnectionToClient.HandleRequestError {
  context: {},
  sessionId: 'ErM0QZuIvKRJrfdV2GwaK',
  sessionName: undefined
} InjectedScriptError: InvalidCharacterError: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
    at JsPath.runJsPath (/Users/cartermichaud/Development/PortalIntegrations/talon-portal-integrations-hero/agent/main/lib/JsPath.ts:165:13)
    at async FrameEnvironment.execJsPath (/Users/cartermichaud/Development/PortalIntegrations/talon-portal-integrations-hero/node_modules/core/lib/FrameEnvironment.ts:246:12)
    at async CommandRecorder.runCommandFn (/Users/cartermichaud/Development/PortalIntegrations/talon-portal-integrations-hero/node_modules/core/lib/CommandRecorder.ts:90:16)
    at async CommandRunner.runFn (/Users/cartermichaud/Development/PortalIntegrations/talon-portal-integrations-hero/node_modules/core/lib/CommandRunner.ts:36:14)
    at async ConnectionToHeroClient.executeCommand (/Users/cartermichaud/Development/node_modules/core/connections/ConnectionToHeroClient.ts:258:12)
    at async ConnectionToHeroClient.handleRequest (/Users/cartermichaud/Development/node_modules/core/connections/ConnectionToHeroClient.ts:66:14) {
  pathState: { step: [ 'arrayBuffer' ], index: 2 }
}
CLAIMS ERROR: InjectedScriptError: InvalidCharacterError: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.

Here is the request I'm making:

const res = await hero.fetch( url, {
  "method" : "get",
  "headers" : {
    "Accept" : "application/pdf",
    "Authorization" : `Bearer ${this.portal.authToken}`
  },
  "credentials" : "include"
} );

const buffer = await res.arrayBuffer(); // Throws error here

blakebyrnes commented 2 weeks ago

Thanks! I think I see the issue in the DOM type serializer layer. btoa apparently only works with characters in the Latin1 range (code points 0 to 255).

Cmeesh11 commented 2 weeks ago

Exactly, and I believe PDF files are all going to contain non-ASCII bytes, so that makes sense!
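
For reference, the btoa constraint discussed here can be reproduced outside Hero. The sketch below assumes an environment with a global btoa (browsers, or Node 16 and later); the sample bytes are made up for illustration. It shows that non-ASCII bytes are accepted as long as each one becomes a single character in the 0 to 255 range, and that only characters above U+00FF are rejected.

```javascript
// Build a "binary string": one character per byte, all code points in 0..255.
// The sample bytes are illustrative: "%PDF" followed by two high (non-ASCII) bytes.
const bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xe2, 0xff]);
const binaryString = Array.from(bytes, (byte) => String.fromCharCode(byte)).join('');

// Every character is within Latin1, so btoa accepts it even though
// 0xe2 and 0xff are not ASCII.
const base64 = btoa(binaryString);

// A character above U+00FF is rejected with InvalidCharacterError.
let rejectedName = null;
try {
  btoa('\u20ac'); // euro sign, code point 0x20AC
} catch (err) {
  rejectedName = err.name;
}

console.log(base64, rejectedName);
```

So the error in the log suggests that somewhere in the serialization path a character above U+00FF ended up in the string handed to btoa, rather than non-ASCII bytes being a problem in themselves.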

blakebyrnes commented 2 weeks ago

Well, then I went into the code and we're already doing this:

const binary = Array.from(new Uint8Array(value.buffer, value.byteOffset, value.byteLength))
  .map(byte => String.fromCharCode(byte))
  .join('');

I think we had to move away from String.fromCharCode(...bytes) because spreading the bytes as varargs breaks once the binary reaches a certain size. Need to figure out what to replace this with.

Cmeesh11 commented 2 weeks ago

If the varargs limit is an issue, I believe TextDecoder works pretty well and is designed to handle larger arrays. You could do something like:

const binary = new Uint8Array(value.buffer, value.byteOffset, value.byteLength);
const decodedString = new TextDecoder('utf-8').decode(binary);

But just a suggestion.

blakebyrnes commented 2 weeks ago

Good suggestion. I was exploring that as well. I think (you might try to modify your node_modules to check) that your binary is encoded in latin1. If that's the case, this will probably break your encoding, so you'd end up with something weird like:

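let decodedString;
try {
  // Attempt strict UTF-8 decoding; without `fatal: true`, invalid bytes
  // are replaced with U+FFFD instead of throwing
  decodedString = new TextDecoder('utf-8', { fatal: true }).decode(binary);
} catch (e) {
  // Fall back to Latin-1 if UTF-8 decoding fails
  // (note: TextDecoder treats the 'latin1' label as windows-1252)
  decodedString = new TextDecoder('latin1').decode(binary);
}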
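
To illustrate why decoding binary data as UTF-8 is a concern here, the following sketch (sample bytes are made up, not taken from the reporter's PDF) compares a default UTF-8 decode with a plain byte-to-character mapping. It assumes a runtime with a global TextDecoder, such as a browser or a recent Node version.

```javascript
// Illustrative bytes: "%PDF" followed by two bytes that are not valid UTF-8.
const original = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xe2, 0xff]);

// Default (non-fatal) UTF-8 decoding replaces invalid sequences with U+FFFD
// instead of throwing, so the original bytes can no longer be recovered.
const asUtf8 = new TextDecoder('utf-8').decode(original);
const hasReplacement = asUtf8.includes('\ufffd');

// Mapping each byte to the code point of the same value keeps the data
// recoverable and keeps every character inside the range btoa accepts.
const asBinaryString = Array.from(original, (byte) => String.fromCharCode(byte)).join('');
const restored = Uint8Array.from(asBinaryString, (ch) => ch.charCodeAt(0));
const roundTrips =
  restored.length === original.length &&
  restored.every((byte, i) => byte === original[i]);

console.log(hasReplacement, roundTrips);
```

The replacement character U+FFFD is itself outside the Latin1 range, so a string produced this way would likely trigger the same btoa error again.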

I think we could probably also avoid the variadic spread by doing:

const dataArray = Array.from(new Uint8Array(value.buffer, value.byteOffset, value.byteLength));
const binary = String.fromCharCode.apply(null, dataArray);

This would be in TypeSerializer in @ulixee/commons in node_modules.
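
A note on the apply variant: Function.prototype.apply still passes every array element as a separate argument, so it is subject to the same engine-dependent argument limit as the spread form. One common workaround is to convert the bytes in fixed-size chunks. The sketch below is not the fix that landed in @ulixee/commons; the helper name bytesToBinaryString and the chunk size are assumptions chosen for illustration.

```javascript
// Hypothetical helper (not from @ulixee/commons): converts bytes to a
// binary string in chunks so no single call receives too many arguments.
function bytesToBinaryString(bytes, chunkSize = 0x8000) {
  let binary = '';
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray returns a view over the same memory rather than a copy,
    // and apply accepts any array-like, including typed arrays.
    binary += String.fromCharCode.apply(null, bytes.subarray(offset, offset + chunkSize));
  }
  return binary;
}

// Usage: a buffer large enough that a single spread or apply call could
// exceed the argument limit on some engines.
const large = new Uint8Array(300000).map((_, i) => i % 256);
const base64Large = btoa(bytesToBinaryString(large));
console.log(base64Large.length);
```

The chunk size of 0x8000 is a conservative guess rather than a documented limit; the per-byte map and join version already in the code avoids the limit as well, at the cost of creating one small string per byte.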

blakebyrnes commented 1 week ago

Realized I hadn't checked in a fix for this to the commons project. It's in there now if you want to try it out, or you can wait for the next release.