laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

Handling specific encodings dropped? #70

Closed openminded-oscar closed 9 months ago

openminded-oscar commented 9 months ago

The previous version, 2.5.0, had a decode parameter for handling specific encodings. How should that be done in 3.3.0? The problem reproduces with the windows-1251 encoding: the text is unreadable after it is fetched.

Thank you!

openminded-oscar commented 9 months ago

https://developer.mozilla.org/en-US/docs/Web/API/Response/text

laurengarcia commented 9 months ago

Thanks for your note. I actually removed it in version >3.0 just to simplify the API, because I thought nobody used it but me; almost nobody brought it up here in all those years. I am happy to add it back in a new release. Give me a few days. Sorry about that.

openminded-oscar commented 9 months ago

Thank you for your work! Would be great!

openminded-oscar commented 9 months ago

What I was thinking of is extracting the charset from the content-type header and passing it to the decode function along with the buffer (as it was before):

return fetch(url, requestOpts)
    .then((response) => {
      if (!response.ok) {
        throw new Error(`response code ${response.status}`);
      }

      // rewrite url if our request had to follow redirects to resolve the
      // final link destination (for example: links shortened by bit.ly)
      if (response.url) url = response.url;

      const contentType = response.headers.get('content-type');
      const isText = contentType && contentType.startsWith('text');
      const isHTML = contentType && contentType.includes('html');
      // extract charset (depends on iconv); fall back to UTF-8 when the
      // charset is missing or not supported
      const charsetExtracted = extractCharset(contentType);
      const charset = (charsetExtracted && iconv.encodingExists(charsetExtracted))
        ? charsetExtracted
        : 'UTF-8';

      if (!isText || !isHTML) {
        throw new Error(`unsupported content type: ${contentType}`);
      }

      return response.arrayBuffer()
        .then(buffer => iconv.decode(Buffer.from(buffer), charset, {defaultEncoding: 'UTF-8'}));
    })
    .then((body) => {
      return parse(url, body, opts);
    });
}

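function extractCharset(contentType) {
  // the content-type header may be missing entirely
  if (!contentType) return null;

  // parameter names are case-insensitive and the value may be quoted
  const charsetRegex = /charset=["']?([^\s;"']+)/i;
  const match = contentType.match(charsetRegex);

  return match ? match[1] : null;
}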
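
To try the idea above without a network request, here is a minimal, self-contained sketch. It is not the library's implementation: it swaps iconv for the built-in TextDecoder, which only knows windows-1251 when Node.js is built with full ICU (assumed here), and decodeBody and sample are hypothetical names used for illustration only.

```javascript
// Illustrative sketch only; decodeBody and sample are hypothetical names.
// Uses the built-in TextDecoder instead of iconv (assumes a Node.js build
// with full ICU, or a browser, so that windows-1251 is available).

function extractCharset(contentType) {
  // the content-type header may be missing entirely
  if (!contentType) return null;
  // tolerate mixed case, optional whitespace and optional quotes
  const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

function decodeBody(bytes, contentType) {
  const charset = extractCharset(contentType) || 'utf-8';
  let decoder;
  try {
    decoder = new TextDecoder(charset);
  } catch (err) {
    // TextDecoder throws a RangeError for labels it does not recognize,
    // so fall back to UTF-8 in that case
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(bytes);
}

// The Russian word "Привет" encoded as windows-1251 bytes
const sample = Uint8Array.from([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
console.log(decodeBody(sample, 'text/html; charset=windows-1251'));
```

With a real response, the bytes would come from response.arrayBuffer() and the header from response.headers.get('content-type'), as in the snippet above.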
laurengarcia commented 9 months ago

Version 3.4.0 is in production and supports this now: https://www.npmjs.com/package/url-metadata
Implementation details are here: https://github.com/laurengarcia/url-metadata/pull/72

laurengarcia commented 9 months ago

Tightened up the regexes in lib/extract-charset.js; check the latest on master.