Closed by openminded-oscar 9 months ago
Thanks for your note. I actually removed it for versions >3.0 just to simplify the API, because I thought nobody used it but me; almost nobody brought it up here in all those years. So I am happy to add it back in a new release. Give me a few days. Sorry about that.
Thank you for your work! Would be great!
What I was thinking is extracting the charset from the Content-Type header and passing it to the decode function along with the buffer (as it was before):
  return fetch(url, requestOpts)
    .then((response) => {
      if (!response.ok) {
        throw new Error(`response code ${response.status}`);
      }
      // rewrite url if our request had to follow redirects to resolve the
      // final link destination (for example: links shortened by bit.ly)
      if (response.url) url = response.url;
      const contentType = response.headers.get('content-type');
      const isText = contentType && contentType.startsWith('text');
      const isHTML = contentType && contentType.includes('html');
      // reject non-HTML responses first, so a missing content-type header
      // (null) is never passed on to extractCharset
      if (!isText || !isHTML) {
        throw new Error(`unsupported content type: ${contentType}`);
      }
      // extract charset, depends on iconv; fall back to UTF-8 when the
      // header names no charset or one that iconv does not know
      const charsetExtracted = extractCharset(contentType);
      const charset = (charsetExtracted && iconv.encodingExists(charsetExtracted)) ? charsetExtracted : 'UTF-8';
      return response.arrayBuffer()
        .then(buffer => iconv.decode(Buffer.from(buffer), charset, {defaultEncoding: 'UTF-8'}));
    })
    .then((body) => {
      return parse(url, body, opts);
    });
}
function extractCharset(contentType) {
  // a response without a content-type header yields null here
  if (!contentType) return null;
  const charsetRegex = /charset=([^\s;]+)/;
  const match = contentType.match(charsetRegex);
  if (match && match[1]) {
    return match[1];
  } else {
    return null;
  }
}
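For illustration, here is a slightly more tolerant variant of the helper above. It is only a sketch: the name `extractCharsetSketch` is hypothetical and this is not the code in the library's lib/extract-charset.js. It assumes the header may spell `charset` in any case and may wrap the value in quotes, and it returns the value lowercased.

```javascript
// Hypothetical helper for illustration; not the library's implementation.
function extractCharsetSketch(contentType) {
  // A missing header (null/undefined) yields no charset.
  if (typeof contentType !== 'string') return null;
  // Match charset=value, case-insensitively, allowing optional quotes
  // around the value and optional whitespace around the equals sign.
  const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

console.log(extractCharsetSketch('text/html; charset=windows-1251')); // windows-1251
console.log(extractCharsetSketch('text/html; Charset="UTF-8"'));      // utf-8
console.log(extractCharsetSketch('text/html'));                       // null
console.log(extractCharsetSketch(null));                              // null
```

The returned name would still need to be checked against the decoder (for example with `iconv.encodingExists`, as in the snippet above) before use.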
Version 3.4.0 is in production and supports this now: https://www.npmjs.com/package/url-metadata Implementation details are here: https://github.com/laurengarcia/url-metadata/pull/72
I tightened up the regexes in lib/extract-charset.js; check the latest on master.
The previous version, 2.5.0, had a decode parameter for handling specific encodings. How should that be done in 3.3.0? The problem reproduces with windows-1251 encoding: the text is unreadable afterwards.
Thank you!
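The symptom in this report (unreadable text from windows-1251 pages) can be reproduced without the library. The sketch below decodes the same bytes twice, once as UTF-8 and once as windows-1251. It uses the built-in `TextDecoder` rather than iconv, and assumes a Node.js build with full ICU support so that the `windows-1251` label is available; the sample bytes are made up for the demonstration.

```javascript
// The word "Привет" encoded as windows-1251 (one byte per letter).
const bytes = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);

// Decoding with the wrong charset: none of these bytes form valid UTF-8
// sequences, so the decoder substitutes U+FFFD replacement characters.
const asUtf8 = new TextDecoder('utf-8').decode(bytes);

// Decoding with the charset the page actually used.
// Assumption: this runtime's TextDecoder supports 'windows-1251'.
const asCp1251 = new TextDecoder('windows-1251').decode(bytes);

console.log(asUtf8.includes('\uFFFD')); // true
console.log(asCp1251);                  // Привет
```

This is why the charset from the Content-Type header has to reach the decode step: once the bytes have been decoded as UTF-8, the original characters cannot be recovered from the resulting string.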