laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

Parsing raw html #71

Closed by arosiclair 7 months ago

arosiclair commented 7 months ago

It would be great if we could pass a raw HTML string to be parsed rather than relying on the library to make a request. Some websites are blocked on my server, so this library will throw HTTP errors. Using a proxy service works well to get around this, but there's currently no option to pass the HTML to urlMetadata().

laurengarcia commented 7 months ago

That's an interesting use-case, will implement.

MartinMalinda commented 7 months ago

This would be useful to me too. I'm using this for link expansion, and a link can point to other content types, such as image/jpeg, in which case this throws.

I'd handle other content types myself and only pass the response to url-metadata when it is an HTML response. For me, it would be more convenient to pass a Response object rather than an HTML string.

laurengarcia commented 7 months ago

OK, thanks for the additional context. I have to think on this some more and on how it fits with the various use-cases. I like the suggestion of using Response.

laurengarcia commented 7 months ago

Updated roadmap FYI @arosiclair @MartinMalinda

https://github.com/laurengarcia/url-metadata/commit/ae90a39ee5f083db92b5b123356085f9407f53aa

laurengarcia commented 7 months ago

Latest update: version 3.5.0 is in production https://www.npmjs.com/package/url-metadata

Ended up keeping it very simple. Added an option to pass in a Response object. Passing in a raw HTML string was trickier because it opens a can of worms around decoding without response headers present. This should fulfill the requirements for both use-cases described above.
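For anyone who only has an HTML string (for example, one returned by a proxy service), one possible workaround is to wrap it in a standard Fetch API Response with an explicit Content-Type header, then hand that to the `parseResponseObject` option shown in the sample usage. This is a minimal sketch, not something confirmed in this thread: `htmlToResponse` is a hypothetical helper name, and it assumes a runtime with a global `Response` (Node.js 18+ or a browser).

```javascript
// Hypothetical helper (not part of url-metadata): wrap a raw HTML string in a
// Fetch API Response so that it carries an explicit content type and charset.
// A string body is always encoded as UTF-8 by Response, so the charset is fixed.
function htmlToResponse(html) {
  return new Response(html, {
    status: 200,
    headers: { 'Content-Type': 'text/html; charset=utf-8' },
  });
}

// Possible usage, assuming url-metadata >= 3.5.0 is installed:
//   const urlMetadata = require('url-metadata');
//   const metadata = await urlMetadata(null, {
//     parseResponseObject: htmlToResponse('<title>Hello</title>'),
//   });
```

Because the caller sets the header, the decoding question is settled before the library sees the body; the trade-off is that the string must already be correctly decoded text.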

Implementation details in https://github.com/laurengarcia/url-metadata/pull/74

Updated the README and tests; refer to those if you need more details. Sample usage:

const urlMetadata = require('url-metadata');

// Alternate use-case: parse a Response object instead
(async function () {
  try {
    const url = 'https://www.npmjs.com/package/url-metadata';
    // fetch the url in your own code
    const response = await fetch(url);
    // ... do other stuff with it...
    // pass the `response` object to be parsed for its metadata
    const metadata = await urlMetadata(null, { parseResponseObject: response });
    console.log(metadata);
  } catch (err) {
    console.error(err);
  }
})();
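
For the link-expansion use-case described earlier in the thread, the content type can be checked before the Response is handed over, so non-HTML links (image/jpeg and so on) never reach the parser. This is a rough sketch under assumptions: `isHtmlResponse` and `expandLink` are hypothetical names, the shape of the returned object is made up for illustration, and `urlMetadata` is passed in as an argument rather than imported.

```javascript
// Hypothetical helper: true when the Content-Type header says the body is HTML.
function isHtmlResponse(response) {
  const contentType = response.headers.get('content-type') || '';
  return contentType.toLowerCase().includes('text/html');
}

// Hypothetical wrapper for link expansion. `urlMetadata` is injected so the
// sketch does not depend on how the package is loaded, e.g.
//   const urlMetadata = require('url-metadata');
async function expandLink(url, urlMetadata, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  const contentType = response.headers.get('content-type');

  if (!isHtmlResponse(response)) {
    // Images, PDFs, etc.: handle these yourself instead of letting the parser throw.
    return { url, contentType, metadata: null };
  }

  // Only HTML responses are passed to url-metadata.
  const metadata = await urlMetadata(null, { parseResponseObject: response });
  return { url, contentType, metadata };
}
```

Calling `expandLink(url, urlMetadata)` then gives back either the parsed metadata or `metadata: null` plus the content type, which the caller can use to decide how to render the link.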