laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

Passing in HTML? #75

Closed ajmas closed 7 months ago

ajmas commented 7 months ago

While playing with this I've run into problem sites like www.crunchyroll.com, where the page metadata is not available until the page is rendered. For this reason I am looking to use puppeteer in certain scenarios to render the page and then get the HTML, though from what I can see I can't pass this HTML to url-metadata.

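  import puppeteer from 'puppeteer';

  // Render the page in a headless browser and return the resulting HTML.
  async function getRenderedHtml (pageUrl: string): Promise<string> {
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      await page.goto(pageUrl);
      // Wait until the description meta tag exists (throws after 5 seconds).
      await page.waitForSelector('meta[name=description]', { timeout: 5000 });
      return await page.content();
    } finally {
      // Always close the browser, even if navigation or the wait fails.
      await browser.close();
    }
  }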

Is there any way I could pass the HTML to url-metadata, so that it can process the content and provide the parsed metadata?

BTW I did see the 'alternate' use-case, with parseResponseObject, so I will see whether I can create a compatible response object using just the HTML I already have:

// Alternate use-case: parse a Response object instead
try {
  // fetch the url in your own code
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  // ... do other stuff with it...
  // pass the `response` object to be parsed for its metadata
  const metadata = await urlMetadata(null, { parseResponseObject: response });
  console.log(metadata);
} catch (err) {
  console.log(err);
}

ajmas commented 7 months ago

So exploring a bit more I can create a response object for my HTML this way:

      const response = new Response(html, {
        headers: { 'Content-Type': 'text/html' }
      });

The problem now is that parseResponseObject is not part of urlMetadata.Options in the TypeScript definitions in index.d.ts. I can get around this by casting the options to any, but that isn't ideal.

I'll open a specific ticket for the issue. I was using url-metadata 3.5.2. Now opened: https://github.com/laurengarcia/url-metadata/issues/76
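
For anyone landing here with the same problem, the wrapping step can be sketched as a small helper. This is only a sketch: `htmlToResponse` and `sampleHtml` are hypothetical names for illustration and not part of url-metadata, and it assumes a runtime with a global Fetch API `Response` (a browser or a recent Node.js).

```typescript
// Hypothetical helper (not part of url-metadata): wrap an HTML string
// in a standard Fetch API Response so it can be handed to code that
// expects the result of fetch().
function htmlToResponse(html: string): Response {
  return new Response(html, {
    status: 200,
    headers: { 'Content-Type': 'text/html' },
  });
}

// Example input standing in for the string returned by page.content().
const sampleHtml =
  '<html><head><meta name="description" content="Example page"></head><body></body></html>';

const wrapped = htmlToResponse(sampleHtml);
```

The wrapped object can then be passed as `parseResponseObject` in the options, with `null` as the URL, as in the alternate use-case quoted earlier. With url-metadata 3.5.2 the options still need a cast because of the typings gap described above.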

laurengarcia commented 7 months ago

Glad you figured it out.

I'll add an example in the /test dir for the use-case above where you created a Response object for your html string and link it in the README. Thanks for bringing this to my attention.