microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License
2.35k stars, 168 forks

provide a way to avoid re2 installation #630

Closed gajus closed 1 year ago

gajus commented 1 year ago

Unless I am missing something, re2 is added as a dependency but it is not actually being used anywhere.

gajus commented 1 year ago

It is a peer dependency of https://github.com/spamscanner/url-regex-safe, turns out.

gajus commented 1 year ago

Is it worth evaluating just using a more basic URL regex?

url-regex-safe is overkill, and compiling re2 takes a ton of time.

gajus commented 1 year ago

It looks like metascraper could get away with simply using new URL()
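For reference, here is a minimal sketch of the kind of check new URL() allows. The helper name isHttpUrl is made up for illustration and is not part of metascraper. It only validates a single candidate string, whereas a URL regex can also find URLs embedded in free text, so the two approaches are not strictly equivalent.

```javascript
// Hypothetical helper (not part of metascraper): returns true when `value`
// parses as an absolute http(s) URL with the WHATWG URL parser built into Node.js.
function isHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    // new URL() throws a TypeError for input it cannot parse as an absolute URL.
    return false;
  }
}
```

Relative paths and non-HTTP schemes such as mailto: are rejected by this sketch, which may or may not match what a given metascraper rule needs.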

Kikobeats commented 1 year ago

Yes, it's used 🙂

If re2 installation fails, url-regex-safe can still work: https://github.com/spamscanner/url-regex-safe/blob/master/src/index.js#L5

gajus commented 1 year ago

@Kikobeats But is it needed given that you can achieve the same check using new URL()?

Kikobeats commented 1 year ago

Unfortunately it's needed. Maybe we can find a mechanism to skip re2 installation, like using an environment variable.

gajus commented 1 year ago

Could we update the metascraper API so that users can provide their own RegExp constructor, and instruct them to use re2 if possible? For example:

const SafeRegExp = (() => {
  try {
    const RE2 = require('re2');
    return typeof RE2 === 'function' ? RE2 : RegExp;
  } catch {
    return RegExp;
  }
})();

const metascraper = createMetascraper([
    createMetascraperTitleRule(),
    createMetascraperDescriptionRule(),
    createMetascraperImageRule(),
], {RegExp: SafeRegExp});

Kikobeats commented 1 year ago

Can you paste the installation error you are getting?

Theoretically, it should work even if re2 installation failed.

gajus commented 1 year ago

It is not the installation error, it is the

  1. compile time and
  2. not being able to bundle code due to native dependency

RE 1:

If you create a Dockerfile with:

FROM node:19-bullseye

RUN npm install --global pnpm@^7.30.3 turbo@^1

RUN pnpm install re2

then just the RUN pnpm install re2 step takes 2 minutes and 5 seconds, which adds up to a lot of build time overhead. For context, removing metascraper (and therefore re2) from our dependency tree reduces the install time down to less than 30 seconds for all dependencies.

RE 2:

We use esbuild to bundle our Node.js services, and re2 is currently the only dependency that cannot be bundled because it is a native dependency. That forces us to run pnpm install in the production build, which we otherwise would not need to do at all because all the other code is already bundled.
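One way this could be approached at the bundler level is sketched below: an esbuild plugin that replaces re2 with an empty stub so no native module ends up in the bundle. This is only a sketch under assumptions. The plugin name stub-re2 is made up, and it only helps if every consumer guards its require('re2') and falls back to the built-in RegExp when the loaded value is not a function. Whether all of metascraper's code paths behave that way is not confirmed in this thread. Falling back to RegExp also gives up the protection against catastrophic backtracking that re2 is there to provide.

```javascript
// Sketch of an esbuild plugin that swaps the native `re2` package for an empty stub.
// Assumption: consumers wrap require('re2') in a guard and use RegExp when the
// loaded value is not a constructor function.
const stubRe2Plugin = {
  name: 'stub-re2', // hypothetical plugin name
  setup(build) {
    // Redirect any import of exactly "re2" into a private namespace...
    build.onResolve({ filter: /^re2$/ }, (args) => ({
      path: args.path,
      namespace: 'stub-re2',
    }));
    // ...and serve an empty CommonJS module for it, so nothing native is bundled.
    build.onLoad({ filter: /.*/, namespace: 'stub-re2' }, () => ({
      contents: 'module.exports = {};',
      loader: 'js',
    }));
  },
};

// Usage (entry and output paths are placeholders):
// require('esbuild').build({
//   entryPoints: ['src/index.js'],
//   bundle: true,
//   platform: 'node',
//   outfile: 'dist/index.js',
//   plugins: [stubRe2Plugin],
// });
```

Because the stub exports a plain object rather than a function, a guard like the SafeRegExp check in this thread would select RegExp.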

Kikobeats commented 1 year ago

What if you use a pnpm override to avoid installing it? https://pnpm.io/package_json#pnpmoverrides

I don't feel comfortable removing re2 since it's there for a good reason, but I have to explore a mechanism to opt-out if that's the thing the user wants.

To me, something like RE2_SKIP_INSTALL=true sounds ideal

gajus commented 1 year ago

What if you use a pnpm override to avoid installing it? https://pnpm.io/package_json#pnpmoverrides

It is not ideal, but we can.

I say it is not ideal because overriding internal dependencies tends to introduce very hard-to-debug issues in the long term. Been there a few times.

gajus commented 1 year ago

I don't feel comfortable removing re2 since it's there for a good reason, but I have to explore a mechanism to opt-out if that's the thing the user wants.

To me something like RE2_SKIP_INSTALL=true sounds ideal

I would make this more explicit, like METASCRAPER_RE2_SKIP_INSTALL=true

gajus commented 1 year ago

@Kikobeats I was confused why we are still installing it, but it looks like re2 is actually a hard dependency of @metascraper/helpers

  /@metascraper/helpers@5.33.5:
    resolution: {integrity: sha512-gcULKpM00CNxlf7iWRTi4hQQIXWQUjeFal0V5U60C4P4YyfLXfjuQVBk6mmKSYENSRh7oBQhAR+YVnMalVWBcw==}
    engines: {node: '>= 12'}
    dependencies:
      audio-extensions: 0.0.0
      chrono-node: 2.5.0
      condense-whitespace: 2.0.0
      entities: 4.4.0
      file-extension: 4.0.5
      has-values: 2.0.1
      image-extensions: 1.1.0
      is-relative-url: 3.0.0
      is-uri: 1.2.4
      iso-639-3: 2.2.0
      isostring: 0.0.1
      jsdom: 21.1.1
      lodash: 4.17.21
      memoize-one: 6.0.0
      microsoft-capitalize: 1.0.5
      mime: 3.0.0
      normalize-url: 6.1.0
      re2: 1.18.0
      smartquotes: 2.3.2
      tldts: 5.7.103
      url-regex-safe: 3.0.0(re2@1.18.0)
      video-extensions: 1.2.0
    transitivePeerDependencies:
      - bluebird
      - bufferutil
      - canvas
      - supports-color
      - utf-8-validate
    dev: true

If this can at least be made a peer dependency, then we can just skip installing it.
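Until then, a pnpm readPackage hook is one way to get a similar effect from the consumer side. The following is a minimal sketch of a .pnpmfile.cjs that drops re2 from the dependencies of @metascraper/helpers at install time. It is a different mechanism from the pnpm overrides suggested above and carries the same maintenance caveat. It also assumes the installed code tolerates re2 being absent, which is not confirmed here, and pnpm may still warn about the unmet re2 peer dependency of url-regex-safe.

```javascript
// .pnpmfile.cjs (sketch): remove re2 from @metascraper/helpers before pnpm resolves it.
// Assumption: the package still works at runtime when re2 cannot be required.
function readPackage(pkg, context) {
  if (pkg.name === '@metascraper/helpers' && pkg.dependencies && pkg.dependencies.re2) {
    delete pkg.dependencies.re2;
    context.log('Removed re2 from @metascraper/helpers dependencies');
  }
  // pnpm expects the (possibly modified) manifest to be returned.
  return pkg;
}

module.exports = { hooks: { readPackage } };
```

The hook applies to every resolved manifest, so the name check keeps it from touching any other package that depends on re2.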

gajus commented 1 year ago

This was painful but hopefully helps others https://github.com/uhop/node-re2/issues/163#issuecomment-1507493313

Kikobeats commented 1 year ago

Yes, it's a hard dependency. I want to keep it unless you explicitly opt out by passing METASCRAPER_RE2_SKIP_INSTALL

fieztazica commented 1 year ago

Is there a proper way to fix re2? I'm using Next.js 13 and get a re2.node error when compiled.

loteoo commented 1 year ago

Looking forward to that METASCRAPER_RE2_SKIP_INSTALL here 💪 :) - it's preventing me from using this in a project

JaleelB commented 1 year ago

Is there a proper way to fix re2? I'm using Next.js 13 and get a re2.node error when compiled.

I was having the same issue with Next 13. I tried excluding re2 from the webpack bundle, but that didn't help. I just built my own solution using Cheerio: grab the HTML from your request URL, load it with Cheerio, and search for the specific metadata you need. Of course, add appropriate fallbacks where necessary to improve accuracy. It's pretty simple, and you won't have to deal with that re2 node error.

Here is an example of what I did:

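"use server"
import * as cheerio from 'cheerio';

// Returns the first image URL found in the page metadata, or undefined if none is present.
export const getMetaImage = (html: string): string | undefined => {
  const $ = cheerio.load(html);
  // Prefer Open Graph, then Twitter Cards, then common theme-specific image classes.
  return $('meta[property="og:image"]').attr('content')
    ?? $('meta[name="twitter:image"]').attr('content')
    ?? $('.post-image').attr('src')
    ?? $('.entry-image').attr('src');
};

zaosoula commented 1 year ago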

+1

jcurlier commented 1 year ago

+1

just started having a related issue building a node backend service that uses metascraper

#!/bin/bash -eo pipefail
docker build --rm=false -t gcr.io/${GCP_PROJECT}/${SERVICE_NAME}:$CIRCLE_TAG .
Sending build context to Docker daemon  13.47MB
Step 1/8 : FROM node:18.17.0-alpine
18.17.0-alpine: Pulling from library/node

a8db6415: Pulling fs layer 
2f1a5d31: Pulling fs layer 
b7606c1a: Pulling fs layer 
Digest: sha256:58878e9e1ed3911bdd675d576331ed8838fc851607aed3bb91e25dfaffab3267
Status: Downloaded newer image for node:18.17.0-alpine
 ---> f1fac320ae0c
Step 2/8 : WORKDIR /home/app
 ---> Running in 048cc5001d44
 ---> 8fcfa90cc1b8
Step 3/8 : COPY package.json package-lock.json /home/app/
 ---> d8292bd4b24a
Step 4/8 : RUN npm config set update-notifier false
 ---> Running in 709aca441111
 ---> 0ffe21789570
Step 5/8 : RUN npm install --quiet --production
 ---> Running in 81143fdafca6
npm WARN config production Use `--omit=dev` instead.
npm WARN deprecated uglify-es@3.3.9: support for ECMAScript is superseded by `uglify-js` as of v3.13.0
npm ERR! code 1
npm ERR! path /home/app/node_modules/re2
npm ERR! command failed
npm ERR! command sh -c install-from-cache --artifact build/Release/re2.node --host-var RE2_DOWNLOAD_MIRROR --skip-path-var RE2_DOWNLOAD_SKIP_PATH --skip-ver-var RE2_DOWNLOAD_SKIP_VER || npm run rebuild
npm ERR! Trying https://github.com/uhop/node-re2/releases/download/1.20.1/linux-musl-x64-108.br ...
npm ERR! Writing to build/Release/re2.node ...
npm ERR! The verification has failed: building from sources ...
npm ERR! Building locally ...
npm ERR! 
npm ERR! > re2@1.20.1 rebuild
npm ERR! > node-gyp rebuild
npm ERR! 
npm ERR! 
npm ERR! > re2@1.20.1 rebuild
npm ERR! > node-gyp rebuild
npm ERR! gyp ERR! find Python 
npm ERR! gyp ERR! find Python Python is not set from command line or npm configuration

it seems re2 is failing and it is trying to build locally (the build worked before).

jcurlier commented 1 year ago

for info... seems the issue above with re2 and node alpine started with 1.20.1 - see https://github.com/uhop/node-re2/issues/180

Kikobeats commented 1 year ago

Hope this can help: https://github.com/microlinkhq/metascraper/pull/656, although your issue is related to installation time rather than execution time.

aldenquimby commented 5 months ago

Did anyone here find a workable solution to exclude re2 from the build? #656 only helps at runtime, so it doesn't fix the issue for compiled apps (next, webpack, esbuild, vite, etc). Is this patch the best option?