Closed gajus closed 1 year ago
It is a peer dependency of https://github.com/spamscanner/url-regex-safe, turns out.
Is it worth evaluating just using a more basic URL regex?
url-regex-safe
is an overkill; and copiling re2 is taking a ton of time.
It looks like metascraper could get away with simple using new URL()
Yes, it's used 🙂
If r2 installation fails, still url-regex-safe
can work
https://github.com/spamscanner/url-regex-safe/blob/master/src/index.js#L5
@Kikobeats But is it needed given that you can achieve the same check using new URL()
?
Unfortunately it's needed. Maybe we can find a mechanism to skip re2 installation, like using an environment variable.
Could we update metascraper API such that users can provide their own RegExp constructor and instruct them to use re2 if possible?, i.e.
const SafeRegExp = (() => {
try {
const RE2 = require('re2');
return typeof RE2 === 'function' ? RE2 : RegExp;
} catch {
return RegExp;
}
})();
const metascraper = createMetascraper([
createMetascraperTitleRule(),
createMetascraperDescriptionRule(),
createMetascraperImageRule(),
], {RegExp: SafeRegExp});
Can you paste the installation error you are getting?
Theoretically it should be work even re2 installation failed.
It is not the installation error, it is the
RE 1:
If you create an Dockerfile with:
FROM node:19-bullseye
RUN npm install --global pnpm@^7.30.3 turbo@^1
RUN pnpm install re2
then just the RUN pnpm install re2
step takes 2 minutes and 5 seconds, which adds up to a lot of build time overhead. For context, removing metascraper
(and therefore re2) from our dependency tree reduces the install time down to less than 30 seconds for all dependencies.
RE 2:
We use esbuild
to bundle our Node.js services, and currently re2
is the only dependency that cannot be bundled (being that it is a native dependency), which requires to pnpm install
, which otherwise we wouldn't need to do at all in production build because we already have all the other code bundled.
what if you use pnpm override to avoid install it? https://pnpm.io/package_json#pnpmoverrides
I don't feel comfortable removing re2 since it's there for a good reason, but I have to explore a mechanism to opt-out if that's the thing the user wants.
To me, something like RE2_SKIP_INSTALL=true
sounds ideal
what if you use pnpm override to avoid install it? https://pnpm.io/package_json#pnpmoverrides
It is not ideal, but we can.
I say it is not ideal because overriding internal dependencies tends to introduce very hard to debug issues long-term. Been there a few times.
I don't feel comfortable removing re2 since it's there for a good reason, but I have to explore a mechanism to opt-out if that's the thing the user wants.
To me something like
RE2_SKIP_INSTALL=true
sounds ideal
I would make this more explicit, like METASCRAPER_RE2_SKIP_INSTALL=true
@Kikobeats I was confused why we are still installing it, but it looks like re2 is actually a hard dependency of @metascraper/helpers
/@metascraper/helpers@5.33.5:
resolution: {integrity: sha512-gcULKpM00CNxlf7iWRTi4hQQIXWQUjeFal0V5U60C4P4YyfLXfjuQVBk6mmKSYENSRh7oBQhAR+YVnMalVWBcw==}
engines: {node: '>= 12'}
dependencies:
audio-extensions: 0.0.0
chrono-node: 2.5.0
condense-whitespace: 2.0.0
entities: 4.4.0
file-extension: 4.0.5
has-values: 2.0.1
image-extensions: 1.1.0
is-relative-url: 3.0.0
is-uri: 1.2.4
iso-639-3: 2.2.0
isostring: 0.0.1
jsdom: 21.1.1
lodash: 4.17.21
memoize-one: 6.0.0
microsoft-capitalize: 1.0.5
mime: 3.0.0
normalize-url: 6.1.0
re2: 1.18.0
smartquotes: 2.3.2
tldts: 5.7.103
url-regex-safe: 3.0.0(re2@1.18.0)
video-extensions: 1.2.0
transitivePeerDependencies:
- bluebird
- bufferutil
- canvas
- supports-color
- utf-8-validate
dev: true
If this can at least be made a peer dependency, then we can just skip installing it.
This was painful but hopefully helps others https://github.com/uhop/node-re2/issues/163#issuecomment-1507493313
Yes, it's a hard dependency. I want to keep it unless you explicit opt-out passing METASCRAPER_RE2_SKIP_INSTALL
is there a proper way to fix re2? im using nextjs 13 and get re2.node error when compiled
Looking forward to that METASCRAPER_RE2_SKIP_INSTALL
here 💪 :) - it's preventing me from using this in a project
is there a proper way to fix re2? im using nextjs 13 and get re2.node error when compiled
I was having the same issue with next 13. i tried excluding re2 from the webpack bundle but that didn't help. I just built my own solution using Cheerio. Just grab the html from your request url, load it with cheerio, and search for the specific metadata you need. Ofc, add in the appropriate fallbacks where necessary to improve accuracy. It's pretty simple and you wont have to deal with that re2 node error.
Here is an example of what i did:
"use server"
import cheerio from 'cheerio';
export const getMetaImage = (html: string): string => {
const $ = cheerio.load(html);
return $('meta[property="og:image"]').attr('content')!
?? $('meta[name="twitter:image"]').attr('content')
?? $('.post-image').attr('src')
?? $('.entry-image').attr('src');
};
+1
+1
just started having a related issue building a node backend service that uses metascraper
#!/bin/bash -eo pipefail
docker build --rm=false -t gcr.io/${GCP_PROJECT}/${SERVICE_NAME}:$CIRCLE_TAG .
Sending build context to Docker daemon 13.47MB
Step 1/8 : FROM node:18.17.0-alpine
18.17.0-alpine: Pulling from library/node
a8db6415: Pulling fs layer
2f1a5d31: Pulling fs layer
b7606c1a: Pulling fs layer
Digest: sha256:58878e9e1ed3911bdd675d576331ed8838fc851607aed3bb91e25dfaffab3267
Status: Downloaded newer image for node:18.17.0-alpine
---> f1fac320ae0c
Step 2/8 : WORKDIR /home/app
---> Running in 048cc5001d44
---> 8fcfa90cc1b8
Step 3/8 : COPY package.json package-lock.json /home/app/
---> d8292bd4b24a
Step 4/8 : RUN npm config set update-notifier false
---> Running in 709aca441111
---> 0ffe21789570
Step 5/8 : RUN npm install --quiet --production
---> Running in 81143fdafca6
npm WARN config production Use `--omit=dev` instead.
npm WARN deprecated uglify-es@3.3.9: support for ECMAScript is superseded by `uglify-js` as of v3.13.0
npm ERR! code 1
npm ERR! path /home/app/node_modules/re2
npm ERR! command failed
npm ERR! command sh -c install-from-cache --artifact build/Release/re2.node --host-var RE2_DOWNLOAD_MIRROR --skip-path-var RE2_DOWNLOAD_SKIP_PATH --skip-ver-var RE2_DOWNLOAD_SKIP_VER || npm run rebuild
npm ERR! Trying https://github.com/uhop/node-re2/releases/download/1.20.1/linux-musl-x64-108.br ...
npm ERR! Writing to build/Release/re2.node ...
npm ERR! The verification has failed: building from sources ...
npm ERR! Building locally ...
npm ERR!
npm ERR! > re2@1.20.1 rebuild
npm ERR! > node-gyp rebuild
npm ERR!
npm ERR!
npm ERR! > re2@1.20.1 rebuild
npm ERR! > node-gyp rebuild
npm ERR! gyp ERR! find Python
npm ERR! gyp ERR! find Python Python is not set from command line or npm configuration
it seems re2 is failing and it is trying to build locally (the build worked before).
for info... seems the issue above with re2 and node alpine started with 1.20.1 - see https://github.com/uhop/node-re2/issues/180
hope this can help https://github.com/microlinkhq/metascraper/pull/656 although your issue is related with installation time rather than execution time
Did anyone here find a workable solution to exclude re2 from the build? #656 only helps at runtime, so it doesn't fix the issue for compiled apps (next, webpack, esbuild, vite, etc). Is this patch the best option?
Unless I am missing something,
re2
is added as a dependency but it is not actually being used anywhere.