omrilotan / isbot

🤖/👨‍🦰 Detect bots/crawlers/spiders using the user agent string
https://isbot.js.org/
The Unlicense
921 stars 77 forks source link

"Invalid regular expression" on older ES versions (lack of negative lookbehind support) #236

Closed jlowcs closed 7 months ago

jlowcs commented 9 months ago

Steps to reproduce

install es-check

create a file with the regex from the release:

var regex = / daum[ /]| deusu\/| yadirectfetcher|(?:^| )site|(?:^|[^g])news|(?<! (?:channel\/|google\/))google(?!(app|\/google| pixel))|(?<! cu)bot(?:[^\w]|_|$)|(?<! ya(?:yandex)?)search|(?<!(?:lib))http|(?<![hg]m)score|@[a-z]|\(at\)[a-z]|\(github\.com\/|\[at\][a-z]|^12345|^<|^[\w \.\-\(?:\):]+(?:\/v?\d+(\.\d+)?(?:\.\d{1,10})?)?(?:,|$)|^[^ ]{50,}$|^active|^ad muncher|^amaya|^anglesharp\/|^avsdevicesdk\/|^bidtellect\/|^biglotron|^bot|^btwebclient\/|^clamav[ /]|^client\/|^cobweb\/|^coccoc|^custom|^ddg[_-]android|^discourse|^dispatch\/\d|^downcast\/|^duckduckgo|^facebook|^fdm[ /]\d|^getright\/|^gozilla\/|^hatena|^hobbit|^hotzonu|^hwcdn\/|^jeode\/|^jetty\/|^jigsaw|^linkdex|^metauri|^microsoft bits|^movabletype|^mozilla\/\d\.\d \(compatible;?\)$|^mozilla\/\d\.\d \w*$|^navermailapp|^netsurf|^offline explorer|^php|^postman|^postrank|^python|^rank|^read|^reed|^rest|^snapchat|^space bison|^svn|^swcd |^taringa|^thumbor\/|^tumblr\/|^user-agent:|^valid|^venus\/fedoraplanet|^w3c|^webbandit\/|^webcopier|^wget|^whatsapp|^xenu link sleuth|^yahoo|^yandex|^zdm\/\d|^zoom marketplace\/|^{{.*}}$|adbeat\.com|appinsights|archive|ask jeeves\/teoma|bit\.ly\/|bluecoat drtr|browsex|burpcollaborator|capture|catch|check|chrome-lighthouse|chromeframe|classifier|cloud|crawl|cryptoapi|dareboost|datanyze|dataprovider|dejaclick|dmbrowser|download|evc-batch\/|feed|firephp|freesafeip|gomezagent|headless|httrack|hubspot marketing grader|hydra|ibisbrowser|images|inspect|iplabel|ips-agent|java(?!;)|library|mail\.ru\/|manager|monitor|neustar wpm|nutch|offbyone|optimize|pageburst|pagespeed|parser|perl|phantom|pingdom|powermarks|preview|proxy|ptst[ /]\d|reader|reputation|resolver|retriever|rexx;|rigor|robot|rss|scan|scrape|server|sogou|sparkler\/|speedcurve|spider|splash|statuscake|stumbleupon\.com|supercleaner|synapse|synthetic|torrent|tracemyfile|transcoder|twingly recon|url|virtuoso|wappalyzer|webglance|webkit2png|whatcms\/|wordpress|zgrab/i;

run es-check es2017 on that file

Expected behaviour

It should pass the check, or the README should explicitely say that older browsers are not supported.

Actual behaviour

It does not pass the check

omrilotan commented 9 months ago

Thank you for the Issue. I will list supported items for version 4 and review using es-check. It should match supported devices for regular expression "lookbehind" feature. Isbot version 3 supports engines that does not support this feature.

jlowcs commented 9 months ago

Do you plan on maintaining v3 with new/updated bot user agents?

omrilotan commented 9 months ago

That's a very good question. The non-lookbehind patterns include some false-positives. For example, a lot of android apps include the webview substring "Channel/googleplay", where the simplistic "google" substring pattern will recognise as a bot. That's why I think supporting version 3 pattern updates is not a priority.

Regular expression lookbehind is now widely supported.

Source: https://caniuse.com/?search=Lookbehind

Instead - I might create a shorter, more simplistic and more optimistic option for engines that do not support lookbehind as a fallback. This could mitigate your case as well.

For example:

let pattern;
try {
 pattern = new Regexp("(?<! cu)bot(?:[^\\w]|_|$)", "i");
} catch (error) {
 pattern = /bot/i;
}

What is your opinion about this feature?

jlowcs commented 9 months ago

That's a very good question. The non-lookbehind patterns include some false-positives. For example, a lot of android apps include the webview substring "Channel/googleplay", where the simplistic "google" substring pattern will recognise as a bot. That's why I think supporting version 3 pattern updates is not a priority.

Regular expression lookbehind is now widely supported.

True, although Safari only started supporting it in 2023...

A quarter of our users seem to be on older browser versions (ie over 4 versions old), for which we serve a less optimized bundle. As we're providing a platform for social change, we want to be mindful of the fact that not everyone have access to modern browsers, but we still want them to be able to use the platform.

Instead - I might create a shorter, more simplistic and more optimistic option for engines that do not support lookbehind as a fallback. This could mitigate your case as well.

For example:

let pattern;
try {
 pattern = new Regexp("(?<! cu)bot(?:[^\\w]|_|$)", "i");
} catch (error) {
 pattern = /bot/i;
}

What is your opinion about this feature?

I like this idea!

omrilotan commented 9 months ago

https://github.com/omrilotan/isbot/pull/237

This changeset adds a fallback for engines that do not support lookbehind in regular expressions. This fallback is less accurate but catches ~78% of bots with less than 1% false-positive. It is also exposed as "isbotNaive" export.

Happy if you would take a look

omrilotan commented 9 months ago

The only regretful side effect is the pattern can not be tree-shaken for users who don't use the "isbot" named export.

jlowcs commented 9 months ago

The only regretful side effect is the pattern can not be tree-shaken for users who don't use the "isbot" named export.

Would using a IIFE fix that?

export const pattern = (() => {
  try {
    return new RegExp(expression, "i");
  } catch (error) {
    return naiveExpression;
  }
})();
omrilotan commented 9 months ago

No, because iife run anyways it will be included in all output versions. I'm not very worried about this, most of usage is for "isbot" function anyway.

jlowcs commented 9 months ago

No, because iife run anyways it will be included in all output versions. I'm not very worried about this, most of usage is for "isbot" function anyway.

I think it depends on whether you set "sideEffects": false in your package.json, but I might be wrong.

https://webpack.js.org/guides/tree-shaking/

omrilotan commented 9 months ago

No, because iife run anyways it will be included in all output versions. I'm not very worried about this, most of usage is for "isbot" function anyway.

I think it depends on whether you set "sideEffects": false in your package.json, but I might be wrong.

https://webpack.js.org/guides/tree-shaking/

It's about what runs in-scope and what has to run on "module load". I was able to separate the two exports by building both the string and the regex as individual exports.

omrilotan commented 9 months ago

Version 4.4.0 should be okay when importing isbot.

import { isbot } from  "isbot"

I greatly appreciate your involvement in this issue. It helped me reach a better outcome.

akselinurmio commented 8 months ago

Thank you for fixing this so quickly, this makes it possible for us to upgrade to v4 again. :)

bfaulk96 commented 8 months ago

@omrilotan It appears this is still an issue for Safari versions that don't support lookbehind (In my case, Safari 15.5).

omrilotan commented 7 months ago

@bfaulk96 appologies for the delay. Proposed resolution: #243

bfaulk96 commented 7 months ago

@bfaulk96 appologies for the delay. Proposed resolution: #243

No worries! Thanks for the quick turnaround, though! #243 looks like it would do the trick!

omrilotan commented 7 months ago

Resolution of isbot@5 tested on Safari 15.6

bfaulk96 commented 7 months ago

Can confirm, isbot@5 is working now for my Safari 15.5 use-case. Thanks again for the quick turnaround on that!