omrilotan / isbot

🤖/👨‍🦰 Detect bots/crawlers/spiders using the user agent string
https://isbot.js.org/
The Unlicense
905 stars 74 forks source link

Pattern not being removed when calling "exclude" #166

Closed jannone closed 2 years ago

jannone commented 2 years ago

Steps to reproduce

const isBot = require('isbot');

const userAgent = 'Mozilla/5.0 (Linux; Android 10; SNE-LX1 Build/HUAWEISNE-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/94.0.4606.71 Mobile Safari/537.36 hap/1079/huawei com.huawei.fastapp/11.4.1.310 com.bla.quickapp/4.0.17 ({"packageName":"quickSearch","type":"other","extra":"{}"})';

isBot.exclude(['search']);

console.log(isBot(userAgent));
console.log(isBot.find(userAgent));

Expected behaviour

search should have been removed from the pattern list, and the example userAgent should have not be identified as a bot.

Actual behaviour

search is still kept in the pattern list, and the userAgent is still identified as a bot.

omrilotan commented 2 years ago

You're right. The term "search" has an amendment for Yandex app browser. On supported platforms, it will be replaced with (?<! (ya|yandex))search. This is why the following combination is optional for you

isBot.exclude(['search', '(?<! (ya|yandex))search']);
omrilotan commented 2 years ago

I think I have a good solution for this in #167. Care to take a look?

jannone commented 2 years ago

I'm unsure... In my scenario, do I need to exclude both expressions? If so, #167 might not solve it... since it only returns the first expression that is found, rather than an array of expressions

omrilotan commented 2 years ago

In your scenario, you'll need to exclude only one because it gets replaced. Still, it could happen that you'll need to exclude more than one. I'm going to have another attempt tomorrow, I think I'll add a feature that clears all rules that would catch a user agent string.

omrilotan commented 2 years ago

Updated the pull request to include new functionality

https://github.com/omrilotan/isbot/pull/167

Add "matches" and "clear" methods used to clear patterns matching a user agent string.

This will be a suitable solution for such cases. The example below includes extra steps to display the journey.

For example:

const ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0 SearchRobot/1.0'

sbot(ua) // true
isbot.find(ua) // 'Search'
isbot.matches(ua) // [ '(?<! cu)bot', '(?<! (ya|yandex))search' ]
isbot.clear(ua)
isbot(ua) // false
isbot.find(ua) // null
jannone commented 2 years ago

The PR would help, but may not solve the issue. I'll need to exclude and extend such that a "quickSearch" string won't be picked up. When I do this:

const isBot = require('isbot');

const userAgent = 'Mozilla/5.0 (Linux; Android 10; SNE-LX1 Build/HUAWEISNE-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/94.0.4606.71 Mobile Safari/537.36 hap/1079/huawei com.huawei.fastapp/11.4.1.310 com.bla.quickapp/4.0.17 ({"packageName":"quickSearch","type":"other","extra":"{}"})';

isBot.exclude(['search', '(?<! (ya|yandex))search']);
isBot.extend(['(?<! (ya|yandex|quick))search'])

console.log(isBot(userAgent));
console.log(isBot.find(userAgent));

It still returns true. Am I missing something?

omrilotan commented 2 years ago

Because quickSearch doesn't have a white space before it like in the pattern. This should do the trick:

isBot.extend([ '(?<!( ya| yandex|quick))search' ])
jannone commented 2 years ago

Good catch! I hadn't seen the white space. It works now

omrilotan commented 2 years ago

I just want to recommend, that if you rely on the internal patterns of this list, lock your application on a patch version. The contents of the list may change over patch versions.

jannone commented 2 years ago

Yeah, good call. The team working on this code will want to upgrade package versions. I'll probably end up adding an allow-list on the outside, to guarantee that certain user-agents are always "not a bot", and thus bypass the isbot check