omrilotan / isbot

🤖/👨‍🦰 Detect bots/crawlers/spiders using the user agent string
https://isbot.js.org/
The Unlicense

npm script to update agents list #2

Closed stevenvachon closed 8 years ago

stevenvachon commented 8 years ago

using something like http://www.user-agents.org/

gorangajic commented 8 years ago

This module works by matching the user agent against a bot list. The user agent strings themselves are only used for testing purposes, so I'm not sure why we would need to do that.
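As a rough illustration of that matching approach (a minimal sketch, not the module's actual code): the pattern list below is a made-up, abbreviated stand-in for the real list, and `looksLikeBot` is a hypothetical name.

```javascript
// Hypothetical, abbreviated pattern list; the real bot list is much longer.
const botPatterns = ['bot', 'crawler', 'spider'];

// Combine all patterns into one case-insensitive regular expression.
const botRegex = new RegExp(botPatterns.join('|'), 'i');

// Returns true when the user agent string matches any bot pattern.
function looksLikeBot(userAgent) {
  return botRegex.test(userAgent);
}
```

The stored user agent strings play no part in the matching itself; they only serve as test inputs for patterns like these.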

stevenvachon commented 8 years ago

Bot lists change.

gorangajic commented 8 years ago

bot list updated #3

stevenvachon commented 8 years ago

#3 is good and all, but it doesn't provide any tools for updating the list in the future.

gorangajic commented 8 years ago

I am open to changes. If you want to provide that functionality, it would be great.

stevenvachon commented 8 years ago

It would undo some of the additions made in #3, though, as I would probably use only user-agents.org

timbowhite commented 8 years ago

Hey, heads up that user-agents.org is really outdated. I mentioned over in the bot-detector module issue (that module uses user-agents.org exclusively) that many of user-agents.org's UA strings don't seem to have been updated in 10 years.

Some other sources to consider:

For a script performing automatic updates, any sources would have to be trusted to be accurate both now and in the future (e.g. I spent yesterday picking through UA strings for #3 and found a decent number of inaccuracies in the sources). So I think a 100% automated npm update script puts too much trust in the sources. Perhaps a good middle ground would be to:

  1. Write a dev script to pull both browser and bot UA strings from various sources, and add new ones to browsers.txt and crawlers.txt respectively.
  2. Test whether the existing list.js regex strings still pass. If not, manually inspect the UA string test failures and add to or modify list.js so that it satisfies both the existing and the new UA strings. Might be able to script this somewhat as well.
  3. Commit and bump version.
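
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under assumptions: `mergeAgents` and `findFailures` are hypothetical helper names, the patterns are passed in as a plain array of regex source strings, and fetching from the sources and reading or writing browsers.txt and crawlers.txt are left out.

```javascript
// Hypothetical helpers; names and data shapes are illustrative only.

// Step 1: merge freshly fetched UA strings into an existing list,
// skipping blank lines and duplicates. Returns the merged list and
// the entries that were newly added.
function mergeAgents(existing, fetched) {
  const seen = new Set(existing);
  const added = [];
  for (const raw of fetched) {
    const ua = raw.trim();
    if (ua && !seen.has(ua)) {
      seen.add(ua);
      added.push(ua);
    }
  }
  return { merged: existing.concat(added), added };
}

// Step 2: report the UA strings that the patterns classify incorrectly,
// so they can be inspected by hand before the pattern list is changed.
function findFailures(patterns, crawlers, browsers) {
  const regex = new RegExp(patterns.join('|'), 'i');
  return {
    missedCrawlers: crawlers.filter((ua) => !regex.test(ua)),
    flaggedBrowsers: browsers.filter((ua) => regex.test(ua)),
  };
}
```

A dev script built on this would load the two text files, call `mergeAgents` on each with the newly pulled strings, and stop for manual review whenever `findFailures` returns a non-empty list, which keeps a human in the loop instead of trusting the sources blindly.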

If someone writes such a source-updating script, I'd be willing to take on step 2 above as a maintainer and run it weekly, because isbot is going to be an integral part of an open-source bot detection project I'm working on and will be releasing soon.

gorangajic commented 8 years ago

If someone writes a script that updates browsers.txt and crawlers.txt, we can use a CI service to run the tests automatically once a day.