medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License

Roadmap and compatibility with chrome extension #288

Open hubyhuby opened 4 years ago

hubyhuby commented 4 years ago

Hi Artoo, this project is really cool.

Unfortunately, with recent website security changes, the artoo library no longer seems to work on most popular websites. I tried the Chrome extension without luck, and I see that there have been no commits since 2018, except for a new gh-pages branch...

Hence my questions:

Yomguithereal commented 4 years ago

Hello @hubyhuby,

The chrome extension is indeed quite old now. But you can use other chrome extensions to basically do the same. The idea is just to shunt the Content Security Policy headers of the website, so something like this extension: https://chrome.google.com/webstore/detail/disable-content-security/ieelmcmcagommplceebfedjlakkhpden should work.
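As a rough illustration of what "shunting" the CSP headers can look like, here is a minimal sketch of a background script that drops those headers from responses. It is not the code of the extension linked above. It assumes a Manifest V2 extension with the `webRequest` and `webRequestBlocking` permissions plus matching host permissions, and `stripCspHeaders` is a hypothetical helper name.

```javascript
// Sketch only: remove CSP response headers so injected scripts are not blocked.
// Assumption: Manifest V2 extension declaring "webRequest", "webRequestBlocking"
// and host permissions such as "<all_urls>" in manifest.json.

const CSP_HEADERS = new Set([
  'content-security-policy',
  'content-security-policy-report-only'
]);

// Hypothetical helper: returns the header list without any CSP entries.
function stripCspHeaders(responseHeaders) {
  return (responseHeaders || []).filter(
    (header) => !CSP_HEADERS.has(header.name.toLowerCase())
  );
}

// Register the listener only when running inside an extension background page.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
  chrome.webRequest.onHeadersReceived.addListener(
    (details) => ({ responseHeaders: stripCspHeaders(details.responseHeaders) }),
    { urls: ['<all_urls>'], types: ['main_frame', 'sub_frame'] },
    ['blocking', 'responseHeaders']
  );
}
```

Keeping the filtering in a plain function makes it easy to check outside the browser. The blocking form of `webRequest` used here is a Manifest V2 mechanism, so a Manifest V3 extension would need a different approach.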

There is no precise roadmap. This tool has been working for years now and is still used by many. I occasionally fix some bugs and such but I have no precise features to add right now.

In the meantime, to handle related but more scale-oriented use cases, we started developing a CLI tool called minet, which takes some inspiration from artoo (specifically the scraping DSL) and may, in the future, inject artoo into headless Chrome contexts for complex tasks.

Yomguithereal commented 4 years ago

I've updated the online docs to help people shunt CSP headers and I kinda hid the extension's documentation because it really is irrelevant at this point.

hubyhuby commented 4 years ago

Thanks @Yomguithereal. I am developing a Chrome extension to scrape information while browsing the web. I easily managed to include jQuery in my extension without getting any security error messages, but I never succeeded in using artoo. I am unsure where to include artoo.js in my Chrome extension, or whether it is possible at all. I have tried with artoo-latest.min.js. I may give it another try later, but for now I guess jQuery is the way to scrape data...

I tried things like the following, with artoo downloaded from the website, but ran into trouble with the script trying to download jQuery:

{
    "manifest_version": 2,
    "name": " test 2",
    "description": "A simple page-scraping extension for Chrome",
    "version": "1.0",
    "author": "bob",
    "background": {
        "scripts": ["popup.js", "artoo-latest.min.js"],
        "persistent": true
    },
    "permissions": [
        "tabs",
        "http://*/",
        "https://*/"
    ],
    "browser_action": {
        "default_icon": "logo.png",
        "default_popup": "popup.html"
    },
    "content_scripts": [
        {
            "matches": ["http://*.youtube.com/*", "https://*.youtube.com/*"],
            "js": ["artoo-latest.min.js"]
        }
    ]
}