medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License

news.ycombinator.com Refuses to load the script #280

Closed ThinkDigitalSoftware closed 6 years ago

ThinkDigitalSoftware commented 6 years ago

This error occurs when following the initial tutorial and clicking the artoo bookmark:

VM4037:1 Refused to load the script 'https://medialab.github.io/artoo/public/dist/artoo-latest.min.js' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/".

Yomguithereal commented 6 years ago

Hum... That's unfortunate but HackerNews just updated its header to include Content-Security-Policy thus forbidding arbitrary script execution. You'll have to use a browser extension bypassing those headers and I should probably find another site as example in my docs :)

ThinkDigitalSoftware commented 6 years ago

No worries. I figured as much. Thanks for the response. Where can I ask for help with using artoo that's unrelated to this issue?

Yomguithereal commented 6 years ago

Well here seems to be a good place to do so :)

ThinkDigitalSoftware commented 6 years ago

Awesome. I'm trying to select items by class name under a certain tag, but I can't find a way to do so with artoo. The best I get is selecting all the p tags, but that gives me more results I can't use than results I can. Let me post an example.

Yomguithereal commented 6 years ago

To select items by tag + class, here is what you need to write in CSS:

tagname.class

So, using artoo, you'd probably do something of the kind:

artoo.scrape('tagname.class', ...);

ThinkDigitalSoftware commented 6 years ago

Oh! OK, I was putting a space... OK, thank you. It would also help if you could continue using your site as the example so we can stay on the page while we work through the tutorial :)

ThinkDigitalSoftware commented 6 years ago

OK, so on this page, I'm running

artoo.scrape('li.card-btn square ', {
  text: {sel: 'span', method: 'text'},
  url: {sel: 'a', attr: 'href'}
});

and I'm getting an empty array. I isolated the element that's on the page and pasted it on this pastebin service: https://dpaste.de/Oz5n. What I want to pull out from the page are results that look like this:

{
    name: 'Yelena M Stepanenko',
    address: 'Spc 157'
}

What am I doing wrong? I also realize that the selector is wrong; I haven't gotten to that part yet. I have no CSS background. I'm more of a desktop programmer, so it's a little slower for me to figure this out. Thanks for your patience.

Yomguithereal commented 6 years ago

The selector should be li.card-btn.square, since you are trying to match two classes.

ThinkDigitalSoftware commented 6 years ago

Could you type it out? I honestly don't understand it. I just need to get it to match once and I can figure the rest out.

Yomguithereal commented 6 years ago

artoo.scrape('li.card-btn.square', {
  text: {sel: 'span', method: 'text'},
  url: {sel: 'a', attr: 'href'}
});

ThinkDigitalSoftware commented 6 years ago

Oh, you're saying that card-btn square is listed as two classes in the HTML? Is that because of the space in the class name?

Yomguithereal commented 6 years ago

Yes. You have several classes listed in your example. You should probably do a quick HTML/CSS tutorial before scraping. It will definitely help you achieve your goals. Scraping is basically HTML/CSS reverse engineering.
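
To make the point about several classes concrete, here is a minimal sketch in plain JavaScript. The matchesCompound helper and the item object are hypothetical stand-ins written for illustration; they are not part of artoo or jQuery, and a real browser does this matching internally. The sketch only shows that a class attribute such as "card-btn square" is a space-separated list of two classes, and that a selector such as li.card-btn.square requires the tag and every listed class to be present. With a space, as in li.card-btn square, CSS instead looks for a <square> element nested inside li.card-btn, which is why the original attempt returned an empty array.

```javascript
// Hypothetical helper (not an artoo or jQuery function): checks a compound
// selector such as 'li.card-btn.square' against a plain object that stands
// in for a DOM element.
function matchesCompound(element, selector) {
  // 'li.card-btn.square' -> tag 'li', classes ['card-btn', 'square']
  const [tag, ...classes] = selector.split('.');

  // The class attribute is a whitespace-separated list of class names.
  const elementClasses = element.className.split(/\s+/).filter(Boolean);

  const tagMatches = tag === '' || tag === element.tagName.toLowerCase();
  return tagMatches && classes.every(c => elementClasses.includes(c));
}

// Stand-in for <li class="card-btn square">
const item = {tagName: 'LI', className: 'card-btn square'};

console.log(matchesCompound(item, 'li.card-btn.square')); // true
console.log(matchesCompound(item, 'li.card-btn'));        // true
console.log(matchesCompound(item, 'li.card-btn.round'));  // false
```

In the browser, the equivalent check on a real element would be element.matches('li.card-btn.square').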

ThinkDigitalSoftware commented 6 years ago

You're amazing, thank you. I'll do more research on this

ThinkDigitalSoftware commented 6 years ago

Just going to close this. Researching jQuery and CSS taught me a lot about selectors!

suntong commented 5 years ago

"I should probably find another site as example in my docs"

Please do, @Yomguithereal. I need a working example as a springboard to jump further. Thanks.

Yomguithereal commented 5 years ago

How about echojs.com?

suntong commented 5 years ago

Yeah, super.

While you are at it changing the scraping code, please throw in some comments as well, as you helped me before:

artoo.ajaxSpider(

  // This function is an iterator.
  // Its aim is to return the next url to fetch or false if you want to stop
  //-- 'i' is the index in the iteration of urls
  //-- '$data' is the jQuery-parsed data of the last fetched url
  function(i, $data) {

    // nextUrl is a function that takes a jQuery selector and returns
    // the next url to fetch

    // If !i then, we are only starting the spider meaning that the next url
    // is available on the current page rather than the last fetched one.
    return nextUrl(!i ? artoo.$(document) : $data);
  },

  // Spider's settings
  {

    // We want to fetch a maximum of two pages
    limit: 2,

    // We are going to scrape the pages using the scrape definition written above in the doc example
    scrape: scraper,

    // We want to concatenate results so we have [title1, title2, title3, title4]
    // rather than [[title1, title2], [title3, title4]]
    concat: true,

    // Final callback fired when the spider retrieved everything
    //-- 'data' is the scraped data
    done: function(data) {
      artoo.log.debug('Finished retrieving data. Downloading...');
      artoo.savePrettyJson(
        frontpage.concat(data),
        {filename: 'hacker_news.json'}
      );
    }
  }
);

thx
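
The effect of the concat setting described in the comments above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not artoo's implementation: collect, perPage and frontpage are hypothetical names, and the titles are placeholder strings taken from the comment in the snippet.

```javascript
// Hypothetical stand-in for what a spider gathers: one result array per page.
const perPage = [['title1', 'title2'], ['title3', 'title4']];

// Illustrative helper (not an artoo function): returns the per-page arrays
// as they are, or merged into a single flat array when concat is true.
function collect(pages, concat) {
  return concat ? [].concat(...pages) : pages;
}

const nested = collect(perPage, false);
const flat = collect(perPage, true);

console.log(JSON.stringify(nested)); // [["title1","title2"],["title3","title4"]]
console.log(JSON.stringify(flat));   // ["title1","title2","title3","title4"]

// In the done callback above, the flat result is appended to results already
// scraped from the current page, here a placeholder array.
const frontpage = ['title0'];
console.log(JSON.stringify(frontpage.concat(flat)));
// ["title0","title1","title2","title3","title4"]
```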