Closed ThinkDigitalSoftware closed 6 years ago
Hum... That's unfortunate, but HackerNews just updated its headers to include a Content-Security-Policy, thus forbidding arbitrary script execution. You'll have to use a browser extension that bypasses those headers, and I should probably find another site as example in my docs :)
No worries. I figured as much. Thanks for the response. Where can I ask for help with using artoo that's unrelated to this issue?
Well here seems to be a good place to do so :)
Awesome. I'm trying to select items by class name under a certain tag, but I can't find a way to do so with artoo. The best I get is selecting all the p tags, but that gives me more results I can't use than results I can. Let me post an example.
On Wed, Apr 4, 2018, 8:48 AM Guillaume Plique notifications@github.com wrote:
Well here seems to be a good place to do so :)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/medialab/artoo/issues/280#issuecomment-378649066, or mute the thread https://github.com/notifications/unsubscribe-auth/AV-HfcgYj4KCbiSO1lZk_u8MnkNHhXoOks5tlOtKgaJpZM4TGKsC .
To select items by tag + class, here is what you need to write in CSS:
tagname.class
So, using artoo, you'd probably do something of the kind:
artoo.scrape('tagname.class', ...);
oh! OK, I was putting a space... OK, thank you. It would also help you if you could continue using your site as the example so we can stay on the page while we work the tutorial :)
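For anyone else who trips on the space: in CSS, `tagname.class` means "a tagname element that has this class", while `tagname .class` means "any element with this class nested inside a tagname element". Below is a minimal sketch of that difference. It runs on a toy tree of plain objects rather than a real DOM, and `node`, `matchesCompound`, `descendants` and `select` are hypothetical helpers written for this illustration; they are not part of artoo or jQuery and only understand tags, classes and the space combinator.

```javascript
// Toy stand-in for a DOM element: a tag name, a list of classes, and children.
function node(tag, classes, children) {
  return { tag: tag, classes: classes || [], children: children || [] };
}

// True when the element satisfies a compound selector such as 'p.intro' or '.intro'.
function matchesCompound(el, compound) {
  const parts = compound.split('.');
  const tag = parts[0]; // empty string when the selector starts with a dot
  const classes = parts.slice(1);
  if (tag && tag !== el.tag) return false;
  return classes.every(function (c) { return el.classes.includes(c); });
}

// All descendants of root (root itself excluded), in document order.
function descendants(root) {
  let out = [];
  root.children.forEach(function (child) {
    out.push(child);
    out = out.concat(descendants(child));
  });
  return out;
}

// Handles only compound selectors separated by spaces (the descendant combinator).
function select(root, selector) {
  let current = [root];
  selector.trim().split(/\s+/).forEach(function (step) {
    const next = [];
    current.forEach(function (context) {
      descendants(context).forEach(function (el) {
        if (matchesCompound(el, step) && !next.includes(el)) next.push(el);
      });
    });
    current = next;
  });
  return current;
}

const page = node('body', [], [
  node('p', ['intro']),
  node('p', [], [node('span', ['intro'])]),
  node('div', ['intro'])
]);

// No space: the <p> that itself carries the class.
const noSpace = select(page, 'p.intro').map(function (el) { return el.tag; });
// With a space: elements carrying the class somewhere inside a <p>.
const withSpace = select(page, 'p .intro').map(function (el) { return el.tag; });

console.log(noSpace, withSpace);
```

In a real page the browser does this matching for you; the sketch is only meant to show why the two selectors return different elements.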
OK, so on this page, I'm running
artoo.scrape('li.card-btn square ', { text: {sel: 'span', method: 'text'}, url: {sel: 'a', attr: 'href'} });
and I'm getting an empty array. I isolated the element that's on the page and pasted it on this pastebin service.
https://dpaste.de/Oz5n
What I want to pull out from the page are results that look like this:
{
name: 'Yelena M Stepanenko',
address: 'Spc 157'
}
What am I doing wrong? I also realize that the selector is wrong. I haven't gotten to that part yet, as I have no CSS background. I'm more of a desktop programmer, so it's a little slower for me to figure this out. Thanks for your patience.
selector should be li.card-btn.square
since you attempt to match two classes.
Could you type it out? I honestly don't understand it. I just need to get it to match once and I can figure the rest out.
artoo.scrape('li.card-btn.square', { text: {sel: 'span', method: 'text'}, url: {sel: 'a', attr: 'href'} });
Oh, you're saying the card-button square is listed as two classes in the html? That's because of the space that's in the class name?
Yes. You have several classes listed in your example. You should probably do a quick HTML/CSS tutorial before scraping. It will definitely help you achieve your goals. Scraping is basically HTML/CSS reverse engineering.
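To make the point about the space concrete: an HTML `class` attribute is a space-separated list of class names, so `class="card-btn square"` carries two classes, and a selector that requires both chains them with dots. Here is a small sketch of that conversion. `selectorFromClassAttr` is a hypothetical helper written for this thread, not an artoo function, and it assumes plain class names that need no CSS escaping.

```javascript
// Turn a tag name and the raw value of a class attribute into a compound selector.
// Assumption: class names contain no characters that would need CSS escaping.
function selectorFromClassAttr(tag, classAttr) {
  const classes = classAttr.trim().split(/\s+/).filter(Boolean);
  return tag + classes.map(function (c) { return '.' + c; }).join('');
}

const selector = selectorFromClassAttr('li', 'card-btn square');
console.log(selector);
```

The resulting string is what you would pass as the first argument to `artoo.scrape`, as in the call posted above.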
You're amazing, thank you. I'll do more research on this
Just going to close this. Researching jQuery and CSS taught me a lot about selectors!
I should probably find another site as example in my docs
Please do @Yomguithereal -- I need a working example as a springboard to jump further. thx.
How about echojs.com?
Yeah, super.
While you are at it changing the scraping code, please throw in some comments as well, as you helped me before:
artoo.ajaxSpider(
  // This function is an iterator.
  // Its aim is to return the next url to fetch or false if you want to stop
  //-- 'i' is the index in the iteration of urls
  //-- '$data' is the jQuery-parsed data of the last fetched url
  function(i, $data) {
    // nextUrl is a function that takes a jQuery selector and returns
    // the next url to fetch
    // If !i then, we are only starting the spider meaning that the next url
    // is available on the current page rather than the last fetched one.
    return nextUrl(!i ? artoo.$(document) : $data);
  },
  // Spider's settings
  {
    // We want to fetch a maximum of two pages
    limit: 2,
    // We are going to scrape the pages using the scrape definition written above in the doc example
    scrape: scraper,
    // We want to concatenate results so we have [title1, title2, title3, title4]
    // rather than [[title1, title2], [title3, title4]]
    concat: true,
    // Final callback fired when the spider retrieved everything
    //-- 'data' is the scraped data
    done: function(data) {
      artoo.log.debug('Finished retrieving data. Downloading...');
      artoo.savePrettyJson(
        frontpage.concat(data),
        {filename: 'hacker_news.json'}
      );
    }
  }
);
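For readers wondering what the `concat: true` comment means in practice, here is a plain JavaScript sketch of the difference between the two result shapes. It only illustrates the effect described in the comment and is not artoo's internal implementation; `concatPages` and the sample titles are made up for the example.

```javascript
// One array of scraped items per fetched page, i.e. the shape without concatenation.
const perPage = [['title1', 'title2'], ['title3', 'title4']];

// Flatten one level so every item ends up in a single list.
function concatPages(pages) {
  return pages.reduce(function (all, page) { return all.concat(page); }, []);
}

const flat = concatPages(perPage);
console.log(flat);
```

The flat shape is what makes `frontpage.concat(data)` in the `done` callback produce a single list of items.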
thx
This error occurs when trying to follow the initial tutorial and clicking the artoo bookmark:
VM4037:1 Refused to load the script 'https://medialab.github.io/artoo/public/dist/artoo-latest.min.js' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/".
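The message says the page only accepts scripts from the sources listed in its `script-src` directive, and the artoo URL is not among them. The sketch below is a rough way to read such a directive. It is a deliberately simplified check (keyword handling plus plain URL-prefix matching), not the full matching algorithm browsers apply, and `parseDirective`, `isScriptAllowed` and the `pageOrigin` value are assumptions made for the illustration.

```javascript
// Split a single CSP directive into its name and its list of allowed sources.
function parseDirective(directive) {
  const tokens = directive.trim().split(/\s+/);
  return { name: tokens[0], sources: tokens.slice(1) };
}

// Simplified check: does any source in the directive cover this external script URL?
function isScriptAllowed(scriptUrl, pageOrigin, directive) {
  return parseDirective(directive).sources.some(function (source) {
    // 'self' only covers scripts served from the page's own origin.
    if (source === "'self'") return new URL(scriptUrl).origin === pageOrigin;
    // Other quoted keywords (e.g. 'unsafe-inline') never allow an external URL.
    if (source.startsWith("'")) return false;
    // Assumption: treat URL sources as simple prefixes.
    return scriptUrl.startsWith(source);
  });
}

const directive = "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ " +
  "https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/";
const pageOrigin = 'https://news.ycombinator.com'; // assumed origin of the HackerNews page

const artooAllowed = isScriptAllowed(
  'https://medialab.github.io/artoo/public/dist/artoo-latest.min.js',
  pageOrigin,
  directive
);
console.log(artooAllowed);
```

Since the bookmarklet works by loading that script into the page, a policy like this blocks it before artoo can run, which is why the tutorial fails on this site.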