medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License

artoo.ajaxSpider on dynamic data #285

Closed by chrisvasey 5 years ago

chrisvasey commented 5 years ago

Hi there!

I have been using the artoo.waitFor helper to wait for dynamic content on a single page. The issue I have hit is that with the ajaxSpider method I cannot get artoo to work with waitFor, as ajaxSpider just returns the HTML.

Is it possible for me to crawl the dynamic content, or am I misunderstanding?

chrisvasey commented 5 years ago

The content I am trying to scrape is set by JS, so pulling the DOM does not help, sadly.

Yomguithereal commented 5 years ago

Yes, dynamic content is the limit here. If what you retrieve is HTML, you could try injecting it into an iframe or something, but this is a long shot. It seems you need browser emulation at this point, or a Chrome/Firefox extension taking control of your browser.
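
A minimal sketch of the iframe idea, under a few assumptions: the target page is on the same origin as the page running the snippet, it is allowed to be framed, and its scripts still render while the iframe is hidden. The names collectText and scrapeRendered are hypothetical helpers for illustration, not part of artoo.

```javascript
// Sketch only: load a same-origin page in a hidden iframe so its scripts run,
// then poll the iframe's document until the target elements appear.
// `collectText` and `scrapeRendered` are hypothetical helpers, not artoo API.

// Return the trimmed text of every match, or null if nothing matches yet.
function collectText(doc, selector) {
  const nodes = Array.from(doc.querySelectorAll(selector));
  if (nodes.length === 0) return null;
  return nodes.map((node) => node.textContent.trim());
}

function scrapeRendered(url, selector, { interval = 100, timeout = 10000 } = {}) {
  return new Promise((resolve, reject) => {
    const iframe = document.createElement('iframe');
    iframe.style.display = 'none';

    let timer = null;
    // Stop polling, remove the iframe, then settle the promise.
    const finish = (settle, value) => {
      clearInterval(timer);
      iframe.remove();
      settle(value);
    };

    iframe.addEventListener('load', () => {
      const started = Date.now();
      timer = setInterval(() => {
        let result;
        try {
          // contentDocument is null for a cross-origin page, so this throws.
          result = collectText(iframe.contentDocument, selector);
        } catch (err) {
          finish(reject, err);
          return;
        }
        if (result) {
          finish(resolve, result);
        } else if (Date.now() - started > timeout) {
          finish(reject, new Error('Timed out waiting for ' + selector));
        }
      }, interval);
    });

    iframe.src = url;
    document.body.appendChild(iframe);
  });
}
```

Calling scrapeRendered with a URL and a selector resolves to the text of the matching elements once the framed page has rendered them, and rejects on timeout or when the frame's document cannot be read.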

chrisvasey commented 5 years ago

Thank you, I suspected as much but wondered if anyone had ever achieved it within the library.

Sadly the client I am using needs to fetch some data, but I can’t install any dependencies, only inject JS, so other methods of scraping in Python/Node aren’t applicable.

Thank you for your input!

Yomguithereal commented 5 years ago

> Sadly the client I am using needs to fetch some data

What is the format of this data? JSON? HTML? Have you tried reverse-engineering the API?

chrisvasey commented 5 years ago

Sorry, I could have been clearer: I need to fetch data from another page on the website using only injected JS snippets.

Up till now this has worked fine with jQuery .load() and Ajax calls, but now I am trying to do the same for data that is rendered dynamically with JS, so these tools do not return the loaded values.

It may be possible to capture the calls the page itself is making and pull the info from those, but I would need to do this in a lot of places, which is why scraping the exact values would have been ideal!
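
For illustration, a rough sketch of calling the page's own data endpoint directly, assuming it returns JSON and is reachable from an injected snippet on the same origin. The payload shape (an items array with name and price fields) and the helper names pickFields and fetchItems are invented for the example; the real endpoint would have to be read from the browser's network panel.

```javascript
// Sketch only: request the JSON endpoint that the page itself calls.
// The payload shape (`items`, `name`, `price`) is made up for illustration.

// Pure step: pick the wanted fields out of a parsed payload.
function pickFields(payload) {
  return (payload.items || []).map((item) => ({
    name: item.name,
    price: item.price,
  }));
}

// `fetchImpl` defaults to the browser's fetch; it is a parameter only so the
// function can be exercised with a stub.
async function fetchItems(endpoint, fetchImpl = fetch) {
  const response = await fetchImpl(endpoint, { credentials: 'same-origin' });
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  return pickFields(await response.json());
}
```

The drawback raised above still applies: each page needs its own endpoint and its own version of pickFields.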

santteegt commented 5 years ago

Hi @chrisvasey,

Have you managed to fetch dynamic data using artoo?

chrisvasey commented 5 years ago

Hi @santteegt, with the current setup I am not sure it is possible in JS. I ended up using Python for this task, which is a shame because I could not do it straight in the browser.

I will close this thread as I didn't realise it was still open.

abduraufsherkulov commented 5 years ago

I was only able to get it working with a recursive function and setInterval. If you use waitFor or loop, it doesn't work due to the fact that JS is an asynchronous language.

Yomguithereal commented 5 years ago

> If you use waitFor or loop, it doesn't work due to the fact that JS is an asynchronous language.

What do you mean? waitFor actually uses setInterval. A loop indeed won't work, because it will block the main thread.
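
To make the difference concrete, here is a minimal sketch of a polling helper built on setInterval. It is written in the spirit of waitFor but is not artoo's implementation; the name pollUntil and its default interval and timeout are assumptions chosen for the example.

```javascript
// Sketch only: poll a condition on a timer instead of in a blocking loop.
// `pollUntil` is a hypothetical name, not part of artoo.
function pollUntil(check, { interval = 30, timeout = 5000 } = {}) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const timer = setInterval(() => {
      let value;
      try {
        value = check();
      } catch (err) {
        clearInterval(timer);
        reject(err);
        return;
      }
      if (value) {
        // Condition met: stop polling and hand back the truthy value.
        clearInterval(timer);
        resolve(value);
      } else if (Date.now() - started >= timeout) {
        clearInterval(timer);
        reject(new Error('pollUntil timed out'));
      }
    }, interval);
  });
}

// By contrast, a busy loop never yields to the event loop, so the page's own
// scripts cannot run and the condition can never become true:
//   while (!document.querySelector('.result')) {}
```

Between two ticks of the timer the browser is free to run the page's scripts, which is what lets the awaited content actually appear.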