medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.
http://medialab.github.io/sandcrawler/
GNU Lesser General Public License v3.0

Status? #192

Open brandondrew opened 8 years ago

brandondrew commented 8 years ago

Has this project been abandoned?

It looks very promising, other than the (apparent) lack of progress recently.

Yomguithereal commented 8 years ago

Hello @brandondrew. The project is not abandoned. A lot has changed since I started, and I plan to reboot both artoo & sandcrawler soon, though not before October at the earliest, and only if my projects need some crawling/scraping. But I am confident they will.

brandondrew commented 8 years ago

Very good news—thanks for the update!

Though I've only played around with artoo so far, the combination of artoo and sandcrawler appears to be the best option for crawling and scraping data off of the web. It's absolutely brilliant to have an in-browser option that complements a server-side option, sort of giving a REPL for scraping.

Do you expect the reboot to make significant changes to the API? (I'm toying around with an idea for a project that could rely heavily on sandcrawler.)

Yomguithereal commented 8 years ago

The reboot will indeed probably make significant changes to the API, but it should not stray too far from the existing concepts.

Yomguithereal commented 8 years ago

I also aim to try alternatives to PhantomJS by clearly separating sandcrawler's engines (static & phantom so far), so I can experiment with headless Electron and Chromium. PhantomJS' quirks have really been bugging me lately (notably unavoidable crashes & memory leaks).

boogheta commented 8 years ago

@Yomguithereal In case it helps, manet does a good job of offering a service that relies on both PhantomJS and SlimerJS: https://github.com/vbauer/manet

Yomguithereal commented 8 years ago

This won't address the memory leak problems, however: a system that restarts both PhantomJS and SlimerJS is needed anyway to free the memory :(

boogheta commented 8 years ago

I was only pointing it out as an example of an abstraction over both engines :)

Schaemelhout commented 7 years ago

Any update on this?

Yomguithereal commented 7 years ago

Hello @Schaemelhout. Things will probably evolve by the end of the year. I'm sorry but I cannot be more precise. If you need specific bug fixes however, I can probably work it out.

Schaemelhout commented 7 years ago

Hi @Yomguithereal, I was just wondering how mature and solid this project is. I'm looking for a library to help me in my scraping adventures.

The ones I have in mind are the following:

sandcrawler looks very promising, but it also looks somewhat abandoned, and I was wondering whether it is worth the effort of using it right now if it's going to get a complete overhaul by the end of the year?

Thanks anyway! The project looks great.

Yomguithereal commented 7 years ago

Of the list you present, sandcrawler is probably the best choice if you need to perform complex tasks and need to customize very precise things in order to achieve what you need. If what you need is fairly simple and you won't need to handle the dark insanities of the whole web, maybe this tool is a bit overkill.

Can you explain to me what you intend to do so I can help you better (if you can disclose it, of course)?

Schaemelhout commented 7 years ago

I can't go into too much detail, but the main thing I need is just a decent queueing system and preferably an IP/proxy rotation system. The content discovery and scraping I need to do is pretty straightforward.

abbasharoon commented 7 years ago

Hi,

I just checked the GitHub pages and the project looks really promising. I waited for a month, but I think you still don't have time. I gave it a try but was not able to get proper results. Could you give an expected date for the new version?

Thanks

Yomguithereal commented 7 years ago

I can't give you a date. But I can try to help you fix what fails for you.