medialab / sandcrawler-dashboard

A handy terminal dashboard plugin for sandcrawler.

Multiple spiders in one dashboard #15

Open hanbzu opened 9 years ago

hanbzu commented 9 years ago

Is it possible to use the same dashboard for two spiders?

My approach was to declare const myDashboard = dashboard() and then call .use(myDashboard) on each of the spiders, but it doesn't seem to work that way.

Thank you!

Yomguithereal commented 9 years ago

Hello @hanbzu, I must admit I did not design the dashboard to be used with multiple spiders at once. More than an implementation problem, there would be a design problem here. How do you think the dashboard should handle multiple spiders?

hanbzu commented 9 years ago

My use case is as follows:

I'm scraping multiple pages with lists on them. The items in these lists have some basic data fields and a link to a page where more data for a specific item is available. All I want to do is enrich the data I get from each of the lists with the additional data from each item. A fairly common use case, I believe.

To do so I've built two scrapers: one for the list page (list-scraper) and another for the page belonging to a specific item (item-scraper). Once the list-scraper gets data, it adds the URLs it found to the item-scraper.

From the animation in the sandcrawler-dashboard README and the way the scrapers are named with strings, I thought the dashboard was designed to be used by multiple spiders. Nevertheless, it's possible that my approach to this use case is not optimal. I'm just trying things out for now ;)

Yomguithereal commented 9 years ago

Is your scraper open-sourced somewhere?

RouxRC commented 9 years ago

Personally, when I need to do this, I add a meta field to the request calls that I can then check in the scraper or result callbacks to know which parsing to apply.
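The idea above can be sketched in plain JavaScript, independently of sandcrawler's actual API. Everything here is an assumption made for illustration: the `meta.kind` tag, the `parseList` and `parseItem` parsers, and the shape of the `job` and `page` objects are hypothetical and only show how a single spider could tell page types apart.

```javascript
// Hypothetical parsers, one per kind of page (not part of sandcrawler).
function parseList(page) {
  // A list page yields basic fields plus a link to each item's page.
  return page.items.map(item => ({name: item.name, url: item.url}));
}

function parseItem(page) {
  // An item page yields the additional data for one item.
  return {name: page.name, details: page.details};
}

// Lookup table from the meta tag to the parser that handles it.
const parsers = {list: parseList, item: parseItem};

// Each job is assumed to carry a `meta.kind` tag set when its URL was queued.
function handleResult(job, page) {
  const parser = parsers[job.meta.kind];
  if (!parser) throw new Error('Unknown page kind: ' + job.meta.kind);
  return parser(page);
}

// Turn the entries found on a list page into follow-up jobs tagged as items,
// so the same spider can process them later.
function toItemJobs(listResult) {
  return listResult.map(entry => ({url: entry.url, meta: {kind: 'item'}}));
}
```

With this shape, the result handler of a single spider would call `handleResult` for every job and queue whatever `toItemJobs` returns whenever the job was a list page.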

hanbzu commented 9 years ago

@Yomguithereal my spider saves train information from one national operator in Europe. It's not open-sourced for now, since I have not yet evaluated whether that would be a good idea. Nevertheless, the code is quite simple, and I can show snippets if you wish.

The question was rather conceptual, now that I see @RouxRC's answer. I thought Sandcrawler was designed with the idea of having multiple spider instances that do different tasks, so that we could hand each one the jobs related to the task it knows how to do.

Nevertheless, the spiders can be unified into a monolith that knows how to scrape different pages and tell them apart, along the lines of @RouxRC's answer. I think I will try that approach out. The only scenario where that wouldn't work is having standard and Phantom spiders coexisting, I believe.

Yomguithereal commented 9 years ago

Both scenarios would indeed work, and choosing one really depends on your particular case. The problem here lies more with the dashboard itself, which wasn't originally designed to handle more than a single spider at a time. But it would indeed be interesting to allow several spiders to share a single dashboard.

So how would you see things for such a dashboard @hanbzu?

hanbzu commented 9 years ago

What I was expecting was using it like this:

let myDashboard = dashboard()

let connectionsSpider = sandcrawler.spider('connections')
  .use(myDashboard)
  .scraper(connectionsScraper)
  .result(function(err, req, res) {
    if (err) return

    const serviceFeeds = res.data.map(_ => serviceUrlToFeed(_.url))
    serviceSpider.addUrls(serviceFeeds)    
  })

let serviceSpider = sandcrawler.spider('service')
  .use(myDashboard)
  .scraper(serviceScraper)
  .result(function(err, req, res) {
    ...
  })

export default function main() {

  connectionsSpider
    .urls(queries.map(connectionsQuery2feed))
    .run()

  serviceSpider.run()
}

I would then just expect the logs and totals to be updated in the dashboard, with the spider's own name as the only way of identifying it: "MyJawa/info" in the README animation could become "MyBlueJawa/info" and "MyOrangeJawa/info".

Otherwise I have to choose only one spider to show in the dashboard, and that pushes me toward @RouxRC's solution.

Now, I don't know whether that would require only a few changes or not.
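To make the expectation concrete, here is a minimal sketch of what a shared dashboard would have to keep track of: one log stream prefixed by spider name, plus per-spider and overall totals. This is not how sandcrawler-dashboard is implemented; the `SharedDashboard` class and its `register`, `log`, `record` and `overall` methods are hypothetical names used only for illustration.

```javascript
// Hypothetical aggregator, NOT the sandcrawler-dashboard implementation.
class SharedDashboard {
  constructor() {
    this.totals = new Map(); // spider name -> {done, failed}
    this.lines = [];         // combined log, one entry per message
  }

  // Declare a spider so that its counters exist before any job finishes.
  register(name) {
    if (!this.totals.has(name)) this.totals.set(name, {done: 0, failed: 0});
    return this;
  }

  // Prefix every log line with "<spider name>/<level>".
  log(name, level, message) {
    this.lines.push(`${name}/${level} ${message}`);
  }

  // Count one finished job for the given spider.
  record(name, err) {
    this.register(name);
    const counters = this.totals.get(name);
    if (err) counters.failed += 1;
    else counters.done += 1;
  }

  // Totals across every registered spider.
  overall() {
    let done = 0;
    let failed = 0;
    for (const counters of this.totals.values()) {
      done += counters.done;
      failed += counters.failed;
    }
    return {done, failed};
  }
}
```

In this sketch each spider would call `record` from its result handler and `log` for its messages, which is roughly the design question raised earlier in the thread: the display itself stays the same, but every counter and log line needs to be keyed by spider name.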