istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

UI for displaying information about Cluster #25

Open madisonb opened 9 years ago

madisonb commented 9 years ago

We need a small stand-alone web UI that ties in with the Rest components in #24 to visualize the data generated by the cluster. You should also be able to submit API requests to the cluster through it.

Preferably, this web UI and the Rest services live together and are deployed as a single running process.

Paw2Pom commented 8 years ago

something like this?

https://django-dynamic-scraper.readthedocs.io/en/latest/introduction.html

madisonb commented 8 years ago

Not a web interface for generating spiders, but a web interface for visualizing the API data generated from working with the cluster. The Rest services wrapper that interacts with Kafka would use the same APIs as Kafka does, so the data returned could be ingested into a primitive series of pages that display info about your scraping cluster.

For example:

Basically, anything you can get from the Kafka API you should be able to see when visiting the UI.

Paw2Pom commented 8 years ago

I think there are several choices (e.g. Airflow, NiFi) with good UIs, but for now I think NiFi may be the best bet, since Airflow doesn't support Kafka, only RabbitMQ, at the moment.

Paw2Pom commented 8 years ago

Good illustrations in:

http://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka

http://stackoverflow.com/questions/39399065/airbnb-airflow-vs-apache-nifi

madisonb commented 8 years ago

Apache NiFi is nice but is not the UI I am looking for. Picture a very plain UI like those for Hadoop, HBase, Spark, Storm, or another popular open source Apache project. It tells you basic information about the cluster and allows primitive manipulation of the controls within. The Rest services in #24 would be built via Flask, with a basic Angular (or other framework) front end.

The UI doesn't actually care what is behind the scenes; it just interacts with the rest server. The rest server then interacts with the cluster. It helps keep things separated and abstract.

madisonb commented 8 years ago

Pushing this back to 1.3, with the focus on getting a solid rest service (#24) for 1.2

Paw2Pom commented 8 years ago

ok!

Will the rest service also tackle the problem of crawling the same website (i.e. news or ecommerce sites) over time, so that duplicate items are not output?

madisonb commented 8 years ago

The rest service does not take care of that; it sounds like you would need a customized spider to do better duplicate detection, or to modify the very vanilla RFPDupefilter to handle detection of duplicates at a site-wide scale. That class prevents duplicate requests from being crawled, but you may want to add similar logic to the item pipeline to drop items that share whatever footprint criteria meet your filtering needs.
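A minimal sketch of that kind of pipeline (the class name and footprint fields are illustrative, not part of scrapy-cluster; a real distributed cluster would want the seen set in a shared store like Redis rather than in memory):

```python
import hashlib

from scrapy.exceptions import DropItem


class FootprintDupeFilterPipeline(object):
    """Drops items whose chosen 'footprint' fields have been seen before."""

    def __init__(self):
        # In-memory only; a distributed cluster would back this with Redis
        self.seen = set()

    def process_item(self, item, spider):
        # Build the footprint from whichever fields define "the same item"
        footprint = hashlib.sha1(
            (item.get('url', '') + item.get('title', '')).encode('utf-8')
        ).hexdigest()
        if footprint in self.seen:
            raise DropItem("Duplicate item footprint: %s" % footprint)
        self.seen.add(footprint)
        return item
```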

If you have further questions about that please move the conversation to Gitter, as this issue is for UI conversation.

damienkilgannon commented 8 years ago

The Spark UI is a nice functional interface for handling Spark jobs. Something similar for scrapy-cluster would be a good addition to the project. Potentially a central interface for scheduling tasks to be sent to the cluster through 'rest', and also for viewing Kafka, Redis, and crawler stats/metrics? What core features would you like to see in the UI, @madisonb?

A lightweight UI could be created with Flask and some static HTML templates, or a more extensible UI with Angular/React. I would be interested in contributing and could start building something in another week or so.

[screenshot]

madisonb commented 8 years ago

@damienkilgannon Thanks for your interest! I have a bit more work to do on the Rest passthrough endpoint (Kibana logs), but it is pretty much ready to go. I built a special branch of the docs, if you are interested, before it gets merged into the dev branch, so you can get a sense of what the Rest endpoint will do.

My vision for the UI would be to utilize the existing Kafka Monitor API return values to make something like you are saying above -> display basic information about the cluster, view spiders, backlogs, and do some basic interaction. Maybe even a raw JSON page for those who have custom setups.

One thing I do have a question about is "Can we make the UI independent of the Rest service?" I imagine the interaction being like:

UI (flask app with templates, angular) <--- talks to ---> Rest service (new flask app, multithreaded component linked above) <--- communicates with ---> Kafka and the rest of the cluster
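A minimal sketch of that separation, assuming the Rest service exposes the /feed passthrough from #24 (the port, endpoint, and template name here are assumptions; the point is that the UI only ever speaks HTTP to the Rest service):

```python
import requests
from flask import Flask, render_template

app = Flask(__name__)
REST_URL = "http://localhost:5343"  # wherever the Rest service is deployed

@app.route("/")
def overview():
    # Ask the Rest service to pass a stats:all request through to the
    # cluster; the UI never touches Kafka or Redis directly
    resp = requests.post(REST_URL + "/feed",
                         json={"uuid": "ui-overview", "appid": "ui",
                               "stats": "all"})
    return render_template("overview.html", stats=resp.json())

if __name__ == "__main__":
    app.run(port=5000)
```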

While UI is not my specialty, I would like to have unit testing and best practices applied here. If someone doesn't like the UI we build, they can still use the rest endpoint and build their own - so I would like to leave them decoupled. I would suggest initially looking at the Kafka Monitor API docs here (since that is just translated into Rest), and we can iterate on things.

I also don't want to build something that tries to emulate what we do with Kibana. Kibana is awesome, and our UI should complement it or help assist those that don't have access to the more complex ELK stack.

damienkilgannon commented 8 years ago

Yes, that sounds good. I definitely think keeping it decoupled from the rest app is the best way forward. Testing and best practices are fair enough; maybe we can look at mocking the rest API for testing of the UI. I will mock/sketch up the UI design I am thinking about. It's easier to compare notes and understand the project's needs that way.

damienkilgannon commented 8 years ago

Something like this? @madisonb

scrapyclusterUI.pdf

damienkilgannon commented 8 years ago

[mockup screenshots]

madisonb commented 8 years ago

Those look like good first mock-ups; you have certainly got my wheels turning on what we would really like to see in a UI! Keep in mind we can generate priorities for work for version 1.3, but the API isn't going to change much (or have new additions) for 1.2, so I say we stick with what we have for now. I like your initial cut at things; let me lay out a more detailed approach I think would be viable.

Lots of thoughts to follow.


Template/Theme

Overview/Landing Page: A high-level landing page that shows you a condensed version of the stats:all request. Shows things like:

I actually like the idea of moving the "Submit Job" UI to the landing page, and since this is the raw tool for the cluster I would like to expose all configurable options (perhaps on an "Advanced Submit Job" screen). Keeping it simple is great; I say go with url, maxdepth, expires, and maybe appid (it can default to ui-submitter or something), but auto-generate the crawlid to be a uuid4. In the advanced area we can then expose all of the other features the crawl API exposes.
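For concreteness, a sketch of the basic submit as it might be fed through the Rest passthrough (field names follow the Kafka Monitor crawl API; the endpoint and port are assumptions):

```python
import uuid

import requests

payload = {
    "url": "http://example.com",   # the one field the user must supply
    "appid": "ui-submitter",       # suggested default
    "crawlid": uuid.uuid4().hex,   # auto-generated, per the discussion above
    "maxdepth": 1,
    "expires": 0,                  # 0 = no expiration
}
requests.post("http://localhost:5343/feed", json=payload)
```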

Active Jobs: Like your pic above, the place to see your current jobs at a high level. Since we don't quite have an API to view all existing appids, we may have to force the user to supply at minimum an appid, or both the appid and crawlid. This returns the line item like you mentioned, but with the following attributes (if they can all fit):

If you were to click on this list item, you would be taken to a sub-job page to view the 'high' and 'low' priority for every domain, and the total count for each domain. I picture a table/list view with all of this information available to the user. Within each sub-job view, there is a button to Stop the job, using the action:stop API - we already have all the information we need to submit a stop job request!
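The stop button itself would feed something like the following through the Rest service (shape per the Kafka Monitor action API; the uuid and ids here are illustrative):

```python
# All of these values are already visible on the sub-job page
stop_request = {
    "uuid": "ui-stop-0001",        # illustrative; ties the response back to us
    "appid": "ui-submitter",
    "crawlid": "0123456789abcdef",
    "action": "stop",
}
```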

Stats: We can do a lot with the Stats API, but it can be computationally expensive to create some of the stats on really large/backlogged clusters. If we want, we can utilize the stats:kafka-monitor, stats:redis-monitor, and stats:spider requests to give a high-level view of performance metrics.

Or we could use everything if we go with the sub-pages; it makes for an easy tree breakdown of:

I'm not sure if we want to reinvent D3 charts, or Kibana for that matter, but we have some nice numbers to work with - keep in mind the stats are somewhat dynamic, so hardcoding values like 604800 isn't something we want here.
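One way to avoid that is to render whatever window keys the response actually contains; a sketch, assuming the stats response maps rolling-window sizes to counts and where rest_feed is a hypothetical helper that posts to the Rest service:

```python
response = rest_feed({"uuid": "ui-stats", "appid": "ui",
                      "stats": "kafka-monitor"})
# Iterate over whatever time windows come back instead of hardcoding 604800
for window, count in sorted(response.get("total", {}).items()):
    print("requests in last {} seconds: {}".format(window, count))
```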

Output: I like this, and I think we could have a single input box here for the rest endpoint to sample the Kafka stream for a desired period of time (giving the user some options here like 5 sec, 10 sec, 30 sec, 1 min), and JSON-dump the results in a basic list view. This would be really helpful for debugging purposes, instead of SSH'ing onto your Kafka cluster or running a python script to sample it. We could have some built-in defaults here for the Kafka topics, or you could supply your own topic name.
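A sketch of what that sampling could look like behind the Rest endpoint, using kafka-python (the topic name is the scrapy-cluster default firehose; the broker address and function name are assumptions):

```python
import json
import time

from kafka import KafkaConsumer


def sample_topic(topic, seconds, brokers="localhost:9092"):
    """Collect messages from a Kafka topic for roughly `seconds` seconds."""
    consumer = KafkaConsumer(topic, bootstrap_servers=brokers,
                             consumer_timeout_ms=1000)
    deadline = time.time() + seconds
    sampled = []
    while time.time() < deadline:
        for message in consumer:  # iterator ends after 1s of silence
            sampled.append(json.loads(message.value))
            if time.time() >= deadline:
                break
    consumer.close()
    return sampled

# e.g. show 10 seconds of crawl output in the UI's list view
print(sample_topic("demo.crawled_firehose", 10))
```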

Raw JSON: Lastly, there are still some APIs we can't quite cover because you can't query the full state of the system. The Zookeeper API and Blacklist API are two examples, and there are times when you just want to throw some raw JSON at your cluster, especially if you are customizing it. I think this would be helpful, but perhaps with a banner at the top stating to be careful and understand what you are doing.

That gives us the following sitemap breakdown:


After writing all of this up, I see things that are needed but that I probably won't get to until 1.3:

What are your thoughts? It seems like a lot - and it is a lot. But any help is appreciated!

damienkilgannon commented 8 years ago

@madisonb great feedback, thanks. That first swipe at the UI design has got us on the same page now. Clearly, there is loads of functionality that can be incorporated in the UI, as you mentioned above, and I think it should all be included as long as the features don't make the design or code base overcomplicated, hampering the potential for future extensibility. I will redo the mock UI design based on your feedback and evaluate what could be left until 1.3. Will be back in a few days.

damienkilgannon commented 7 years ago

@madisonb I have made some adjustments to your sitemap, but follows the same idea:

The landing page/overview page calls 'rest' to make a stats:all request to populate the top row with an overview of cluster stats (the request is made on a page refresh). 'rest' is then called again to make a job submit using the 'basic job submit' form on the landing page. And finally, at the end of the landing page, the user can initiate a 'rest' call to query active jobs based on appid.

All following on from your description and breakdown previous.

Top right corner has link to the readthedocs site.

Styling will be kept clean and simple, with options for the user to customize to a certain extent.

See a new mock up of the landing page using bare bootstrap styling.

In regards to redis-monitor, kafka-monitor, and crawler stats: I think it's a good idea to keep them exposed on the main nav bar. They would probably be the features containing the most valuable info for a user, and quick access would be nice. I would think a simple page refresh, or even a refresh button to fire off a call to 'rest' to populate the stats on these pages, would be suitable. Nothing fancy for v1.2 on presentation of stats; just raw data to get started.

Let me know your thoughts. I can probably get started building this later in the week?

[mockup screenshot]

madisonb commented 7 years ago

Mockup looks great!

I was finally able to tidy up the rest component, so now it is merged into the dev branch and dockerized, which should make for fairly straightforward testing. If you need any help setting up a working cluster, feel free to reach out more informally on Gitter as well.

EDIT: Full speed ahead

damienkilgannon commented 7 years ago

Hey @madisonb, what I have ended up doing is creating a Kafka producer which periodically sends stats requests at a preset time frequency, and I have then created a consumer which continuously listens for the responses to these requests. The consumer will validate stats messages and write them to a file. The file is overwritten each time a new stats response is received, thus acting as a cache for the latest stats from the cluster. The UI then loads these most recent stats; I have yet to decide the best way to get the UI to reload/refresh the stats. Trying to keep things as simple and straightforward as possible. Will commit some code in the coming days to discuss further.
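A sketch of that producer/consumer pair as described (the topic names are scrapy-cluster defaults; the file name, interval, and validation check are assumptions):

```python
import json
import threading
import time

from kafka import KafkaConsumer, KafkaProducer

STATS_CACHE = "latest_stats.json"

def request_stats_forever(interval=60, brokers="localhost:9092"):
    """Periodically feed a stats:all request into the Kafka Monitor topic."""
    producer = KafkaProducer(bootstrap_servers=brokers,
                             value_serializer=lambda v: json.dumps(v).encode())
    while True:
        producer.send("demo.incoming",
                      {"uuid": "ui-stats", "appid": "ui", "stats": "all"})
        producer.flush()
        time.sleep(interval)

def cache_stats_forever(brokers="localhost:9092"):
    """Overwrite the cache file with each stats response (latest wins)."""
    consumer = KafkaConsumer("demo.outbound_firehose",
                             bootstrap_servers=brokers)
    for message in consumer:
        response = json.loads(message.value)
        # Crude validation; the exact response shape is an assumption
        if response.get("uuid") == "ui-stats":
            with open(STATS_CACHE, "w") as f:
                json.dump(response, f)

threading.Thread(target=request_stats_forever, daemon=True).start()
cache_stats_forever()
```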

damienkilgannon commented 7 years ago

Created a pull request for this; the first pull request is going to provide a very simple UI, with the goal of extending its functionality in the future. This UI will serve two key tasks: providing an update on current cluster status (Redis connected, Kafka connected) and a form to submit a crawl request to the cluster.

3rawkz commented 7 years ago

You guys are my kind of peeps! For a while now I have been putting aside creating a project with the same scope, though I was only just introduced to scrapy-cluster... Just earlier today I started the build process after getting tired of scp'ing and ssh'ing into my production cloud account to run my data mining projects... running the spiders "headless"... detached really, subprocessing, etc... Got some time off, so here I go!! About to give scrapy-cluster a spin, see if it turns me off of scrapyd. Cheers!

villeristi commented 6 years ago

Hey guys! What is the situation with this one?

madisonb commented 6 years ago

@villeristi both #174 and #116 are in a partial working state to get the UI into the main branches.