scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

I don't understand the Spark mentioned in Chapter 11 #4

Closed xiaowenjie6434 closed 8 years ago

xiaowenjie6434 commented 8 years ago

I don't understand the Spark mentioned in Chapter 11, or the following statement:

FEED_URI='ftp://anonymous@spark/%(batch)s%(name)s_%(time)s.jl' ... -a batch=12

Is Spark support needed to complete the distributed Scrapy crawler? A more detailed explanation of this would be very helpful.

lookfwd commented 8 years ago

Hello @xiaowenjie6434. In order to do distributed crawling you certainly don't need Spark. You can run your crawlers and save their results to e.g. Amazon S3 or the local filesystem, as described in the chapter. Out of the 23 pages of Chapter 11, fewer than 5 refer to Spark. The rest are devoted to Scrapyd, tuning performance, and developing the necessary Scrapy middlewares and commands. This is material that will help you every time you need to perform distributed crawling.

What I try to address with Spark is this question: what do you do with all that data? Typically you might want to process it as soon as it arrives, or store it in a database. Whatever the need, it's highly recommended to perform such operations on large chunks of data, because otherwise performance will suffer (e.g. if you try to insert each Item individually into a database) and any distributed crawling will be pointless. More details on the rationale can be found in the "Overview of our distributed system" section.
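To make the "large chunks" point concrete, here is a minimal sketch (not code from the book) of an item pipeline that buffers items and flushes them in batches; the commented-out insert_many() call and the batch size are placeholders for whatever your storage layer provides.

class BulkInsertPipeline(object):
    """Buffer scraped items and write them out in large chunks, not one by one."""

    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush(spider)
        return item

    def close_spider(self, spider):
        # Don't lose the last, partially filled batch.
        if self.buffer:
            self.flush(spider)

    def flush(self, spider):
        spider.logger.info("flushing %d items in one operation", len(self.buffer))
        # db.insert_many(self.buffer)  # one round-trip instead of thousands
        self.buffer = []

The same idea applies whatever the sink is: feed it large batches, not individual Items.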

In Chapter 11 I demonstrate one of the coolest things one can do. We use the very popular stream-processing system, Apache Spark, to process data as soon as it arrives and perform real-time analytics.

Usually setting up even a small distributed system, regardless of whether it uses Apache Spark or not, is complex. Scrapy makes it easy. We can set up a system using Scrapy and Scrapyd, and one passes data to the other with just two settings, -s DISTRIBUTED_START_URLS=... and -s FEED_URI=..., and one spider argument, -a batch=.... The code (middleware) that makes this possible is described in the "Batching crawl URLs" and "Getting start URLs from settings" sections.
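In case you don't have the book at hand, here is a rough sketch, using the setting name from this thread, of what the worker-side half ("Getting start URLs from settings") looks like as a spider middleware. It is not the book's exact code, just the shape of the idea: when DISTRIBUTED_START_URLS is set, the middleware replaces the spider's own start requests with that batch of URLs.

import json

from scrapy import Request


class DistributedStartUrls(object):
    """Spider middleware sketch: on a worker, crawl only the batch of URLs
    passed in with -s DISTRIBUTED_START_URLS='["http://...", ...]'."""

    def __init__(self, settings):
        self.worker_urls = json.loads(settings.get('DISTRIBUTED_START_URLS', '[]'))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_start_requests(self, start_requests, spider):
        if not self.worker_urls:
            # Master or standalone run: keep the spider's own start requests.
            for request in start_requests:
                yield request
        else:
            # Worker run: crawl only the URLs assigned to this batch.
            for url in self.worker_urls:
                yield Request(url, spider.parse_item)

The master-side half ("Batching crawl URLs") is roughly the mirror image: a process_spider_output() that collects outgoing requests into batches and hands them to the Scrapyd workers instead of following them locally.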

The example command line you're referring to comes from this last section. It leverages Scrapy's excellent support for various destination feeds (i.e. where your scraped files go). The book demonstrates how to upload files to an FTP server by setting FEED_URI to ftp://anonymous@spark/<filename>. Spark monitors this FTP directory and processes new files as soon as they arrive. If you want to upload them to Amazon S3 instead, all you have to do is change FEED_URI to e.g. s3://mybucket/scraping/feeds/<filename>. For more information on FEED_URI, see also page 113 in Chapter 7, Configuration and Management, and of course the excellent "Feed exports" section of Scrapy's documentation.
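As a concrete illustration (the host, bucket and filename pattern below are examples, not taken verbatim from the book's project), the relevant feed settings might look like this:

FEED_FORMAT = 'jsonlines'

# Upload each batch to the FTP server that Spark is watching:
FEED_URI = 'ftp://anonymous@spark/%(batch)s%(name)s_%(time)s.jl'

# ...or switch the destination to Amazon S3 by changing only the URI:
# FEED_URI = 's3://mybucket/scraping/feeds/%(name)s_%(time)s.jl'

%(name)s and %(time)s are filled in by Scrapy's feed exports, while any other named parameter, such as %(batch)s here, is taken from the spider attribute of the same name; that is how the -a batch=12 argument ends up in the filename.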

Spark and Spark Streaming have excellent documentation, and right now they are so hot that new books are being released every month. The classic on the subject is this one, but there are also many good, more specialized books. I think every software engineer would benefit from investing a few weeks in learning Spark, and I would be happy if it's Chapter 11 that inspires you to do so. On the other hand, I would perfectly understand if you decide to skip it. In either case, this book provides the virtual machine and everything you need to run Chapter 11's examples without knowing anything about Spark.

I'm sure this doesn't completely answer your question, but please elaborate and I will explain further.

xiaowenjie6434 commented 8 years ago

I understand now: a distributed crawler doesn't need Spark. Spark is used for processing big data and for real-time analysis, and it's very exciting technology. For now my crawlers still have speed issues, so distributed crawling is what I need to focus on.

In the middleware code I saw the Rule parameter. If I'm not using a spider Rule to get the next-page link, but something more complex, for example a next page that is loaded via AJAX requests, then would the Rule parameter in the middleware be unable to get the next request URL? If I'm using manual.py instead of easy.py, how should I write the middleware? In addition, can distributed crawlers also use Scrapy + Redis? Compared with the Scrapyd approach, what are the pros and cons? The book already explains Scrapy knowledge well; understanding the use of the middleware may require analyzing its source code. I am a novice: before this I was crawling data with RCurl, so I'm new to Scrapy. This is me giving myself an incentive to keep it up.

lookfwd commented 8 years ago

Great! Yes, Spark is indeed great.

In the middleware code I saw the Rule parameter. If I'm not using a spider Rule to get the next-page link, but something more complex...

The mechanisms in the middleware are the hardest code you might need to write. The rest is a little glue logic that is unfortunately quite application specific, with a huge variety of little implementation details, but it should be easy.

If I'm using manual.py instead of easy.py, how should I write the middleware?

Great question. That's a very good example of what I mention above. As far as I can see, there are just two tiny changes needed to do a distributed crawl with manual.py. I haven't tested it, but at a quick look all you would have to do is replace line 31 with

yield Request(urlparse.urljoin(response.url, url), meta={'rule': 0})

This is a bit hacky, but it would enable the process_spider_output() method to pick up and batch those requests. Then on the worker side, when start_requests has to be populated, you replace line 94 with

yield Request(url, spider.parse_item)

It should be that simple. The latter, by the way, should work with most spiders that use a parse_item method.

In addition, can distributed crawlers also use Scrapy + Redis? Compared with the Scrapyd approach, what are the pros and cons?

Also have a look at scrapy-cluster. All those solutions are great. By using the pretty standard and simple Scrapyd, with less than 200 lines we get a nice middleware that demonstrates and teaches distributed crawling. I think scrapy-cluster and scrapy-redis are less suited for a book: either I would have to go into too much detail, or I would just show how to install them along with a very simple example, and then one would get no real insight into how it's done from Scrapy's programming perspective.

scrapy-redis is easy to set up and evaluate. scrapy-cluster is more complex (Kafka + ZooKeeper), but you can still use Vagrant to evaluate it, as they explain in their quick start guide. All I see is a tradeoff between feature richness (scrapy-cluster) and simplicity (scrapy-redis).

understanding the use of the middleware may require analyzing its source code

It's worth the effort. The more you try to read and understand complex code, the more able you become to solve non-trivial Scrapy problems. It's an essential supplement to quick solution-searching on Stack Overflow.

keep it up

Thanks a lot! :)