scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Pipelines documentation limited #2350

Open mohmad-null opened 8 years ago

mohmad-null commented 8 years ago

I've spent quite a while going through the documentation, and while I like the concept of pipelines, nowhere can I find documentation that shows how to fully implement them end-to-end.

The Pipelines page (https://doc.scrapy.org/en/latest/topics/item-pipeline.html) only shows code for the pipeline itself, not how to use it / plug it into the main project.

The example project quotesbot is no better. While it does contain a pipelines.py, the class within it is never used. Indeed, filling the file with junk that should raise numerous syntax errors doesn't change anything either. As examples go, it is extremely limited. items.py is similarly ignored.

I would suggest the following:

djunzu commented 8 years ago

@mohmad-null The Pipelines page shows how to activate a pipeline and how to fully implement 5 pipelines. I don't see how the pipelines doc is limited.

mohmad-null commented 7 years ago

Hmmm. Missed that, thanks for pointing that out. I think I largely missed it because I expected the "how to use it" section to be at the start of the page, not at the very bottom.

Also, as noted, none of those 5 examples shows how to actually integrate it with the rest of the spider, and the points about the example project stand too.

redapple commented 7 years ago

Good point @mohmad-null. Documentation improvement pull requests are welcome.

djunzu commented 7 years ago

@mohmad-null

I agree; "how to use it" should come first. (Yet the reader should read everything!)

The example project is a different issue. This issue is about "Pipelines documentation limited". :) (By the way, take a look at #2024 and maybe give some comments there.)

I won't go example by example, but let's take the last one: Duplicates Filter. You populate your items with an id field and return the items in your spider. You activate the pipeline. Items will pass through this pipeline, and if two or more items have the same id, all of them but the first will be ignored. How does this not show how to actually implement it with the rest of the spider? What is missing in this example? If you just copy the code and return items with the id field populated, it will just work!
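For concreteness, the flow described above can be sketched as a self-contained snippet. DropItem here is a local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy installed; the item dicts and the driving loop stand in for what the Scrapy engine does after a spider yields items.

```python
# Sketch of the Duplicates Filter pipeline pattern from the docs.
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Each item is expected to carry an "id" field set by the spider.
        if item["id"] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(item["id"])
        return item


# Simulate the engine feeding scraped items through the pipeline.
pipeline = DuplicatesPipeline()
kept = []
for item in [{"id": 1}, {"id": 2}, {"id": 1}]:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass  # the engine would log the drop and move on

print(kept)  # only the first {"id": 1} survives
```

In a real project the loop does not exist; Scrapy calls process_item for every item the spider yields, once the pipeline is activated in the settings.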

mohmad-null commented 7 years ago

@djunzu - for your example of the Duplicates Filter, my thoughts would be:

zd-project commented 6 years ago

I definitely see why this may confuse a beginner: declaring a pipeline is different from invoking it (by including it in settings.py). The Scrapy tutorial only briefly mentions "to specify this in the settings" without emphasizing that settings.py is the way to customize a crawler, and that using a pipeline is considered a customization/extension, hence requiring an entry in settings.py. I think it's not only a matter of the pipeline documentation; the documentation in general should be better structured to guide beginner-level users toward the intermediate or advanced level. Does this make sense to you? @djunzu
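To make the declare-vs-activate distinction concrete: a pipeline class only runs once it is listed under the ITEM_PIPELINES setting in the project's settings.py. A minimal sketch, where "myproject" and "DuplicatesPipeline" are placeholder names:

```python
# settings.py -- the project settings module (sketch).
# The integer is the pipeline's order: lower values run first,
# so multiple pipelines form an ordered chain.
ITEM_PIPELINES = {
    "myproject.pipelines.DuplicatesPipeline": 300,
}
```

Without this entry, the class in pipelines.py is dead code, which is why even a pipelines.py full of junk in quotesbot raises no errors: the module is simply never imported.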

djunzu commented 6 years ago

documentation in general should be better structured to guide users in beginner level to proceed to intermediate or advanced level

I agree that documentation in general could be improved. A lot! But I think the documentation is not a tutorial and should not focus on teaching beginners how to do simple things. For me, Scrapy is a framework for programmers, not for non-programmers. If someone doesn't understand the documentation because he/she does not know basic programming, then I'm sorry, but the documentation should not be changed because of that. (I am not saying that is the case here. But I see a lot of questions/issues/doubts that come down to a simple lack of basic programming knowledge.)

  • How do I integrate it with the rest of the spider?

Maybe the documentation fails to make it crystal clear. But for me, if you read it carefully you will understand how to do it. Maybe I can only understand the documentation because I already know how to use it. Again: documentation can be improved and help is appreciated.

  • What is "item" and how do I define it? (I get that there's the entire "items" page of the help, but the documentation could really use better integration of the various concepts).

I think "item" is a basic concept in Scrapy. If you are reading about pipelines and don't know what an "item" is, you should first learn the basics of Scrapy: spiders and items.

  • Is "DropItem" a built-in exception or something custom created? If the latter, what does its definition/handling look like?

Again: documentation can be improved and help is appreciated.
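For the record, DropItem is built into Scrapy: it lives in scrapy.exceptions, and a pipeline raises it to discard an item (the engine catches it and logs the drop). A sketch of the pattern, using a local stand-in class so the snippet runs without Scrapy installed; the PricePipeline and its "price" field are illustrative, not from the thread:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class PricePipeline:
    """Hypothetical pipeline: reject items that have no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            # Raising DropItem tells the engine to discard this item;
            # later pipelines in the chain never see it.
            raise DropItem(f"Missing price in {item!r}")
        return item


pipeline = PricePipeline()
try:
    pipeline.process_item({"name": "widget"}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(dropped)  # the priceless item was discarded
```

In a real project the import would be `from scrapy.exceptions import DropItem`; no custom definition or handling is needed, since the engine handles the exception itself.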


I really think documentation could be a lot better! But I think it should not be a tutorial. It should describe how things work in Scrapy, not give a step-by-step beginner guide on how to do things. (Note that there is a beginner tutorial in the documentation, so anyone can start their journey!)