istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars, 323 forks

How to process different types of item in Processor? #219

Closed ChengkaiYang2022 closed 5 years ago

ChengkaiYang2022 commented 5 years ago

So, here is the case: I have four pipelines, called pA, pB, pC, and pD, and two types of item, called Item1 and Item2. Item1 should be processed by pA, pB, pC, and pD. Item2 should only be processed by pA and pC. Of course I have to set ITEM_PIPELINES={'pA':1,'pB':2,'pC':98,'pD':99}, but then Item2 will also be processed by pB and pD, which is wrong. So in the process_item method of pB and pD, I check the type of the item (type(item).__name__): if it is Item1, it is processed; if it is Item2, it is not.
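The type check described above could look roughly like the following. This is a minimal sketch, not code from the thread: `Item1` and `Item2` are dict-based stand-ins for the real item classes so that it runs without Scrapy installed, `PipelineB` is a hypothetical name for pB, and the `seen_by_pB` field is a placeholder for the real processing.

```python
class Item1(dict):
    """Stand-in for the first item type (assumption: dict-like items)."""


class Item2(dict):
    """Stand-in for the second item type."""


class PipelineB:
    """Hypothetical pB: only Item1 should be processed here."""

    def process_item(self, item, spider):
        # Return any other item type unchanged so that the pipelines
        # after this one (pC, pD) still receive it.
        if type(item).__name__ != "Item1":
            return item
        item["seen_by_pB"] = True  # placeholder for the real processing
        return item
```

Note that the skipped item is returned rather than dropped: raising `DropItem` in pB would prevent Item2 from ever reaching pC.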

But I think this may not be the best way. What if there are 10 pipelines and 5 item types? The code would get messy. Is there a better way to solve this problem?

ChengkaiYang2022 commented 5 years ago

Maybe we can simplify this problem by using a dict, like {'pA':['Item1','Item2'],'pB':['Item1'],'pC':['Item1','Item2'],'pD':['Item1']} in settings.py, and checking this dict in every pipeline from pA to pD. But this is still an open question when using Scrapy Cluster.
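One way the dict idea might be wired up is a shared base class that reads the mapping once and does the check for every pipeline. This is a sketch under assumptions: `PIPELINE_ITEM_TYPES` is a hypothetical custom setting name (not a Scrapy built-in), and `TypeFilteredPipeline`, `handle`, and `PipelineB` are illustrative names. It relies only on Scrapy's `from_crawler` hook and `Settings.getdict`.

```python
class TypeFilteredPipeline:
    """Base class that skips items whose type is not listed for this pipeline."""

    # Key looked up in the mapping; each subclass sets it (e.g. "pB").
    name = None

    def __init__(self, allowed_types):
        self.allowed_types = set(allowed_types)

    @classmethod
    def from_crawler(cls, crawler):
        # PIPELINE_ITEM_TYPES is an assumed custom setting holding the
        # {'pA': [...], 'pB': [...]} dict from settings.py.
        mapping = crawler.settings.getdict("PIPELINE_ITEM_TYPES")
        return cls(mapping.get(cls.name, []))

    def process_item(self, item, spider):
        if type(item).__name__ in self.allowed_types:
            return self.handle(item, spider)
        return item  # not listed for this pipeline: pass it on unchanged

    def handle(self, item, spider):
        raise NotImplementedError


class PipelineB(TypeFilteredPipeline):
    name = "pB"

    def handle(self, item, spider):
        item["seen_by_pB"] = True  # placeholder for the real processing
        return item
```

With this layout each concrete pipeline only implements `handle`, and adding an item type or a pipeline means editing the one dict rather than every `process_item`.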

ChengkaiYang2022 commented 5 years ago

Any help would be appreciated :)

madisonb commented 5 years ago

This is mostly a Scrapy issue, not a Scrapy Cluster issue (you will face the same problem with either project). I am not a heavy user of Scrapy's Item Pipeline, as we mostly use it to transform items into JSON and move them out of Scrapy and into a different pipeline framework (like Storm, Heron, NiFi, Flink, etc.).

The way you are doing it is probably what I would do, but in reality I would move my more complex item processing logic out of Scrapy.

Closing