scrapinghub / frontera

A scalable frontier for web crawlers

passing `meta` parameters in distributed backends mode for sqlalchemy #162

wetneb opened this issue 8 years ago (status: open)

wetneb commented 8 years ago

Hi, I do not understand how to set meta parameters on a frontier Request generated from a seeder. There seem to be two kinds of meta parameters: frontier ones and Scrapy ones. I would like to set Scrapy meta parameters so that my Scrapy middlewares get to see them. It seems they have to be set as meta['scrapy_meta'] = my_scrapy_meta, but by the time the request reaches my middleware these parameters have disappeared (only the 'frontier_request' key remains). Any idea where this comes from? Should I rewrite my middleware as a Frontera middleware (one that works on frontier Requests)? Thanks a lot!
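For context, a minimal sketch of the pattern being described, assuming the seeder builds requests via frontera.core.models.Request (which accepts a meta keyword in the versions I have seen). The nested 'scrapy_meta' key is the convention referred to above; whether it survives the trip to the spider is exactly what this issue is about.

```python
# Minimal sketch only; assumes frontera.core.models.Request takes a `meta` kwarg.
from frontera.core.models import Request

def make_seed_request(url):
    # Top-level keys are frontier meta; the Scrapy-side meta that the spider
    # middlewares should eventually see is nested under 'scrapy_meta'.
    return Request(url, meta={'scrapy_meta': {'my_flag': True}})
```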

sibiryakov commented 8 years ago

Seed loaders are Scrapy spider middlewares, so the same rules apply as to any other Scrapy middleware. To help you, I need to know your Frontera cluster setup: backends, message bus and run mode.

wetneb commented 8 years ago

Thanks a lot for your reply! I'm using the distributed setup with ZeroMQ, and the default run mode. I can see that the meta parameters I introduce in the seeder are still available when the requests arrive in the DB and strategy workers.

What is the status of the converters here: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/converters.py
Are they involved in converting the frontier request into the Scrapy one? If so, when does that happen?

sibiryakov commented 8 years ago

@wetneb What backend do you use? With HBase, meta isn't persisted, but with the SQLAlchemy backend it is. Converters are used in the spider processes, and conversion happens every time a request is read from Frontera and a response is passed back to it.
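To make the conversion step concrete, here is a simplified illustration of the round trip described above. It is not the actual converter code, only the behaviour this thread implies: Scrapy meta is nested under 'scrapy_meta' on the way into the frontier and unpacked again on the way back, so if the backend drops meta there is nothing left to restore.

```python
# Simplified illustration of the converters' round trip (not the actual
# frontera code; function names are made up for this sketch).

def scrapy_to_frontier_meta(scrapy_meta):
    # Scrapy -> frontier: nest the whole Scrapy meta dict under one key.
    return {'scrapy_meta': dict(scrapy_meta)}

def frontier_to_scrapy_meta(frontier_request):
    # Frontier -> Scrapy: keep a handle on the originating frontier request
    # and restore whatever was stored under 'scrapy_meta', if anything.
    meta = {'frontier_request': frontier_request}
    meta.update(frontier_request.meta.get('scrapy_meta') or {})
    return meta
```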

wetneb commented 8 years ago

@sibiryakov Thanks! I'm using frontera.contrib.backends.sqlalchemy.Distributed as a backend, so meta is indeed persisted there. I suspect meta disappears during the conversion process in the spider. I will try to debug that.

wetneb commented 8 years ago

Changing the backend to 'frontera.contrib.backends.sqlalchemy.SQLAlchemyBackend' did indeed solve the issue. But I needed to keep the Distributed backend for the strategy worker; is that normal? And what is the rationale behind keeping meta in one backend but not the other? Thanks a lot anyway!
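For anyone hitting the same thing, a rough sketch of how the BACKEND setting can differ per process in a setup like this. Only the two backend paths are taken from this thread; MESSAGE_BUS and SQLALCHEMYBACKEND_ENGINE are recalled from the docs and should be checked against your Frontera version.

```python
# Rough sketch of per-process Frontera settings; backend paths are from this
# thread, the other setting names are assumptions to verify against the docs.

MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'

# strategy worker settings module: keeps the distributed backend
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# the process that needs meta persisted (per the comment above) uses instead:
# BACKEND = 'frontera.contrib.backends.sqlalchemy.SQLAlchemyBackend'
```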

sibiryakov commented 8 years ago

@wetneb Oh, that's great you found it. https://github.com/scrapinghub/frontera/blob/master/frontera/worker/strategies/__init__.py#L90 Meta isn't transferred there for historical reasons, but it would make sense to do so. PRs are always welcome.
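A hypothetical sketch of the kind of change being invited here: when the strategy worker builds a new request to schedule, copy the original request's meta across instead of dropping it. The names below are illustrative only and do not correspond to the actual Frontera source linked above.

```python
# Hypothetical sketch only; function and parameter names are illustrative.

def build_followup_request(request_cls, link_url, parent_request):
    """Create a request for a newly discovered link, carrying over the
    parent's meta instead of discarding it."""
    inherited_meta = dict(parent_request.meta or {})
    # 'frontier_request' is spider-side bookkeeping; no point in copying it.
    inherited_meta.pop('frontier_request', None)
    return request_cls(link_url, meta=inherited_meta)
```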

wetneb commented 8 years ago

Excellent, I'll try to do that then. Thanks a lot!