scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.65k stars 10.41k forks source link

Don't require 'name' attribute for scrapy.Spider #1157

Open kmike opened 9 years ago

kmike commented 9 years ago

I think we should make Spider.name attribute optional. The name is used by SpiderManager to find spiders, but Spider can be used without a Scrapy project. It is unnecessary boilerplate for users of runspider command or for CrawlerRunner / CrawlerProcess users.

We can also provide a default value, e.g. self.__class__.__name__, to help with discovery; this have an advantage of 1-to-1 mapping between spider class names and names printed to users - spiders can become easier to find.

Opinions?

dangra commented 9 years ago

It is a good idea, do you plan to implement the default .name as a python property ?

kmike commented 8 years ago

do you plan to implement the default .name as a python property

i like this idea, haven't thought about it

eliasdorneles commented 8 years ago

I like it too. :)

So, would be something like:

    @property
    def name(self):
        return getattr(self.__class__, 'name', self.__class__.__name__)

And without the check for the name attribute on init?

eliasdorneles commented 8 years ago

Uh, I've just realized the above code would fail for the case when name is set via __init__.

eliasdorneles commented 8 years ago

Actually, would fail for other cases as well, I clearly didn't think this through. :D

qinfuqi commented 8 years ago

Hi, Dear all, How could I get rid of the subscribe of this email?

Best regards Fu Qi

在 2015-11-02 20:48:50,"Mikhail Korobov" notifications@github.com 写道:

do you plan to implement the default .name as a python property

i like this idea

— Reply to this email directly or view it on GitHub.

kmike commented 8 years ago

@qinfuqi I don't know; it depends on how did you get subscribed to it. Maybe you need to click "Unwatch" button somewhere. There is also 'Unsubscribe' button in the left column here. We haven't subscribed you to these emails; I think you're using some Github feature, and it is a general Github usage question. Sorry - we can do nothing; we don't have powers to unsubscribe you.

aron-bordin commented 8 years ago

Hi @kmike I'd like to start to contribute with scrapy and I'm trying to work on this issue. I've some questions about it:

kmike commented 8 years ago

Hey @aron-bordin, thanks for offering help!

in class names that ends with Spider, should the default name contains the Spider? eg. class StackOverflowSpider(scrapy.Spider): -> should I name it as StackOverflowSpider or StackOverflow?

I see what you're heading to, but I think we should preserve the name as-is.

if we use a property or add some code on init to create a default name, we need to instantiate this object to get the default value. Some classes (https://github.com/scrapy/scrapy/blob/master/scrapy/spiderloader.py#L25, https://github.com/scrapy/scrapy/blob/master/scrapy%2Fcommands%2Fcheck.py#L78) checks the spider name. So to get the default value working, it's necessary to instantiate the class in these files. What do you think about it? Can I create a spcls_instance = spcls() to get the default name?

We shouldn't call __init__ and create Spider instances just to get their names; there are many problems with calling __init__ - some spiders may have required attributes, other spiders may do heavy work in __init__ or create/delete some resources. __init__ is called after Crawler instance is created, and spiders may rely on it.

This means @property on object methods won't work; to make magic name work on class you can define a metaclass. Another option (which I prefer more - it is less magical and easier to support) is to create and use a Spider.get_spider_name classmethod which returns spider name either from .name attribute or from class name.

aron-bordin commented 8 years ago

I opened a pr. Please, when possible, check if my approach is good and if there is something that I can change to improve it. Thx

nyov commented 8 years ago

Is there actually any greater value in this Spider-class name attribute? Does it make sense to put this emphasis on it, instead of simply using the Spider-class name itself? If, as @kmike says, there is only SpiderLoader using it, that code looks straightforward enough to substitute looking up the Spider-class name instead (or simply pulling in all [Base]Spider subclasses).

The only interesting case here seems to be having different classes with the same name attribute? What's the defined behavior there, and could it be replicated by a user using python class inheritance rules instead? In which case I would downgrade this into an extension/addon feature, for whom may care, instead of trading in more magic for less importance ;)

Gallaecio commented 5 years ago

If and when this is done, please take into account that Spidermon currently uses spider names, for example to generate unique, spider-specific filenames for storing data in disk. You can search for spider.name there to find some of those usages.

Any change in this direction in Scrapy should probably be accompanied by the corresponding change in Spidermon.

GeorgeA92 commented 7 months ago

List of usages of spider.name attribute in scrapy source code:

1. scrapy/utils/engine/get_engine_status Used in memusage extension, telnet console extension and in at least 1 test [get_engine_status query](https://github.com/search?q=repo%3Ascrapy%2Fscrapy%20get_engine_status&type=code) https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/utils/engine.py#L11-L27
2. scrapy/extensions/statsmailer.StatsMailer extension https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/extensions/statsmailer.py#L28-L34
3. scrapy/templates/project/module/middlewares Code with `spider.name` created on .. start project command (so almost every scrapy project from proj template can have `spider.name` call) https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/templates/project/module/middlewares.py.tmpl#L56
4. scrapy/scrapy/statscollectors/MemoryStatsCollector https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/statscollectors.py#L74
5. parse command https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/commands/parse.py#L204
6. httpcache extension Both dbm and filesystem cache storage require `spider.name` https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/extensions/httpcache.py#L342-L344 and https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/extensions/httpcache.py#L228-L230
7. spiderloader when we type something like `scrapy crawl quotes` spider loader will try to find spider with `quotes` as value of `spider.name`
8. scrapy/scrapy/utils/url/url_is_from_spider -> spider.__init__ -> spiderloader https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/utils/url.py#L35-L39 In Spider class: https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/spiders/__init__.py#L90-L92 + some part of spider loader and it's all for maintaining this `spidercls_for_request`: https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/utils/spider.py#L112-L129 which is used in `fetch`, `parse` and `shell` commands https://github.com/search?q=repo%3Ascrapy%2Fscrapy+spidercls_for_request&type=code
  1. tests for points mentioned above, it's mentions on docs and specific spiders with defined name attribute (I skipped it here)
  2. other 3rd party libs that also can use spider.name. Here mentioned spidermon
rennerocha commented 7 months ago

other 3rd party libs that also can use spider.name. Here mentioned spidermon

If Scrapy adds a default property name for the scenario where the user doesn't define a name attribute, probably it won't require any changes in Spidermon. Is there a PR for this change so I can run it and test in Spidermon?

Gallaecio commented 7 months ago

I created https://github.com/scrapy/scrapy/pull/4327 some time ago, but I was not planning on resuming that work myself any time soon. In any case, having name return some name by default (the spider class import path?) is something to consider for sure.

GeorgeA92 commented 5 months ago

My preferred approach to solve this is https://github.com/scrapy/scrapy/pull/1649 at least on my attempts to apply it I didn't have issue with base class scrapy.Spider in scrapy list output mentioned on https://github.com/scrapy/scrapy/pull/1649#issuecomment-178697597 (but it still require to resolve merge conflict to macth current scrapy version) probably because spiderloader itself also significantly changed from 2016 - simply because with this approach only change for base spider class required (and do not required to update each point from https://github.com/scrapy/scrapy/issues/1157#issuecomment-1816782204)