scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

[MRG] Selectors unified API #426

Closed dangra closed 10 years ago

dangra commented 10 years ago

Implementation of #395, and serves as continuation for #176 .

dangra commented 10 years ago

Travis needs a cssselect version with pseudo-elements improvements. Watch SimonSapin/cssselect#33

nramirezuy commented 10 years ago

Travis passed! (: Can we merge and close this?

barraponto commented 10 years ago

Merge! Merge! Merge!

dangra commented 10 years ago

The pull request is still in progress, when it's ready to merge the title will change from WIP to MRG

thanks for the enthusiasm guys!

dangra commented 10 years ago

hi everyone watching, this PR reached merge status. Who wants to review before it's merged?

dangra commented 10 years ago

hey, before I merge I want to bring two naming changes to discussion

  1. This PR replaces hxs and xxs by ss in docs, spiders code and shell

    @pablohoffman suggested using just s, I find ss less namespace clashing specially on shell, and easier to grep for. which one do you prefer?

  2. Selector class grow a new constructor argument named contenttype, it chooses the parser + serialization-method + css-translation flavor to use with the response.

    To force XML specifics, you instanciate the selector like:

    ss = Selector(text='<tag>...</tag>', contenttype='xml'),

    and similarly for HTML using contenttype='html'

    The name of this new argument is under discussion, so far the options are: contenttype, flavor or parser. I think it is important to get this right, it is going to be part of the public API. which one do you prefer?

kmike commented 10 years ago
  1. another option is sel - it still has the same length as old hxs/xxs, but not as cryptic;
  2. I don't like contenttype name because one could expect to use 'text/html' instead of 'html'. parser would be a good name if it also acceps an instance of some parser class, not only string shortcuts. None default value could be changed to 'auto'.
dangra commented 10 years ago

@kmike:

  1. ss stands for Scrapy Selector, but I must admit sel is a good option. let's vote.
  2. I don't like contenttype for the same reason, I prefer flavor. the problem with parser is that it isn't just about selecting the parser, it also chooses the serialization and css translation methods.
redapple commented 10 years ago
  1. I find ss a bit awkward, sel as @kmike suggest is fine by me, sltr is cryptic I know :)
  2. I also think contenttype hints at a MIME type.
pablohoffman commented 10 years ago
  1. I would just use s as an anonymous, short-lived variable name, not worrying about namespace clashes
  2. I prefer parser to contenttype, and don't worry as much as @dangra does about conditioning other parts, unless those other parts would be plugabble too. Another option could be just type?
dangra commented 10 years ago

I really don't like parser, if we allocate it now for this, it can't hold a real parser instance later (without doing magic conditionals)

dangra commented 10 years ago

btw, +1 to default contenttype (or whatever name it takes) to "auto"

redapple commented 10 years ago

what about using exactly that, a content type as in HTTP headers? to accomodate "text/html", "text/xml", and even "application/json" if some day the API supports JSONPath (http://goessner.net/articles/JsonPath/), and thus use contenttype as argument name

(of course, there are a few variations in content types for XML and HTML, but the API could accept a handful)

If the use for contenttype/flavor/parser is for when you give it some text, it makes sense to me to accompany it with a Content-Type

dangra commented 10 years ago

@redapple: the argument name is going to be exposed as a public attribute too.

We foresee JSONPath support, although I was thinking on using json as value for that.

your proposal restore the sense on contenttype. If we stay on it, using text/html, text/xml,... fits better.

redapple commented 10 years ago

I wouldn't want to ruin the show here, but it's appearing to me now that the Selector class should probably be called a Document class of some sort, just like Nokogiri has Nodes and NodeSets, which have .css() and .xpath() methods to select sub-nodes

But Scrapy users are used to the "selector" term, so...

kmike commented 10 years ago

JSON looks different because you can't issue xpath or css selector over json, and you can't use 'json()' on html or xml data. So it seems there would be almost zero shared code between existing Selector and json-enabled selector ('re' method could still work). So I wonder if there is a value in having a single selector class for that: based on content-type you'll get totally different behavior and supported features. Changing interface depending on constructor parameter is not a good thing.

Unified selector API unified two different substances:

1) executing CSS and XPath selectors on parsed responses; 2) parsing HTML and XML responses.

dangra commented 10 years ago

I see Nogokiri to be more similar to lxml than Scrapy selectors.

Scrapy Selectors doesn't aim to be able to modify the underlying parsed document, it doesn't expose the parsed document either. It is aiming to be a querying tool.

redapple commented 10 years ago

ok

nramirezuy commented 10 years ago
  1. I like sel over ss. I use just one selector on scrapy so I don't need the differentiation of Scrapy Selector, I just call them Selector and everybody understand what I'm talking about.
    • I dislike contenttype because is large, and I don't want to write text/html, text/xml when the text/ really doesn't matters.
    • type could be good but is a reserved word
    • ctype can be confusing because C lang
    • parser not sure if it is safe pass the parser instance to selectors, and can be confusing
    • flavor no objections it is short and makes some reference :+1:
barraponto commented 10 years ago

I think sel is definitely better than ss, my PEP8 parser screams at me when I use variable names shorter than 3 characters because it will always be cryptic. sel = Selector(yaddayadda) is pretty obvious and doesn't hurt my eyes.

As for flavor, it introduces more scrapyism, doesn't it? I like contenttype or just content.

dangra commented 10 years ago
  1. based on votes I'd say the dust settled on sel
  2. I'm giving contenttype another chance,

    Selectors are tipicaly instanciated from a response object:

    sel = Selector(xmlresponse)

    Making contenttype only useful together with text where a explicit MIME type makes sense:

    sel = Selector(text="<tag>...</tag>", contenttype="text/xml")

    And as last option, forcing an incorrectly detected response to a specific type, must say this is uncommon and can be considered a bug somewhere else.

    sel = Selector(htmlresponse, contenttype='text/xml')
kmike commented 10 years ago

I think that following MIME types is not necessary:

1) as @nramirezuy said, this is more letters to type; 2) are there reasons why we want to support anything besides xml and html in Selector API? If not, supporting MIME types as values could be misleading.

Two more API ideas:

dangra commented 10 years ago

I definitively think the keyword length isn't the issue here, if it was then s is much better letters save than sel because it is typed lot more times than contenttype vs its contenders.

As for JSONPath, I still don't see why selectors can't support it even if .xpath() or .css() makes no sense over JSON content. Not sharing underlying code for matching json documents is an implementation detail, the other parts of the api like .extract() and .re(), nesting and SelectorList still makes sense.

barraponto commented 10 years ago

How would I extract the contenttype information (or whatever it gets called in the end) from the Response object? I thought it'd be like Selector(text=res.body, contenttype=res.headers['Content-Type']).

If it works like that, I'm definitely for contenttype.

dangra commented 10 years ago

@barraponto: if you have a response object, you won't use contenttype at all, instead do Selector(response)

dangra commented 10 years ago

as I described in point 2 of https://github.com/scrapy/scrapy/pull/426#issuecomment-26352659, the contenttype constructor argument is only useful when you haven't a response object to select from, that arguably is very very uncommon in spider code.

dangra commented 10 years ago

@kmike: parse_as is as bad as flavor or type in the sense that it is a new term. method is lxml lingua for serialization only, it doesnt choose the underlying parser.

stav commented 10 years ago
  1. I like sel because it is short and very descriptive.
  2. parser seems like a good choice to set the parser (and other related stuff).
dangra commented 10 years ago

I'm decanting to contenttype with MIME types. hopes it doesn't hurt your feelings ;)

As I said, this is a rarely used feature that only makes sense with text argument.

stav commented 10 years ago

contenttype could also be used with response argument if the caller knows that the ContentType header and perhaps also the <meta> tag are wrong and wants to change it.

Poor parser parameter will never get created here, sigh.

dangra commented 10 years ago

Documentation improvements accepted trough additional pull requests ;)

ItemLoader needs a similar refactor just in case you feel idle.

So Long, and Thanks for All the Fish!

kmike commented 10 years ago

:cake: