Closed · dangra closed this 10 years ago
Travis needs a cssselect version with pseudo-elements improvements. Watch SimonSapin/cssselect#33
Travis passed! (: Can we merge and close this?
Merge! Merge! Merge!
The pull request is still in progress; when it's ready to merge, the title will change from WIP to MRG.
thanks for the enthusiasm guys!
Hi everyone watching: this PR has reached merge status. Who wants to review it before it's merged?
Hey, before I merge I want to bring two naming changes up for discussion.

This PR replaces `hxs` and `xxs` with `ss` in docs, spider code, and the shell. @pablohoffman suggested using just `s`; I find `ss` less prone to namespace clashes (especially in the shell) and easier to grep for. Which one do you prefer?
The `Selector` class grew a new constructor argument named `contenttype`; it chooses the parser + serialization method + CSS translation flavor to use with the response. To force XML specifics, you instantiate the selector like:

ss = Selector(text='<tag>...</tag>', contenttype='xml')

and similarly for HTML using `contenttype='html'`.

The name of this new argument is under discussion; so far the options are `contenttype`, `flavor`, or `parser`. I think it is important to get this right, since it is going to be part of the public API. Which one do you prefer?
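For concreteness, here is a minimal stdlib-only sketch of the idea being discussed: a single flavor switch that picks the parser, serialization method, and CSS-to-XPath translator together. The helper name and the table entries are illustrative assumptions, not the PR's actual code (the strings name real lxml/cssselect classes, but the mapping itself is made up).

```python
# Hypothetical sketch only: one flavor switch selecting parser,
# serialization method, and CSS-to-XPath translator at once.
_FLAVORS = {
    "html": {"parser": "lxml.html.HTMLParser",
             "method": "html",
             "css_translator": "cssselect.HTMLTranslator"},
    "xml": {"parser": "lxml.etree.XMLParser",
            "method": "xml",
            "css_translator": "cssselect.GenericTranslator"},
}

def resolve_flavor(contenttype="html"):
    """Map a contenttype string to parser/serialization/CSS settings."""
    try:
        return _FLAVORS[contenttype]
    except KeyError:
        raise ValueError("unknown contenttype: %r" % contenttype)

print(resolve_flavor("xml")["method"])  # -> xml
```

The point of bundling the three choices behind one keyword is that callers never mix, say, the HTML parser with the XML CSS translator.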
`sel` - it still has the same length as the old `hxs`/`xxs`, but is not as cryptic; `parser` would be a good name if it also accepted an instance of some parser class, not only string shortcuts. The `None` default value could be changed to `'auto'`.

@kmike: `ss` stands for Scrapy Selector, but I must admit `sel` is a good option; let's vote. On `contenttype`: for the same reason, I prefer `flavor`. The problem with `parser` is that it isn't just about selecting the parser; it also chooses the serialization and CSS translation methods.

`ss` is a bit awkward; `sel`, as @kmike suggests, is fine by me; `sltr` is cryptic, I know :) `contenttype` hints at a MIME type.

`s`, as an anonymous, short-lived variable name, without worrying about namespace clashes. I prefer `parser` to `contenttype`, and I don't worry as much as @dangra does about it conditioning other parts, unless those other parts would be pluggable too. Another option could be just `type`?

I really don't like `parser`: if we allocate it now for this, it can't hold a real parser instance later (without doing magic conditionals). By the way, +1 to defaulting `contenttype` (or whatever name it takes) to `'auto'`.
What about using exactly that, a content type as in HTTP headers? It would accommodate "text/html", "text/xml", and even "application/json" if some day the API supports JSONPath (http://goessner.net/articles/JsonPath/), and we'd thus use `contenttype` as the argument name. (Of course, there are a few variations in content types for XML and HTML, but the API could accept a handful.)
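To illustrate the proposal above, here is a small stdlib-only sketch (the function name is assumed, not from the PR) of normalizing "a handful" of MIME-type spellings down to the few flavors the API would actually distinguish:

```python
# Hypothetical sketch: accept several Content-Type spellings and
# normalize them to an internal flavor ("html", "xml", or "json").
def flavor_from_contenttype(value):
    # Drop any parameters such as "; charset=utf-8" before matching.
    mimetype = value.split(";", 1)[0].strip().lower()
    if mimetype in ("text/html", "application/xhtml+xml"):
        return "html"
    if mimetype in ("text/xml", "application/xml") or mimetype.endswith("+xml"):
        return "xml"
    if mimetype == "application/json":
        return "json"
    raise ValueError("unsupported content type: %r" % value)

print(flavor_from_contenttype("text/html; charset=utf-8"))  # -> html
```

This is the "few variations" point in code form: the caller writes a real MIME type, and the selector collapses it internally.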
If `contenttype`/`flavor`/`parser` is for when you give the selector some text, it makes sense to me to accompany the text with a Content-Type.
@redapple: the argument name is going to be exposed as a public attribute too. We foresee JSONPath support, although I was thinking of using `json` as the value for that. Your proposal restores the sense of `contenttype`; if we stay with it, using `text/html`, `text/xml`, ... fits better.
I wouldn't want to ruin the show here, but it's appearing to me now that the Selector class should probably be called a Document class of some sort, just like Nokogiri has Nodes and NodeSets, which have `.css()` and `.xpath()` methods to select sub-nodes. But Scrapy users are used to the "selector" term, so...
JSON looks different because you can't run XPath or CSS selectors over JSON, and you can't use `json()` on HTML or XML data. So it seems there would be almost zero shared code between the existing Selector and a json-enabled selector (the `re` method could still work). So I wonder if there is value in having a single selector class for that: based on content type you'd get totally different behavior and supported features, and changing the interface depending on a constructor parameter is not a good thing.

The unified selector API unifies two different concerns: 1) executing CSS and XPath selectors on parsed responses; 2) parsing HTML and XML responses.

I see Nokogiri as more similar to lxml than to Scrapy selectors. Scrapy Selectors don't aim to be able to modify the underlying parsed document, and they don't expose the parsed document either; they aim to be a querying tool.
OK:

- `sel` over `ss`. I use just one selector in scrapy, so I don't need the differentiation of "Scrapy Selector"; I just call them Selector and everybody understands what I'm talking about.
- `contenttype`: against, because it's long, and I don't want to write `text/html`, `text/xml` when the `text/` part really doesn't matter.
- `type` could be good, but it shadows a Python builtin.
- `ctype` can be confusing because of the C language.
- `parser`: not sure if it is safe to pass a parser instance to selectors, and it can be confusing.
- `flavor`: no objections; it is short and makes some reference :+1:

I think `sel` is definitely better than `ss`; my PEP8 checker screams at me when I use variable names shorter than 3 characters, because they will always be cryptic. `sel = Selector(yaddayadda)` is pretty obvious and doesn't hurt my eyes. As for `flavor`, it introduces more scrapyism, doesn't it? I like `contenttype`, or just `content`.
`sel`. I'm giving `contenttype` another chance. Selectors are typically instantiated from a response object:

sel = Selector(xmlresponse)

which makes `contenttype` only useful together with `text`, where an explicit MIME type makes sense:

sel = Selector(text="<tag>...</tag>", contenttype="text/xml")

And, as a last option, for forcing an incorrectly detected response to a specific type; I must say this is uncommon and can be considered a bug somewhere else:

sel = Selector(htmlresponse, contenttype='text/xml')
I think that following MIME types is not necessary: 1) as @nramirezuy said, it is more letters to type; 2) are there reasons why we want to support anything besides xml and html in the Selector API? If not, supporting MIME types as values could be misleading.
Two more API ideas:
Selector(text="<tag>...</tag>", parse_as='xml')
Selector(text="<tag>...</tag>", method='xml') # this follows lxml API
I definitely think the keyword length isn't the issue here; if it were, then `s` would be a much bigger letter saving than `sel`, because it is typed a lot more often than `contenttype` versus its contenders.

As for JSONPath, I still don't see why selectors can't support it even if `.xpath()` or `.css()` makes no sense over JSON content. Not sharing underlying code for matching JSON documents is an implementation detail; the other parts of the API like `.extract()` and `.re()`, nesting, and `SelectorList` still make sense.
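A stdlib-only sketch of that argument: even with a completely separate JSON backend, the `.extract()`/`.re()` surface can stay the same. `JsonPathSelector` is a hypothetical name, and real JSONPath matching is omitted; only the shared API shape is shown.

```python
import json
import re

class JsonPathSelector:
    """Hypothetical JSON-backed selector sharing the Selector surface."""

    def __init__(self, text):
        self.data = json.loads(text)

    def extract(self):
        # Serialize back to text, mirroring Selector.extract().
        return json.dumps(self.data)

    def re(self, pattern):
        # Same contract as Selector.re(): regex over the extracted text.
        return re.findall(pattern, self.extract())

sel = JsonPathSelector('{"price": "42 USD"}')
print(sel.re(r"\d+"))  # -> ['42']
```

Here `.xpath()` and `.css()` simply wouldn't exist (or would raise), while `.extract()` and `.re()` behave like their HTML/XML counterparts.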
How would I extract the `contenttype` information (or whatever it gets called in the end) from the Response object? I thought it'd be like `Selector(text=res.body, contenttype=res.headers['Content-Type'])`. If it works like that, I'm definitely for `contenttype`.
@barraponto: if you have a response object, you won't use `contenttype` at all; instead do `Selector(response)`. As I described in point 2 of https://github.com/scrapy/scrapy/pull/426#issuecomment-26352659, the `contenttype` constructor argument is only useful when you don't have a response object to select from, which arguably is very uncommon in spider code.
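The precedence being described might be sketched like this; `detect_flavor` and the response's `flavor` attribute are hypothetical stand-ins for whatever the real response object exposes, not Scrapy's implementation.

```python
# Hypothetical sketch of the constructor's decision order:
# a response carries its own detected type; contenttype is an
# explicit override, and is only required for bare text.
def detect_flavor(response=None, text=None, contenttype=None):
    if response is not None:
        if contenttype is not None:
            return contenttype  # rare: force a misdetected response
        return getattr(response, "flavor", "html")  # assumed attribute
    if text is not None:
        return contenttype or "html"  # default assumed to be html
    raise ValueError("a response object or text is required")
```

With a response, the argument is dead weight in the common case; with bare text, it is the only source of type information.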
@kmike: `parse_as` is as bad as `flavor` or `type` in the sense that it introduces a new term. `method` is lxml lingo for serialization only; it doesn't choose the underlying parser.
`sel`, because it is short and very descriptive. `parser` seems like a good choice to set the parser (and other related stuff). I'm leaning toward `contenttype` with MIME types; I hope it doesn't hurt your feelings ;)

As I said, this is a rarely used feature that only makes sense with the `text` argument.
`contenttype` could also be used with the `response` argument, if the caller knows that the Content-Type header (and perhaps also the `<meta>` tag) is wrong and wants to override it.
Poor parser parameter will never get created here, sigh.
Documentation improvements are accepted through additional pull requests ;)
ItemLoader needs a similar refactor just in case you feel idle.
So Long, and Thanks for All the Fish!
:cake:
Implementation of #395; serves as a continuation of #176.