Closed elacuesta closed 7 years ago
Merging #2456 into master will increase coverage by
0.08%
. The diff coverage is100%
.
@@ Coverage Diff @@
## master #2456 +/- ##
==========================================
+ Coverage 84.66% 84.75% +0.08%
==========================================
Files 162 162
Lines 9122 9155 +33
Branches 1353 1358 +5
==========================================
+ Hits 7723 7759 +36
- Misses 1141 1145 +4
+ Partials 258 251 -7
Impacted Files | Coverage Δ | |
---|---|---|
scrapy/settings/default_settings.py | 98.64% <100%> (ø) |
:arrow_up: |
scrapy/exporters.py | 100% <100%> (ø) |
:arrow_up: |
scrapy/extensions/feedexport.py | 72.95% <100%> (+0.42%) |
:arrow_up: |
scrapy/commands/parse.py | 21.15% <0%> (ø) |
:arrow_up: |
scrapy/shell.py | 68.61% <0%> (ø) |
:arrow_up: |
scrapy/pipelines/files.py | 71.18% <0%> (ø) |
:arrow_up: |
scrapy/utils/misc.py | 95% <0%> (ø) |
:arrow_up: |
scrapy/utils/datatypes.py | 60.21% <0%> (ø) |
:arrow_up: |
scrapy/core/downloader/handlers/http11.py | 92.12% <0%> (+1.28%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 7fc11c1...3a0a86e. Read the comment docs.
I like the feature and implementation.
XML out definitely gets easier to read.
However, I'm not sure about having prettified-JSON as default. On the other hand, everyone uses jq
these days right?
And it would be weird to prettify XML by default, and not JSON.
I'll have to sleep on it.
@elacuesta , I believe it also needs tests for non-default (None, 2, etc.) widths.
Hey @redapple, thanks for your comments. I added a few tests with distinct width values.
Regarding the default, I set it to 4
following @kmike's comment and my own desire, but I think it would be just fine to leave it optional.
Alright, I set FEED_EXPORT_INDENT_WIDTH=None
to maintain backwards compatibility.
Ping @redapple, @kmike: I'm sorry to bother you guys, but do you have any further comments on this? Thanks!
Looks good to me @elacuesta
Looking at the new tests, I think this change reverts https://github.com/scrapy/scrapy/pull/1950 .
You're right Paul. I personally like the square brackets having their own line even if indent=None
, I made this latest change as per Mikhail's suggestion of following stdlib JSONEncoder
's behaviour.
I think diverting from the standard in some very particular cases (like this one, or the newline after the XML declaration tag) is not that big of a deal, I can make the brackets have their own line again if you guys agree. Thanks for your time 😀
Previous way of exporting to JSON kind of makes sense; it seems it becomes unsupported after this PR: no indentation inside items, but each item is on a new line. It is not only a question about brackets. It can be seen as a backwards-incompatible change: wc -l
will stop working as a way to count exported items in json output.
What do you think about making indent
option work only inside individual items in case of json, not for the whole output? That'd be a bit weird though..
I think we all agree that the current JSON output (one item per line) is just fine. However it's not consistent with the current XML output (all items in one line). What would you say if I modified this PR to make both outputs consistent in the following manner:
indent = None
: "most compact representation", i.e. all items in the same line (same as the current behaviour for XML)indent <= 0
: each item on it's own line but no "internal" indent, square brackets on their own line for JSON (same as the current behaviour for JSON)indent > 0
: indent with the specified numberThat would be a little deviation from the "standard" in the sense that indent=0
would not imply newlines everywhere, but I believe it would be worth in this case. Thoughts?
It sounds fine to me. Alternatively, we can support string constants for indent:
Though 'indent' argument name could be confusing with these value names
Updated according to https://github.com/scrapy/scrapy/pull/2456#issuecomment-284468472.
I also added support for string values (at the setting level, not in the ItemExporter
classes), though I think the implementation might be a bit dirty. Looking forward to reading your comments.
I think we should support either None/0 indent values or string values, it doesn't make sense to support both. I like string values because they look more extensible and explicit - they are easier to read, and it is easier to add another mode.
I think string values should be syntactic sugar only for settings values, and the internal ItemExporter implementation should receive more consistent values. Personally, I would pick None/Int
values over strings if we were to support only one, I feel i'ts kinda weird for a parameter to receive either an integer or a string.
I think the problem is that there are in fact two different settings, cramped into one setting:
1) how to format the result; 2) how many spaces to use for indentation.
Strings + integers is a way to introduce a single option instead of two options: string values activate (1), while int value activates (2). With two options there'd be 3 string values for the first option, and an int value would be active only when mode is "indent items". So that's where weirdness comes from - instead of explicit 3 possible string values + a separate option for indentation there is a single option with 2 string values which also accepts int.
For me negative and None values instead of strings feels like we're using negative and None values to choose (1) mode; it doesn't seem right - first, we won't be able to add other modes (it is not extendable), and second, why use magic number constants instead of readable string constants?
For me strings + int sounds ok, but maybe two explicit options is better.
I've re-read the discussion this morning.
I now wonder if the compact mode for XML (current behavior, with all items on the same line) makes much sense for users. I do see the point of "lines" output though.
For XML output, I think having the "lines" mode by default (like for JSON) would be fine (but it's backward incompatible).
With that assumption, i.e. that all items on same line not being useful in practice, I would go for a setting saying "lines" (default) or "prettified/beautified" (e.g. FEED_EXPORT_PRETTY=True/False
) and an indent int as another setting (FEED_EXPORT_INDENT
).
The difference between XML and JSON is that new lines are allowed inside XML tags, but in JSON keys or values, where new lines must be escaped. So you can guarantee "one item per line" for JSON, but not for XML. Another difference is that in XML whitespaces and new lines between tags are significant - you can access them in a parser, they affect parse result. In JSON indentation and new lines don't matter, parse result is the same. Just saying :) I'm not an user of XML exports myself.
new lines are allowed inside XML tags
Ok. This is already true with the current "all items on one line" XML (which is not really the case then).
Another difference is that in XML whitespaces and new lines between tags are significant
Ok. But I'm not sure it is a big problem for Scrapy's item exports and their consumers. I don't expect users/parsers of items.xml
files to care about or choke on newlines between <item></item>
tags. Again, this is debatable as with any format breaking change. Thoughts?
@kmike @elacuesta , what do you think of what I commented earlier? The gist being:
<items>
or <item> elements
" to "one newline after <items>
, and one newline between consecutive <item>
: that way, XML output is similar to current default JSON output with opening and closing brackets on their own line, and each JSON item on its own line too ; that would be the new "compact" defaultSounds good to me @redapple!
Alright, I gave it a shot: https://github.com/elacuesta/scrapy/pull/2 Looking forward to your feedback @elacuesta and @kmike .
Hello fellows, thank you for revisiting this PR, I'm sorry I didn't have the time to comment before.
I thought the whole point of separating the beautifying into two settings was to allow for multiple modes of indentation, beyond on/off. If we are not doing that anymore (it seems to me like on/off it's all we need) I'm not sure why we can't stick with the original implementation (following the standard json
behavior).
Don't get me wrong, I really like Paul's approach. I think it's definitely the way to go in the sense that it easily allows the introduction of new modes, I just think that we might not need those modes and the code could be even simpler :-D
@elacuesta ,
now that I check your earlier commits, and after sleeping on it a bit, it does indeed look like using indent>0
for prettifying is sufficient. And simpler is better here: no real need for 2 settings. @kmike's reference to "indent" meaning for stdlib json does make a lot of sense:
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.
But at the same time, I think this PR should still preserve current formatting of JSON with opening and trailing square brackets on their own lines and each item on their own line too.
For XML, I'll stick to my comment earlier that I would prefer to change the default XML serialization to have individual <item>
elements starting on a new line.
I think I can summarize my updated view as:
<items>
element and closing tag for XML output) should still be on their own lines: this means a change for XML output vs. current behavior;How does that sound? Sorry for the noise and changing of mind. And thanks @elacuesta for being persistent!
By all means, thanks to you :-) I agree completely. I believe this PR, up until the third commit (https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428), covers all of that. Maybe we could just remove the fourth and last commit (https://github.com/scrapy/scrapy/pull/2456/commits/a630839d6d098fef27119de25843129f41bf0228), what do you guys think?
I see that https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428#diff-0f004a6e06393cc5e29012b83956eed2R538 still has the "all on 1 line" XML
Right, but that's only with FEED_EXPORT_INDENT=None
, which is there for consistency with stdlib json
("None (...) selects the most compact representation"). The default value is FEED_EXPORT_INDENT=0
, which causes each item to go on it's own line.
Alright, so FEED_EXPORT_INDENT=None
also serializes to JSON as [{},{},{}]
without newlines?
Thanks @elacuesta .
I've tested https://github.com/scrapy/scrapy/compare/c7bb2fa8ce2633d92a7ec2840f84b174a5494428?expand=1 locally and it works as I would expect, for None
, 0
or >0
.
I think the only non-obvious thing for me when reviewing the change are the diffs in test_export_no_items_store_empty
: you could even leave it unchanged with the default FEED_EXPORT_INDENT=0
. You seem to cover enough for None
with test_export_indentation
.
Other than that, yeah, https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428 looks good to me.
@kmike , what's your opinion on using only 1 setting as @elacuesta comments in https://github.com/scrapy/scrapy/pull/2456#issuecomment-294179629 ?
Thanks for your comments @redapple. I removed the 4th commit and left the branch at c7bb2fa
.
About the changes in test_export_no_items_store_empty
, if I remove the FEED_EXPORT_INDENT=None
I would need to add a \n
to the xml
line, so there would still be a change in there. I think I prefer to explicitely set FEED_EXPORT_INDENT=None
so we don't rely on defaults and know exactly what to expect, what do you think? Thanks again!
Looks good to me, other than a few minor comments!
Hello folks, these are my two cents to fix #1327. I look forward to reading your comments 😄