scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

[MRG+1] Feed exports: beautify JSON and XML #2456

Closed elacuesta closed 7 years ago

elacuesta commented 7 years ago

Hello folks, these are my two cents to fix #1327. I look forward to reading your comments 😄

codecov-io commented 7 years ago

Codecov Report

Merging #2456 into master will increase coverage by 0.08%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2456      +/-   ##
==========================================
+ Coverage   84.66%   84.75%   +0.08%     
==========================================
  Files         162      162              
  Lines        9122     9155      +33     
  Branches     1353     1358       +5     
==========================================
+ Hits         7723     7759      +36     
- Misses       1141     1145       +4     
+ Partials      258      251       -7
Impacted Files Coverage Δ
scrapy/settings/default_settings.py 98.64% <100%> (ø) :arrow_up:
scrapy/exporters.py 100% <100%> (ø) :arrow_up:
scrapy/extensions/feedexport.py 72.95% <100%> (+0.42%) :arrow_up:
scrapy/commands/parse.py 21.15% <0%> (ø) :arrow_up:
scrapy/shell.py 68.61% <0%> (ø) :arrow_up:
scrapy/pipelines/files.py 71.18% <0%> (ø) :arrow_up:
scrapy/utils/misc.py 95% <0%> (ø) :arrow_up:
scrapy/utils/datatypes.py 60.21% <0%> (ø) :arrow_up:
scrapy/core/downloader/handlers/http11.py 92.12% <0%> (+1.28%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 7fc11c1...3a0a86e. Read the comment docs.

redapple commented 7 years ago

I like the feature and implementation. XML out definitely gets easier to read. However, I'm not sure about having prettified-JSON as default. On the other hand, everyone uses jq these days right? And it would be weird to prettify XML by default, and not JSON. I'll have to sleep on it.

redapple commented 7 years ago

@elacuesta , I believe it also needs tests for non-default (None, 2, etc.) widths.

elacuesta commented 7 years ago

Hey @redapple, thanks for your comments. I added a few tests with distinct width values. Regarding the default, I set it to 4 following @kmike's comment and my own desire, but I think it would be just fine to leave it optional.

elacuesta commented 7 years ago

Alright, I set FEED_EXPORT_INDENT_WIDTH=None to maintain backwards compatibility.

Ping @redapple, @kmike: I'm sorry to bother you guys, but do you have any further comments on this? Thanks!

redapple commented 7 years ago

Looks good to me @elacuesta

redapple commented 7 years ago

Looking at the new tests, I think this change reverts https://github.com/scrapy/scrapy/pull/1950 .

elacuesta commented 7 years ago

You're right Paul. I personally like the square brackets having their own line even if indent=None, I made this latest change as per Mikhail's suggestion of following stdlib JSONEncoder's behaviour. I think diverting from the standard in some very particular cases (like this one, or the newline after the XML declaration tag) is not that big of a deal, I can make the brackets have their own line again if you guys agree. Thanks for your time 😀

kmike commented 7 years ago

Previous way of exporting to JSON kind of makes sense; it seems it becomes unsupported after this PR: no indentation inside items, but each item is on a new line. It is not only a question about brackets. It can be seen as a backwards-incompatible change: wc -l will stop working as a way to count exported items in json output.

kmike commented 7 years ago

What do you think about making indent option work only inside individual items in case of json, not for the whole output? That'd be a bit weird though..

elacuesta commented 7 years ago

I think we all agree that the current JSON output (one item per line) is just fine. However it's not consistent with the current XML output (all items in one line). What would you say if I modified this PR to make both outputs consistent in the following manner:

That would be a little deviation from the "standard" in the sense that indent=0 would not imply newlines everywhere, but I believe it would be worth in this case. Thoughts?

kmike commented 7 years ago

It sounds fine to me. Alternatively, we can support string constants for indent:

Though 'indent' argument name could be confusing with these value names

elacuesta commented 7 years ago

Updated according to https://github.com/scrapy/scrapy/pull/2456#issuecomment-284468472. I also added support for string values (at the setting level, not in the ItemExporter classes), though I think the implementation might be a bit dirty. Looking forward to reading your comments.

kmike commented 7 years ago

I think we should support either None/0 indent values or string values, it doesn't make sense to support both. I like string values because they look more extensible and explicit - they are easier to read, and it is easier to add another mode.

elacuesta commented 7 years ago

I think string values should be syntactic sugar only for settings values, and the internal ItemExporter implementation should receive more consistent values. Personally, I would pick None/Int values over strings if we were to support only one, I feel i'ts kinda weird for a parameter to receive either an integer or a string.

kmike commented 7 years ago

I think the problem is that there are in fact two different settings, cramped into one setting:

1) how to format the result; 2) how many spaces to use for indentation.

Strings + integers is a way to introduce a single option instead of two options: string values activate (1), while int value activates (2). With two options there'd be 3 string values for the first option, and an int value would be active only when mode is "indent items". So that's where weirdness comes from - instead of explicit 3 possible string values + a separate option for indentation there is a single option with 2 string values which also accepts int.

For me negative and None values instead of strings feels like we're using negative and None values to choose (1) mode; it doesn't seem right - first, we won't be able to add other modes (it is not extendable), and second, why use magic number constants instead of readable string constants?

For me strings + int sounds ok, but maybe two explicit options is better.

redapple commented 7 years ago

I've re-read the discussion this morning. I now wonder if the compact mode for XML (current behavior, with all items on the same line) makes much sense for users. I do see the point of "lines" output though. For XML output, I think having the "lines" mode by default (like for JSON) would be fine (but it's backward incompatible). With that assumption, i.e. that all items on same line not being useful in practice, I would go for a setting saying "lines" (default) or "prettified/beautified" (e.g. FEED_EXPORT_PRETTY=True/False) and an indent int as another setting (FEED_EXPORT_INDENT).

kmike commented 7 years ago

The difference between XML and JSON is that new lines are allowed inside XML tags, but in JSON keys or values, where new lines must be escaped. So you can guarantee "one item per line" for JSON, but not for XML. Another difference is that in XML whitespaces and new lines between tags are significant - you can access them in a parser, they affect parse result. In JSON indentation and new lines don't matter, parse result is the same. Just saying :) I'm not an user of XML exports myself.

redapple commented 7 years ago

new lines are allowed inside XML tags

Ok. This is already true with the current "all items on one line" XML (which is not really the case then).

Another difference is that in XML whitespaces and new lines between tags are significant

Ok. But I'm not sure it is a big problem for Scrapy's item exports and their consumers. I don't expect users/parsers of items.xml files to care about or choke on newlines between <item></item> tags. Again, this is debatable as with any format breaking change. Thoughts?

redapple commented 7 years ago

@kmike @elacuesta , what do you think of what I commented earlier? The gist being:

kmike commented 7 years ago

Sounds good to me @redapple!

redapple commented 7 years ago

Alright, I gave it a shot: https://github.com/elacuesta/scrapy/pull/2 Looking forward to your feedback @elacuesta and @kmike .

elacuesta commented 7 years ago

Hello fellows, thank you for revisiting this PR, I'm sorry I didn't have the time to comment before.

I thought the whole point of separating the beautifying into two settings was to allow for multiple modes of indentation, beyond on/off. If we are not doing that anymore (it seems to me like on/off it's all we need) I'm not sure why we can't stick with the original implementation (following the standard json behavior).

Don't get me wrong, I really like Paul's approach. I think it's definitely the way to go in the sense that it easily allows the introduction of new modes, I just think that we might not need those modes and the code could be even simpler :-D

redapple commented 7 years ago

@elacuesta , now that I check your earlier commits, and after sleeping on it a bit, it does indeed look like using indent>0 for prettifying is sufficient. And simpler is better here: no real need for 2 settings. @kmike's reference to "indent" meaning for stdlib json does make a lot of sense:

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.

But at the same time, I think this PR should still preserve current formatting of JSON with opening and trailing square brackets on their own lines and each item on their own line too. For XML, I'll stick to my comment earlier that I would prefer to change the default XML serialization to have individual <item> elements starting on a new line.

I think I can summarize my updated view as:

How does that sound? Sorry for the noise and changing of mind. And thanks @elacuesta for being persistent!

elacuesta commented 7 years ago

By all means, thanks to you :-) I agree completely. I believe this PR, up until the third commit (https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428), covers all of that. Maybe we could just remove the fourth and last commit (https://github.com/scrapy/scrapy/pull/2456/commits/a630839d6d098fef27119de25843129f41bf0228), what do you guys think?

redapple commented 7 years ago

I see that https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428#diff-0f004a6e06393cc5e29012b83956eed2R538 still has the "all on 1 line" XML

elacuesta commented 7 years ago

Right, but that's only with FEED_EXPORT_INDENT=None, which is there for consistency with stdlib json ("None (...) selects the most compact representation"). The default value is FEED_EXPORT_INDENT=0, which causes each item to go on it's own line.

redapple commented 7 years ago

Alright, so FEED_EXPORT_INDENT=None also serializes to JSON as [{},{},{}] without newlines?

elacuesta commented 7 years ago

Yes, here's the test case: https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428#diff-0f004a6e06393cc5e29012b83956eed2R464

redapple commented 7 years ago

Thanks @elacuesta . I've tested https://github.com/scrapy/scrapy/compare/c7bb2fa8ce2633d92a7ec2840f84b174a5494428?expand=1 locally and it works as I would expect, for None, 0 or >0. I think the only non-obvious thing for me when reviewing the change are the diffs in test_export_no_items_store_empty: you could even leave it unchanged with the default FEED_EXPORT_INDENT=0. You seem to cover enough for None with test_export_indentation. Other than that, yeah, https://github.com/scrapy/scrapy/pull/2456/commits/c7bb2fa8ce2633d92a7ec2840f84b174a5494428 looks good to me.

redapple commented 7 years ago

@kmike , what's your opinion on using only 1 setting as @elacuesta comments in https://github.com/scrapy/scrapy/pull/2456#issuecomment-294179629 ?

elacuesta commented 7 years ago

Thanks for your comments @redapple. I removed the 4th commit and left the branch at c7bb2fa. About the changes in test_export_no_items_store_empty, if I remove the FEED_EXPORT_INDENT=None I would need to add a \n to the xml line, so there would still be a change in there. I think I prefer to explicitely set FEED_EXPORT_INDENT=None so we don't rely on defaults and know exactly what to expect, what do you think? Thanks again!

kmike commented 7 years ago

Looks good to me, other than a few minor comments!