scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

[MRG+1] Added option to turn off ensure_ascii for JSON exporters #2034

Closed dracony closed 7 years ago

dracony commented 7 years ago

See #1965
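For context, `ensure_ascii` is the standard-library `json` flag this PR exposes: with the default `True`, non-ASCII characters in exported items are escaped to `\uXXXX` sequences; turning it off writes them as-is. A minimal illustration:

```python
import json

item = {"title": "Привет"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written unescaped.
raw = json.dumps(item, ensure_ascii=False)
```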

dracony commented 7 years ago

For some reason I just can't get the tests to pass on Py3, so annoying... And I don't have a local install of it atm, so I guess I'll be torturing Travis for a while)

codecov-io commented 7 years ago

Current coverage is 83.37%

Merging #2034 into master will increase coverage by 0.03%

Powered by Codecov. Last updated by 8a22a74...33a39b3

dracony commented 7 years ago

Fixed it finally )

dracony commented 7 years ago

Any feedback for it? Would really love to finally have this in my scraper )

redapple commented 7 years ago

@dracony , thanks for this. How will people use it? Via their own JSON exporter subclass? If so, how about introducing a setting for this? It could be simpler. Also, @mgachhui was suggesting an encoding parameter, not only the default UTF-8 (and applying this to XML exports too).

dracony commented 7 years ago

And here is the FEED_EXPORT_ENCODING setting =) Took me a while, I so hate all that encoding stuff))
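For reference, the setting is configured in a project's `settings.py`; when left unset, JSON feeds fall back to ASCII-safe `\uXXXX` escapes:

```python
# settings.py
# Write feed exports (JSON included) as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```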

dracony commented 7 years ago

Ah damn it, the CSV tests fail on Py3 because StringIO behaves differently =( Any idea how to fix this?

dracony commented 7 years ago

@redapple Yay, I made it work. Like it more now?

dracony commented 7 years ago

@robsonpeixoto Thanks, missed those. I amended the commit to pep8-ize it =)

dracony commented 7 years ago

Any other feedback on it? Would really be great to have it merged (I work with quite a lot of cp-1251 encoded stuff)

dracony commented 7 years ago

If you know of any other way to make it work on Py2 and 3, I'd be happy to implement it. Although I think the current one is best.

dracony commented 7 years ago

@redapple True, it seems I overcomplicated it a bit when I was trying to make the test pass. I removed those lines and the tests still pass 👍

dracony commented 7 years ago

@redapple @robsonpeixoto Ping =) Can this get merged now?

eliasdorneles commented 7 years ago

Hey @dracony -- sorry for the delay in reviewing, I've made some comments above.

The main thing that's not clear to me is the changes in the CSV exporter -- is line buffering really needed? It would be better to have it in a second PR (fewer decisions to be made to merge this one :) ), and it might deserve its own setting, since flushing at every line can be slower.

eliasdorneles commented 7 years ago

Also, if you could revise the style once more -- there is some minor stuff there (try running flake8 scrapy/exporters.py). Not a big deal, but it makes the code nicer for the reviewer's brain to parse. :) Thanks!

dracony commented 7 years ago

@eliasdorneles The change https://github.com/scrapy/scrapy/pull/2034#discussion-diff-69482186L203 was indeed irrelevant and changes nothing. It was a side effect of my fighting with the CSV writer.

I gave it one more try and made the CSV writer work without changing the encoding line by line (although in my defence, I got that approach from the csv manual page).

As for line_buffering being on, it does seem to be required, since turning it off broke the test entirely. I would be happy to skip the TextIOWrapper altogether, but the only way to do that would be to reopen the file with the encoding parameter specified. And considering it's the file object that is passed to the exporter, closing and reopening the file seems more like a hack.
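The wrapping approach being discussed can be sketched with the stdlib alone. Here a `BytesIO` stands in for the already-open file object Scrapy passes to the exporter (an assumption for the demo); wrapping it in `io.TextIOWrapper` sets the encoding without closing and reopening anything, and `line_buffering=True` flushes the text layer whenever a newline is written:

```python
import io

# Stand-in for the file object handed to the exporter.
raw = io.BytesIO()

# Wrapping sets the encoding without reopening the file.
# newline="" disables \n translation so the bytes are platform-independent.
wrapper = io.TextIOWrapper(raw, encoding="cp1251",
                           line_buffering=True, newline="")
wrapper.write("привет\n")

# line_buffering=True flushed on the newline, so the encoded bytes
# are already visible in the underlying buffer.
assert raw.getvalue() == "привет\n".encode("cp1251")
```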

eliasdorneles commented 7 years ago

@dracony Using TextIOWrapper is totally fine, the current PY3 code is already using it. I still find it weird that line_buffering is required for PY3 -- it looks like we'll have different behavior in PY2. Which test failed?

Please see my other comment about passing the ensure_ascii parameter -- it's best to use the kwargs.setdefault approach to maintain compatibility with subclasses. Aside from that little nitpick, the PR looks good to me, nice work! :+1:
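The `kwargs.setdefault` pattern mentioned here lets a base class supply a default without clobbering a value a subclass has already chosen. A minimal sketch with hypothetical exporter names (they do not match Scrapy's actual classes):

```python
import io
import json

class BaseJsonExporter:
    def __init__(self, file, **kwargs):
        # setdefault only fills the key in if the caller/subclass
        # has not already set it.
        kwargs.setdefault("ensure_ascii", True)
        self.encoder = json.JSONEncoder(**kwargs)
        self.file = file

class UnicodeJsonExporter(BaseJsonExporter):
    def __init__(self, file, **kwargs):
        # Subclass flips the default but still honours explicit overrides.
        kwargs.setdefault("ensure_ascii", False)
        super().__init__(file, **kwargs)

base = BaseJsonExporter(io.StringIO())
uni = UnicodeJsonExporter(io.StringIO())
```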

dracony commented 7 years ago

@eliasdorneles I changed ensure_ascii to kwargs as you suggested. I also found out why line_buffering=False was failing the tests: the fix was to also enable write_through, otherwise the buffer was not flushed to disk.
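The `write_through` flag being described here pushes every write straight through the text layer to the underlying binary buffer, even when no newline has been written yet (which is why it fixed the tests that line buffering alone did not). A small stdlib demonstration:

```python
import io

raw = io.BytesIO()

# write_through=True forwards each write to the binary buffer immediately,
# so nothing sits in the text layer waiting for a flush.
w = io.TextIOWrapper(raw, encoding="utf-8", write_through=True)
w.write("partial line")   # note: no trailing newline

assert raw.getvalue() == b"partial line"
```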

eliasdorneles commented 7 years ago

@dracony hey, thanks for digging! :+1: Hm, that kinda makes me suspicious of the tests -- they may be trying to read without a guarantee that the file is flushed. I don't think we should need to force a flush (we don't seem to need it for the PY2 code)... I'll have a look to confirm, thank you!

dracony commented 7 years ago

@eliasdorneles It's not the file that isn't being flushed, it's the TextIOWrapper.

eliasdorneles commented 7 years ago

Yup, you're right. :) Alright, this looks good now! Thank you, @dracony !

dracony commented 7 years ago

Yay =)


dracony commented 7 years ago

Ah I see. Fixed that

redapple commented 7 years ago

@dracony , thanks a lot! It took a while to review it properly, sorry about that.