mk-fg / pretty-yaml

PyYAML-based module to produce a bit more pretty and readable YAML-serialized data
Do What The F*ck You Want To Public License
135 stars 29 forks source link

I don't understand why we can't pretty print long unicode strings with newlines #12

Closed fiatjaf closed 9 years ago

fiatjaf commented 9 years ago
>>> print pyaml.dump({'a': 'asldnsa\nasldknsa\n'})
a: |
  asldnsa
  asldknsa
>>> print pyaml.dump({'a': u'asldnsa\nasldpáknsa\n'})
a: "asldnsa\nasldp\xE1knsa\n"

I tried to read the source and inspect it, but couldn't figure out where is the problem. Is this solvable? If I know all the values I'm trying to print are UTF-8 strings, not bytes or anything else, is it possible to patch pyaml to allow me to print them beautifully? How?

mk-fg commented 9 years ago

Hmm, yeah, I think it should totally be possible... somehow. Don't really remember the magic myself, but iirc something like style='|' should be passed to yaml emitter function for this particular object. I suspect that it might be the case that some hack is missing for unicode that is there for bytes.

I'll look into this when I'll get a chance, thanks for reporting.

fiatjaf commented 9 years ago

I tried passing string_val_style='|', but it was the same. The problem is that when we reach https://github.com/mk-fg/pretty-yaml/blob/5930dfcefb86b26d7b461234b9ebcd3e311afd39/pyaml/__init__.py#L98-L113 it seems nothing more can be done, because yaml.representer.ScalarNode('tag:yaml.org,2002:str', data, style="|") generates the ugly string whenever there are unicode characters.

But I don't know. Nothing makes sense to me.

Anyway, thank you very much for this library.

mk-fg commented 9 years ago

Ah, unfortunate, but I'm sure it can be worst-case copy-pasted from the yaml code and jury-rigged to produce something nice, though hopefully it won't be that sad.

Thanks!

mk-fg commented 9 years ago

Wow, looking into this issue I discovered a true gem in PyYAML that I and apparently some other people don't know about: allow_unicode=True option of Emitter.

In [2]: yaml.dump({'a': u'asldnsa\nasldpáknsa\n'}, sys.stdout, allow_unicode=True)
{a: 'asldnsa

    asldpáknsa

    '}

It's still not perfect, but such unicode stuff was one of the main reasons why I bothered with this bag of hacks, and apparently it's fixable with just this one option (which existed since the dawn of times too!).


Was able to easily fix the multiline-unicode thing with it - that option comes into play in Emitter.analyze_scalar method, which sets special_characters=True for any unicode stuff without it.

Also added "IMPORTANT" note to the README, as I suspect just that option can probably be used instead of this whole module for most people. Yay!

Thanks again for pointing to this thing, I only wish you came by it waaaay earlier ;)

fiatjaf commented 9 years ago

Haha, I'm sorry I didn't. I still don't understand the implications of this finding, but I'm pretty sure your library is still needed. Look:

>>> print pyaml.dump({u'a': u'asldnsa\nasldpknsa\n'})
a: |
  asldnsa
  asldpknsa

>>> print yaml.dump({u'a': u'asldnsa\nasldpknsa\n'}, allow_unicode=True)
{!!python/unicode 'a': !!python/unicode 'asldnsa

    asldpknsa

    '}
mk-fg commented 9 years ago

With other options it can work better:

In [6]: yaml.safe_dump({'a': u'asldnsa\nasldpáknsa\n'}, sys.stdout, allow_unicode=True, default_flow_style=False, default_style='|')
"a": |
  asldnsa
  asldpáknsa

Though it'd be harder to pick nice-ish params in this one place for larger dumps, as e.g. even here I'd rather have default_style=None for dict keys and default_style='|' for that multiline value.

Note also yaml.safe_dump - that one won't add any python types, as these will never look good or be useful for simple data.

Still, I'd say that dump above is pretty nice as it is, maybe good enough to avoid adding extra module dep.

mk-fg commented 9 years ago

Oh, and yeah, to understand implications I meant above better, just look at all the examples in the README, where yaml does this weird \u0414\u043B\u0438\u043D\u043D\u044B\u0439... stuff - none of that with this option!

I'll need to correct all these to be fair to PyYAML already supporting this stuff properly, just me not being able to see it.