textwrap should minimize number of breaks in extra long words

python / cpython

The Python programming language

https://www.python.org

textwrap should minimize number of breaks in extra long words #70402

Open c5d4bb9b-17bb-47bb-a3cc-ef774deba4a9 opened 8 years ago

c5d4bb9b-17bb-47bb-a3cc-ef774deba4a9 commented 8 years ago

BPO	26214
Nosy	@stevendaprano, @serhiy-storchaka, @tirkarthi, @iritkatriel, @akulakov

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library', '3.9', '3.10', '3.11']
title = 'textwrap should minimize number of breaks in extra long words'
updated_at =
user = 'https://bugs.python.org/TuomasSalo'
```

bugs.python.org fields:
```python
activity =
actor = 'andrei.avk'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'Tuomas Salo'
dependencies = []
files = []
hgrepos = []
issue_num = 26214
keywords = []
message_count = 6.0
messages = ['258999', '376368', '376387', '376388', '376395', '409241']
nosy_count = 6.0
nosy_names = ['steven.daprano', 'serhiy.storchaka', 'Tuomas Salo', 'xtreak', 'iritkatriel', 'andrei.avk']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26214'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```

c5d4bb9b-17bb-47bb-a3cc-ef774deba4a9 commented 8 years ago

This code:

    import textwrap
    textwrap.wrap("123 123 1234567", width=5)

currently* produces this output:

['123', '123 1', '23456', '7']

I would expect the textwrap module to break words only when absolutely necessary. That is, I would have expected it to produce one fewer break:

['123', '123', '12345', '67']

This is of course a matter of taste - the current implementation produces more efficiently filled lines.

(* I only have access to Python 2.7 and 3.4)

iritkatriel commented 4 years ago

You can already do this with the break_long_words argument of textwrap.wrap():

    >>> import itertools, textwrap
    >>> wr = textwrap.wrap
    >>> list(itertools.chain(*(wr(x, 5) for x in wr("123 123 1234567", width=5, break_long_words=False))))
    ['123', '123', '12345', '67']

serhiy-storchaka commented 4 years ago

The code with nested wraps is awesome. But it does not work well.

    >>> list(itertools.chain(*(wr(x, 5) for x in wr("123 123 1234567 12", width=5, break_long_words=False))))
    ['123', '123', '12345', '67', '12']

It is expected that '67' and '12' should be on the same line: '67 12'.

iritkatriel commented 4 years ago

One more wrap:

    >>> wr(' '.join(itertools.chain(*(wr(x, 5) for x in wr("123 123 1234567 12", width=5, break_long_words=False)))), 5)
    ['123', '123', '12345', '67 12']

iritkatriel commented 4 years ago

To clarify, this solution is a linear-time greedy one, with three passes:

1. The first pass puts each long word on its own line.
2. The second pass chops each long word into pieces of at most width characters.
3. The third pass wraps the result, now that no long words remain.

This minimizes the number of breaks within words. It doesn't minimize the number of output lines (you'd need a dynamic programming algorithm for that, O(n^2)). So for the input "123 12 123456 1234" and a width of 5, the three passes produce:

['123', '12', '12345', '6', '1234']

where you may (or may not) have preferred:

['123', '12 1', '23456', '1234']
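
For reference, the three passes described above can be folded into one small helper. This is only a sketch built on the public textwrap.wrap() function; the name wrap_min_breaks is made up for illustration and is not part of the textwrap module.

```python
import textwrap


def wrap_min_breaks(text, width):
    """Hypothetical helper: wrap text, breaking a word only if it exceeds width."""
    # Pass 1: wrap without breaking long words, so each over-long word
    # ends up alone on its own (too wide) line.
    lines = textwrap.wrap(text, width=width, break_long_words=False)
    # Pass 2: re-wrap every line on its own; only the over-long words
    # get chopped into pieces of at most `width` characters.
    pieces = [piece for line in lines for piece in textwrap.wrap(line, width=width)]
    # Pass 3: join the pieces and wrap once more, so that a short tail such
    # as '67' can share a line with the word that follows it.
    return textwrap.wrap(" ".join(pieces), width=width)


print(wrap_min_breaks("123 123 1234567 12", 5))
# ['123', '123', '12345', '67 12']
```

Like the one-liners earlier in the thread, this normalizes whitespace between words, and the default break_on_hyphens=True behaviour of wrap() still applies in every pass.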

akulakov commented 2 years ago

It may be worth fixing wrap() to do the nicer style of wrapping for long words. If we decide to do that, it should be done via a new parameter, because the same logic (the TextWrapper class) is used by shorten(), and in that case it may be preferable to keep a chunk of the long word rather than cutting it out entirely.
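
As a rough illustration of what an opt-in behaviour could look like, here is a sketch that subclasses TextWrapper instead of adding a parameter. It assumes the private _handle_long_word() hook found in CPython's textwrap.py, which is an implementation detail and may change; the class name MinBreakTextWrapper is hypothetical.

```python
import textwrap


class MinBreakTextWrapper(textwrap.TextWrapper):
    """Hypothetical wrapper that starts an over-long word on a fresh line."""

    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
        # Assumption: this private hook is called when the next chunk is
        # longer than the line width (CPython implementation detail).
        if cur_line:
            # The current line already has content: leave the long word
            # untouched so it is broken from the start of the next line.
            return
        super()._handle_long_word(reversed_chunks, cur_line, cur_len, width)


wrapper = MinBreakTextWrapper(width=5)
print(wrapper.wrap("123 123 1234567 12"))
# ['123', '123', '12345', '67 12']
```

Because textwrap.shorten() builds its own plain TextWrapper, a subclass like this leaves shorten() unchanged, which is the separation a new parameter would also need to preserve.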