ppannuto / python-titlecase

Python library to capitalize strings as specified by the New York Times Manual of Style
MIT License
254 stars 38 forks source link

Read from an abbreviations file outside of titlecase() and only when invoked from command line #60

Closed igorburago closed 4 years ago

igorburago commented 4 years ago

If the titlecase() function is called directly (say, from a library), the caller can still use a custom list of abbreviations by passing an appropriate function as callback.

Fixes #58.

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-3.9%) to 81.218% when pulling 2882fc63eb5cba0740dcd90bcafb2d50a1d5165b on iburago:no-wordlist-arg into c88fb220916e926e22d6764f31c78b66387b21f2 on ppannuto:master.

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-3.7%) to 81.443% when pulling bc0fbca34af2e1853aeb26d7a1f6d4667b98e0f9 on iburago:no-wordlist-arg into c88fb220916e926e22d6764f31c78b66387b21f2 on ppannuto:master.

igorburago commented 4 years ago

If we choose this PR over #59, we might want to rename create_wordlist_filter() to create_wordlist_filter_from_file() (or something similar), and allow the user to import it to simplify creating the corresponding callback function.

But we can even keep that function private, of course, unless we expect a great number of users who are calling the titlecase() function directly—rather than using the command line tool—to load their abbreviations lists from files. In that case, we would save the user some time and effort in reimplementing reading the list from file and creating a callback.

Here I keep this function importable, but not added to __all__.

igorburago commented 4 years ago

As a side note, I am not sure it is a great idea to have that logger.debug() call in the titlecase(). When the titlecase() is called frequently, it might have a small performance cost. However tiny it is, it still might be undesirable for such a small library function like titlecase().

I would remove that logging call or, at least, would protect it with if __debug__:.

I am also not sure that we really need any logging in create_wordlist_filter, but I’m fine with keeping it there (assuming there are reasons for that).

1kastner commented 4 years ago

I don't believe that this library targets high-performance use cases. Moreover, one old programmer's wisdom: Only optimize if you need to optimize! As long as there is no urge, it might be a waste of time and it might clutter the code. I could be persuaded by proper benchmarks though. In addition, have you checked the logging library whether they maybe already have something similar in place? I guess they did optimize their code.

1kastner commented 4 years ago

To be honest, for my use case I only care about the command line interface so the internals do not bother me so much as long as nothing breaks, there are tests, and everything is nicely documented.

igorburago commented 4 years ago

Regarding logging:

If Python is started with the -O option, the code protected by if __debug__: is optimized out when the code is loaded into the interpreter and, therefore, has no performance cost during interpretation later on.

The logger’s debug() method in the logging module does check whether the debug level of logging is enabled, and does not log the given message otherwise. But one still pays the (however small) performance cost of the function call and the log-level check.

It is unfortunate that no benchmarks were made when this logging code was added. It is adding code that may clutter, not removing.

I do not have time to do benchmarks at the moment, so, as I said previously, I would be all right with keeping the logging code for now, as long as we merge this PR (which makes no modifications to the logging code).

(I brought up logging as a side node, just out of curiosity.)

igorburago commented 4 years ago

As to the subject of this PR, your interest is not affected, then, since:

  1. The command-line interface remains unchanged with the command having exactly the same default behavior of reading the ~/.titlecase.txt abbreviations file as before.

  2. Those who are using the package as a library instead (by calling the titlecase() function directly), still have the ability to protect abbreviations loaded from a file by specifying an appropriate callback. The create_wordlist_filter_from_file() function provides an easy way for the user to make a callback with that same original functionality, so that calls like

    titlecase(str, wordlist_filter=file_path)

    simply change to

    titlecase(str, callback=create_wordlist_filter_from_file(file_path))
  3. The test for the wordlist-based filtering in the tests.py file was modified to use the new approach (just like shown above). The tests are all passed.

igorburago commented 4 years ago

Although I have touched on the advantages of this PR in #58, let me reiterate on them here a bit more elaborately.

  1. Separation of concerns. The titlecase() function is kept to its main purpose of title-casing a given string, and is not burdened by any particular abbreviation-list mechanism being embedded into it.

  2. Simplification of the API without loss of functionality. The titlecase() function still has (the original) mechanism for supporting any exceptions in capitalization via the callback parameter. That is, it uses a single interface for all kinds of custom words, regardless of wether they were loaded from a file or populated any other way.

  3. Abbreviation file parsing is still available through a callback. The create_wordlist_filter_from_file() function is kept around to support the same use cases (with the primary instance being the command implementation in cmd()). The only change is that the call to create_wordlist_filter_from_file() now happens outside of titlecase(), instead.

  4. Flexibility in composing multiple capitalization-exception sources. When all custom capitalization code is shaped into those callback functions, the user can easily compose them in any order, while currently callback is always called before wordlist_filter.

1kastner commented 4 years ago

That sounds great to me!

ppannuto commented 4 years ago

This makes a lot of sense to me. I was probably a little quick to move on pushing out the prior iteration of this, it had just sat for a while so I felt bad.

Strictly speaking, I think this will also require a 2.0 release as it changes the interface; not that I think a ton of folks have pulled the 1.0 version yet, but I would like to try to be a good citizen of the semver universe.

igorburago commented 4 years ago

Changing the major version number due to the change of the API makes total sense.

However, unless you have to, could you please postpone making a new release? I have another couple of pull requests in the works for cleaning up the regular-expression constants, for making it easier for the user to provide the list of custom small words, and, possibly, for doing some minor refactoring.

ppannuto commented 4 years ago

No urgency to release here. Excited to see the new PRs!