sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/
Other
6.22k stars 2.04k forks source link

[config] Support more configurable/precise build-year copyright substitution. #12451

Open jayaddison opened 3 weeks ago

jayaddison commented 3 weeks ago

Describe the bug

Projects can configure multiline copyright notices (ref #4925), a useful feature for projects that have transitions between copyright holders over time.

However, the current config year-substitution logic enabled when SOURCE_DATE_EPOCH is configured will attempt to pattern-match and substitute the end-year from the build-year in all of the copyright lines.

I'm not sure that's intended; I think that we should only apply the substitution for the most recent (which for an iterable I suppose means the final) line in the notice.

How to Reproduce

conf.py

copyright = [
    '2000-2001 - first copyright holder',
    '2002-2003 - second copyright holder',
]

index.rst

Example Document

Expected

['2000-2001 - first copyright holder', '2002-2024 - second copyright holder']

Actual

['2000-2024 - first copyright holder', '2002-2024 - second copyright holder']

Environment Information

Platform:              linux; (Linux-6.8.12-arm64-aarch64-with-glibc2.38)
Python version:        3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0])
Python implementation: CPython
Sphinx version:        7.4.0+/6471027d1
Docutils version:      0.21.2
Jinja2 version:        3.1.4
Pygments version:      2.18.0

Sphinx extensions

N/A

Additional context

Discovered while investigating whether multiline copyright config could be a way to reduce risk of rewriting unintended copyright lines re: any interaction between matplotlib/matplotlib#28418 and #12450.

jayaddison commented 3 weeks ago

Although I feel like I want to make adjustments here and may have ideas about how to do that, I'd appreciate input from other maintainers/developers because churn/incorrect fixes here could potentially be more disruptive than simply leaving the code-is.

AFAIK we don't currently have a way for downstream projects to entirely disable the dynamic copyright substitution logic when SOURCE_DATE_EPOCH is enabled.

The matplotlib project is not affected; they do have a multi-statement copyright notice, but all the statements are configured in a single str (not a list) copyright config value, and, crucially, the first line of that text uses a Unicode en-dash instead of an ASCII dash (minus symbol), which Sphinx doesn't match on during substitution.

jayaddison commented 3 weeks ago

Also: maintainers: please feel free to convert this into a GitHub discussion if that's a more appropriate venue for this thread.

jayaddison commented 3 weeks ago

I'm not convinced that my assertion that only the most-recent/last copyright line should be updated is always true.

Although updating every line does seem incorrect in some situations, it seems unclear/intractable how to determine which lines to update. This could be a reason why some projects have chosen to avoid the substitution logic by using en-dash.

In terms of ideas for routes towards improvements, though -- and bearing in mind that, as far as possible, we should avoid breaking existing projects:

Perhaps we could provide a {{current_year}} helper variable so that documentation projects that do want dynamic year values could opt-into a documented, somewhat-supported way to dynamically retrieve that build-time value (with support for SOURCE_DATE_EPOCH).

Projects that do want dynamic copyright notices could then use that variable in either their single-line or multi-line copyright notices, as appropriate for their use case(s).

As mentioned in a previous thread I'm not too keen on dynamic copyright notices personally, but we do have practical examples of Sphinx projects that use them, and a way to retrieve the current year could help us support those cases. If I were to implement that, I'd document the substitution variable with a note to indicate that it is available for use, but not necessarily recommended.

jayaddison commented 2 weeks ago

I'd propose a combination of two changes:

  1. Add a config setting to allow downstream sites to disable copyright year substitution entirely if they choose.
  2. Add the ability to substitute a specific {{ current_year }} pattern within the copyright notice, so that sites can opt-in to current-year substitution with more control, compatible with both single-line and multi-line copyright notices.

I'll attempt to draft a pull request for this soon (within the next week or two).

jayaddison commented 1 week ago

Exploratory thoughts: in some ways I think that using the last-modified-time (aka mtime) of the most recent source document would be a better value to use during current-year copyright notice substitution. However, from practical experience, and for understandable convenience, many Sphinx projects build directly from git clone (checkout of source control) of their documentation sources -- and that cloning process does not preserve the last-modification time of files (in other words: a file last edited in source control two years ago, when retrieved in a clone of the containing repository today, will have today's date in its timestamps (created and modified)).

(context: if we are to add current_year replacement -- something I'm weighing up the pros and cons and moral qualms about - it would be good to make that value reasonably-accurate in standard/typical use cases, and relatively difficult to unwittingly build with a nonsensical value. a nice side-effect (although not the design goal) of using latest-modified-file time as the value would be that the emitted current-year would reflect the last time the source work was indicated as modified)

Although technically feasible, I don't think that it would make sense to attempt to add code to determine whether the source files are contained in a version control system (in order to detect modification times) -- doing that would mean that we end up maintaining code to handle an arbitrary and incomplete list of version control systems.

jayaddison commented 1 week ago

An incomplete, approximated code search for path:conf.py /^copyright.*today/ using GitHub code search today (20240705) returns over four thousand results. No doubt some of those are duplicates, but even so I think this indicates that there is a relatively frequent desire/use-case for current-year in Sphinx copyright notice config.

One reason that I checked that is to determine whether a harsher/breaking-change approach of disabling dynamic copyright notices would be acceptable. Initially I feel that it is not -- and also it would simply have the effect of pushing the problem elsewhere rather than solving it.

Given that context, the current use of SOURCE_DATE_EPOCH-based replacement to achieve replacement selectively in the case of reproducible builds seems like a pragmatic approach.

Possible improvements remaining in my mind:

There is also a risk of trying to be too clever and creating unnecessary complexity.

jayaddison commented 1 week ago
* Could/should we detect dynamic vs static `copyright` config, and only perform substitution in the dynamic case?

Based on my understanding of Python: detecting statically-declared strings may be possible --we could locate and parse the conf.py and then apply some heuristic/AST checks -- but figuring out whether an f-string, for example, actually varies or not in practice is not feasible. So: it would be somewhat complex, and would have some false-positives (string considered dynamic although it may not vary). On the plus side, this would allow many simple, statically-configured projects to build reproducibly and correctly without modifications.

* In the case of multiline statements could/should we detect end-years _not_ to modify?  (for example: perhaps we should only update the largest year values?)

In the general case this seems intractable, because there's very little that we can infer from copyright year ranges in multiline statements. Did contributions stop because of a change of ownership? Or simply because a contributor no longer provides edits to a project? We can't know those during isolated computerized builds from source.

However, it's possible that we could improve the substitution heuristics; for example: we could determine the system-clock year and only perform replacement of years in the notice that match that. The only edge case I can think of there is dynamic copyright values that insert a different year -- but I can't think of a valid reason for a project to do that.

jayaddison commented 1 week ago

From my mental notes, three ideas remain valid, and here's how I rate them currently:

Edit: update checklist x2

jayaddison commented 1 week ago

From my mental notes, three ideas remain valid, and here's how I rate them currently:

An additional fourth idea that also remains valid:

Edit: update checklist