pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0
453 stars 141 forks source link

Many options do not have a configured fallback value even if the documentation says so #990

Closed ichard26 closed 5 months ago

ichard26 commented 3 years ago

What's the issue:

Some of the configurable knobs don't actually have a fallback value defined even though the documentation explicitly states these values do have a default. Case in point, let's take a gander at the [mirror].timeout value -- it's documented as having a default of 10 seconds:

The default value for this setting is 10 seconds. https://bandersnatch.readthedocs.io/en/latest/mirror_configuration.html#timeout

buttt looking at the source code it actually doesn't have one: https://github.com/pypa/bandersnatch/blob/3b78e040a330e3308389e3d32d8c13e8b8c41af5/src/bandersnatch/mirror.py#L1012-L1014

Potential solutions:

I personally prefer the first one since having defaults makes sense so fixing the code to actually implement them is a good idea ... although this is a good opportunity to do a configuration adiut and see if any options shouldn't exist / or have a default / should have their default changed.

Additional context:

I was notified of this issue from a quick exchange in the Python Discord server: https://discord.com/channels/267624335836053506/463035462760792066/877942046952947722

cooperlees commented 3 years ago

The quoting of default refers to the default.conf file have in the repo for people to use to get started.

jbg commented 1 year ago

I just hit this same issue — I wrote a config file from scratch based on the documentation, and was surprised to find that "the default value for this setting is [some value]" didn't mean that the setting had a default value. I never even saw default.conf until I went to file an issue about the defaults not working as documented and found this issue.

cooperlees commented 1 year ago

PRs to link docs to default.conf and make configparser set default both welcome.

flyinghyrax commented 9 months ago

I've started looking at the documentation and code for mirror configuration. These are the options I've found so far:

Option Name Type Doc'd? Default? (Doc/Actual) Source File
json bool yes - / False configuration.py
root_uri str yes "https://files.pythonhosted.org/"[^1] "
diff-file str yes - / "" "
diff-append-epoch bool yes - / False "
storage-backend str no[^2] "filesystem"/"filesystem" "
digest_name str no - / "sha256" "
cleanup bool no - / False "
release-files bool yes True / True "
compare-method str yes "hash"/ "hash" "
download-mirror str yes - / "" "
download-mirror-no-fallback bool no - / False "
simple-format str yes "ALL" / "ALL" "
master str yes "https://pypi/org" / !!! mirror.py
timeout int yes 10 / !!! "
global-timeout int yes 18000 / !!! "
proxy str yes - / "" "
stop-on-error bool yes - / False "
workers int yes 3 / !!! "
hash-index bool yes False / !!! "
keep_index_versions bool no - / 0 "

There are a handful that aren't documented (that is, at least not on the mirror configuration page), as well as the ones that are set in default.conf but the don't have a default/fallback when a user's config file is loaded. I intend to update mirror_configuration.md in the docs as well as look into adding defaults in code for the options that don't have one.

Presently there is a function in configuration.py that validates most of the options while others are only read in mirror.py. I don't know if it would be a problem to consolidate all configuration loading into one place.

[^1]: root_uri's default value interacts with release-files. The description of the default in the docs matches the code, which is in pseudocode something like:

      if not release-files and not root-uri then
        "https://files.pythonhosted.org/"
      else
        ""

[^2]: Option and default are documented w/ storage backends, but not in the list of mirror configuration options. Could be listed with a cross reference?

cooperlees commented 9 months ago

Thanks for digging in.

Presently there is a function in configuration.py that validates most of the options while others are only read in mirror.py. I don't know if it would be a problem to consolidate all configuration loading into one place.

Welcome to move all to configuration if you feel there is an advantage

root_uri's default value interacts with release-files. The description of the default in the docs matches the code, which is in pseudocode something like:

Is there a missing question here? Pseudo code seems right.

Option and default are documented w/ storage backends, but not in the list of mirror configuration options. Could be listed with a cross reference?

Sure - cross reference. But lets avoid as much duplication as possible

flyinghyrax commented 9 months ago

Hi @cooperlees , good to hear from you!

Presently there is a function in configuration.py that validates most of the options while others are only read in mirror.py. I don't know if it would be a problem to consolidate all configuration loading into one place.

Welcome to move all to configuration if you feel there is an advantage

I'll look in to this. I have one idea I'd like to hear your thoughts on. I was wondering if it would make sense to always load the module's default.conf, whether or not there is a user config file, and have values from the user config overwrite/take precedence. That would make default.conf a single source for what the default values are, but it could also mean making default.conf more extensive and less suited for someone first getting started.

Otherwise I'm happy to just consolidate things into configuration.py a bit, but seeing where default.conf gets loaded when there's no file specified got me thinking about trying to expand its use.

root_uri's default value interacts with release-files. The description of the default in the docs matches the code, which is in pseudocode something like:

Is there a missing question here? Pseudo code seems right.

No missing question! I originally wrote the table as a list for my own reference, and when I posted it my footnotes came along too. Glad to hear my interpretation seems accurate. 🙂

Option and default are documented w/ storage backends, but not in the list of mirror configuration options. Could be listed with a cross reference?

Sure - cross reference. But lets avoid as much duplication as possible

Sounds good. 👍

Can't promise a timeline for a PR, but I'm happy I've found something I feel like I could contribute to, so I feel optimistic about finishing this.

cooperlees commented 9 months ago

I was wondering if it would make sense to always load the module's default.conf, whether or not there is a user config file, and have values from the user config overwrite/take precedence.

I like this idea a lot. Please lets unit test all scenarios tho please. Should be able to make configuration.py very clean as it seems you can just call read twice with configparser.

def merge_ini_files(file1: str, file2: str):
    config = configparser.ConfigParser()

    # Read the first INI file
    config.read(file1)

    # Read the second INI file
    config.read(file2)
    ...
flyinghyrax commented 8 months ago

I've opened a draft pull request (#1669). I've made it a draft because:

Working on the mirror documentation definitely increased my understanding of Bandersnatch's behavior with different options, but of course let me know if I've introduced any inaccuracies.

I was wondering if it would make sense to always load the module's default.conf, whether or not there is a user config file, and have values from the user config overwrite/take precedence.

I like this idea a lot. Please lets unit test all scenarios tho please. Should be able to make configuration.py very clean as it seems you can just call read twice with configparser.

The only concern I have thus far is root_uri - I think it needs to distinguish whether it is set in the user configuration file or not, so loading a default value for it can't be completely transparent. I'm going to play around with the idea anyway and see how far I get.

cooperlees commented 8 months ago

I do not know if it is preferable to review documentation updates separately from code changes, or to put them together in one PR.

Let's do the docs first but maybe

I maybe got a bit carried away rewriting/reorganizing and it may be too big of a change relative to the initial goal here.

All good to me - but let's just have it document reality and change the docs to state when we make the default.conf actually be the source of truth for all default options

I've also commented on the PR suggestions for some enhancements.

root_uri - I think we leave it required in the config maybe as I think it's good to make people create their own config file ...

Thanks again.

flyinghyrax commented 8 months ago

All good to me - but let's just have it document reality and change the docs to state when we make the default.conf actually be the source of truth for all default options

This makes sense! That will leave the docs "correct" in the interim while I figure out the configuration implementation bits.

I will go ahead and change the docs PR so the options that are currently missing defaults are marked as required.

flyinghyrax commented 7 months ago

@cooperlees - I've been fiddling around with config validation for a bit and I'm struggling some with how diff-file works.

  1. It looks like it should support diff-file = {{section_option}}/newstuff.txt (unittest.conf, test_storage_plugins.py), but the implementation breaks if there is anything in the value other than the reference, e.g. diff-file = {{section_option}} works but diff-file = {{section_option}}/mirrored-files doesn't. It follows the 'Invalid section reference' path and defaults to '(directory)/mirrored-files'. (This makes it look like it is working if the value happens to be {{mirror_directory}}/mirrored-files.)
  2. The mirror subcommand mirror.py:mirror seems to expect diff_file is optional and has "if diff_file" checks to not write the diff list if there's no path set. But validate_config_values sets diff_file to the empty string, and in the mirror function has diff_file = storage_plugin.PATH_BACKEND(config_values.diff_file_path) before any of the conditional checks, and Path("") evaluates to the same as Path(".") and is "truthy". So it seems like in practice a diff file is always written.

For both of these it is straightforward enough to "fix" them, but doing so changes externally observable behavior in a potentially incompatible way:

Thoughts on a direction to take?

(Also should I open a new issue for this config implementation stuff?)

cooperlees commented 7 months ago

Thoughts on a direction to take?

I prefer to just fix the magic file appearing and mark it loudly in change log. If anyone is relying on that, I feel it's better to break them and make them be explicit in their config file on their needs.

TIL that Path("") == Path(".") and thus truthy. "" is false, so yeah. Fun. I bet this was once a string before I pathlib'ed everything.

Thanks for noting these found bugs and asking questions on fun cases like this.

flyinghyrax commented 7 months ago

Happy to! 👍

And yeah the Path(“”) thing… I didn’t know either, had to try it in a REPL a few different ways to convince myself. 😆