lahwaacz / wiki-scripts

Framework for writing bots, maintenance scripts or performing data analysis on wikis powered by MediaWiki
http://lahwaacz.github.io/wiki-scripts/
GNU General Public License v3.0
27 stars 12 forks source link

ws/config.py: a replacement for the third-party modules #69

Closed agalileev closed 3 years ago

agalileev commented 4 years ago

Hi. I made some changes for #66. ConfigArgParse was replaced with argparse, and configfile - with configparser.

After some thoughts, I decided to limit the changes and do not rework the class / function interfaces where possible. I don't know how to make tests on changes properly. Are you using unittests or something else? How do you make the "production" testing?

And yes, the syntax of the config file had to be changed to become compatible with the configparser module. An example can be seen in the file examples/sample.conf.

lahwaacz commented 4 years ago

As far as I can tell, the --site parameter would not make sense anymore, because you can't have multiple site-specific options or even site-specific sections in the config. That's quite significant feature loss. If you can't have site-specific sections, maybe it would be better to implement at least site-specific config files.

agalileev commented 4 years ago

That's quite significant feature loss.

Is it? Eventually I have found the use case for --config (e.g., when you want to test the new config without replacing the one on the standard config filepath), but --site is looking quite useless.

  1. When we deal with the config file, we work on script-specific sections. The latter inherits all options from [DEFAULT] section including site-specific ones. This approach is straightforward. You do not need anymore "top_level_arg" manipulations in your code which results actually to nothing (no better usability, no better config management, etc.)
  2. I have thought about multiple site-specific config file management functionality, but... Do you use wiki-scripts on non-ArchWiki sites? Do you know anyone who uses it on two or more wiki projects simultaneously? I assume the answer is "no" for both questions. :-) If someone will use wiki-scripts on another site in future, he takes the config file, replaces the site-specific options, replaces script subsections and just runs the script. Simple and easy-to-use.

The old --site functionality cannot be restored (because configparser does not support such config file syntax), so I am ready to implement the functionality of multiple config files management if you are sure that this is really necessary, but it looks like an overkill to me.

lahwaacz commented 4 years ago

Do you use wiki-scripts on non-ArchWiki sites?

Yes - for example a wiki on localhost which I use for testing some stuff...

agalileev commented 4 years ago

Hm, ok, I shall do it. And about testing - do you make any non-standard (I mean non-unittest) testing?

lahwaacz commented 4 years ago

Depends on what you consider non-standard. There are "standard" unit tests for some submodules using pytest, some more involved tests which actually require internet connection to ArchWiki (they are currently disabled and probably broken) and some very involved feature tests for the database synchronization part (they spawn a separate PostgreSQL process and a web server hosting a local MediaWiki instance) which never worked reliably and are also disabled. Besides that, most of the features were implemented by just running the scripts in interactive mode and observing the behaviour.

agalileev commented 4 years ago

Hi. After hard painful thoughts I came up with an idea how to cross --config, --site and working with multiple config files. The two options (config and site) have been merged with the following implications:

  1. There is only one hardcoded directory for configuration files now (default to ~/.config/wiki-scripts).
  2. The following naming scheme for config files are used: $config_dir/<name>.conf.
  3. You can specify the config file you need by --config option: ./script.py -c <name>.

For example, you have the following config files in your config_dir:

localhost.conf
very-test-config.conf
archwiki1.conf
archwiki2.conf

Now you can choose a config for a script just with ./script.py -c very-test-config, ./script.py -c archwiki1, etc.

The benefits of such approach:

  1. No longer needed to specify /long/configuration/file/path in --config option. The config names are much more convenient. If you type the name which is not valid the FileNotFoundError will be raised (see ws.config.argtype_configfile() function for details).
  2. Multiple configuration files can be stored in one place and simply managed.
  3. Also multiple conf files for one site can be stored - just name them differently!

ws.config has been largely rewritten without any testing, so it potentially has the bugs. On this stage I just want your opinion about the concept.

lahwaacz commented 4 years ago

Nice, I agree that merging --config with --site makes sense, but how do you determine the default <name>? I don't think it should be anything related to Arch wiki, as that would be inconvenient for people who want to use wiki-scripts primarily with some other wiki. I think it could be just default, so people would either use default.conf only or they could make it a symlink to their primary config file. What do you think?

I think it would also be good to interpret the value for --config two ways: first check if there is a <name>.conf in the config directory (the way you're doing now), but if it fails, check if it is a path to an existing file. Using relative paths is pretty short and convenient with tab completion. This way it would be possible to e.g. test the example config from the repo with --config examples/sample.conf without having to copy anything or doing something weird about the config path where <name> is looked up.

agalileev commented 4 years ago

Ok, I have implemented two suggested changes. Unfortunately, I did not manage to stop in time and made some other changes :). I decided to split functionality of argtype_* between the type= converters and action= actions. It seems to me that it will be right if type= will do only value conversion, and action= will do further transformations and validity checks.

New types:

New actions (they are registered and can be used with aliases shown below):

As for argtype_comma_list_choices, I brutally dealt with it. :) It looks like a nasty hack, but this task can be done much simpler. The only precondition is to use the whitespace as the option values separator:

# in command-line
$ ./toc.py <some options> --toc-languages en ru fr <other options>

# in configuration file
[toc]
toc-languages = en ru fr

See toc.py changes in cb26980 for more details.

It seems that I changed everything that could be changed in ws/config.py. :) If you have any suggestions or feature requests - I am ready to deal with them. "Please, give me that task. I hope that is the task!" (c) BlackTeaHamburger. If no then I proceed to testing, debugging and writing docs.

agalileev commented 4 years ago

Hi. I am debugging ws/config.py right now by running example scripts. And yes, I've already fixed the quotes. :) Some of your requests have already been resolved, but I will comment on every of them.

agalileev commented 4 years ago

I move on to writing tests. Since I haven't worked with Pytest before, this may take a while. If you have new ideas on how I can make the world a better place, just let me know. :)

agalileev commented 4 years ago

Hi. I have wrote a bunch of tests for ws.config module. Two notes:

  1. argtype_config() was refactored. Config filename transformation code was moved to a separate private _get_config_filepath function for better testing.
  2. I do not know how to test getArgParser() and object_from_argparser() functions properly. They are too complex.
lahwaacz commented 4 years ago

Sorry, I won't be able to review your work this week. I'll try to get to it next week, otherwise feel free to ping me if nothing happens...

agalileev commented 4 years ago

No problem, I have an interesting task right now that consumes all my time. Tell me when you will ready to resume.

lahwaacz commented 4 years ago

@agalileev Sorry for the delay, I'll review your work today.

agalileev commented 4 years ago

Hi. I have a problem with tests after 08e6df3f. All tests for _get_config_filepath() are (expectedly) failed now. As I had told before, the only purpose of moving this piece of code to a separate private function was a better testing. We can test string-to-path-conversion functionality separately from file-existance-checking one. It is much more simpler than to test the relatively complex argtype_config() function entirely.

lahwaacz commented 4 years ago

Oh, I forgot about the tests - just fixed it. I'll try to add some tests also for the getArgParser and object_from_argparser functions. I also noticed some other small problems that need more investigation, otherwise it looks pretty good.

lahwaacz commented 4 years ago

Hi, I've added some tests for the object_from_argparser function. Unfortunately, 2 tests are failing - please see if you can fix it.

agalileev commented 4 years ago

In 2e0bcbb I have fixed your changes in 08e6df3. I do now know why the absent default config must result in args.config=None. If the file does not exist (either default or custom) the error must be raised. Instead I add a special case --config NONE (NONE string is case insensitive) if you want to work without config file deliberately. Now the argtype_config() workflow is:

  1. If -c NONE was specified, then args.config=None is set.
  2. If -c <any_string> was specified, then corresponding file is checked for existance.
  3. If no --config option was specified, then default config file is checked.

I mark one test as xfail because it is checking non-existent functionality now. A few tests for argtype_config() must be added, I shal do it tomorrow.

About object_from_argparser() tests failing. I have found, that we must parse option only ones. Otherwise anyone option (either global from ws.config or local from klass.set_argparser()) with required=True parameter may break the parsing. The problem is that we need to make initial parsing for --config option, then find a config file and make parsing again. I found a not very pretty workaround to parse sys.argv for --config manually. I am not sure that it is the best possible choice, but it looks like argparse has no suitable functionality for this case. See 8a04e49 for details.

I also add allow_abbrev=False to ArgumentParser defaults.

lahwaacz commented 4 years ago

The behaviour in https://github.com/lahwaacz/wiki-scripts/commit/08e6df3f83e9a795a06689af0fdf494f5f2cba8d is intentional, please don't change it. If somebody wants to use a very simple script, e.g. print-namespaces.py, all that is needed is to supply a couple of options (e.g. --api-url) and it does not make sense to force the user to specify --config NONE or to create the default config file.

As for the other commit - it might be better to create a separate ArgumentParser instance, define just --config and use it for initial parsing. This should ensure that the initial parsing behaves like argparse - now if you pass a wrong file to --config, you'll get an exception traceback instead of a brief error message.

agalileev commented 4 years ago

The behaviour in 08e6df3 is intentional

This is a mistake. Imagine a situation, when the user runs a script and wants to use the default config. He does not specify --config option. But for some reason default.conf is absent or is a broken link and the user do not know about it. Normally, the error should be raised, but... nothing is happend. args.config just quietly set to None and the script proceed! Actually, it can even work without fail, produce some (possibly mistaken) output and make some (possibly mistaken) edits on Wiki. And all this time the user has no any sign that something goes wrong! Now imagine how much effort it will take to find the cause of the error when it will revealed?

If you want to work without config file you must specify this explicitly. I have seen two possible implementations for this behaviour in another applications: either --no-config option or a special case for standard --config option (like --config NONE).

it might be better to create a separate ArgumentParser instance

Yes, it can be a solution. I need to make some investigation.

lahwaacz commented 4 years ago

No, it is not a mistake. It should work without forcing the user to create a default config file - that just doesn't make sense. If you want, you can raise the exception when default.conf is a broken symlink. To explicitly skip an existing config, I'd like --no-config more than --config NONE.

agalileev commented 4 years ago

I argee that creating a config as a precondition to a new user may look a bit weird, especially if he creates only one script that has no need in it. But your changes will result in undefined behavior if the --config option is not specified. In fact, there are two possible behaviours for every script without --config: to use default.conf or do not use any config at all. To determine which of them will take place you must check manually your default.conf every time you run any such script.

For example, you have three scripts to run in series:

./script1.py <options>        # uses default.conf
./simple-script.py <options>    # does not use config file
./script2.py <options>        # uses default.conf again

You need default.conf for script1 and script2, but what will you do with it when ./simple-script will be runned? Delete default.conf before and create it after? Do you really want to do such kind of work? And what if one day you delete default.conf for simple-script and forgot to restore it before script2 is runned?

Actually, 08e6df3 kills the "default config" idea as it was planned initially. Are you afraid that user will have to type an additional --config NONE option? Okay, but now user must type explicit --config archwiki (for example) for every script that intend to use config file to guarantee right behaviour.

I'd like --no-config more than --config NONE

This can be implemented easily with parser.add_mutually_exclusive_group() for --config and --no-config options, but we need to work out the final concept of --config option first.

agalileev commented 4 years ago

I think, we can use a more verbose error message as a possible solution, for example:

file does not exist or is a broken link: '<path_to_config>'. Create the config file or use '--no-config' option.

This a) explains the new user what he have to do; b) do not produce dangerous behaviour.

lahwaacz commented 4 years ago

There is no undefined behaviour in the case I intend: if --config is not specified, the behaviour depends on whether the default config exists or not, but the behaviour is well defined in either case.

Let me rephrase again. The intention is to have a shared config for all scripts, which supports sections for per-script options. So to specify options only for specific scripts, they need to be in a specific section in the config file - there is no danger of unintentionally changing the behaviour of other scripts. If you put them in the [DEFAULT] section, I think the section name is good enough to indicate that such options affect all scripts. This is more or less the same behaviour as there is now, I don't want to change that.

agalileev commented 4 years ago

the behaviour depends on whether the default config exists or not

But this is exactly a problem! Look:

$ ./script.py --cache-dir <some_dir> --api-url <some_url>

A question: "Does this script use defautl.conf?" An answer: "I do not know. May be yes, and may be not. It depends on whether the default config exists or not. May be there are another options in klass.set_argparser() that must be read from default config, but if there is no default.conf on filesystem they wil be set to default values, which are possibly mistaken for this run. We have no guarrantee that the script will behave as we intend because we do not now is there default.conf on filesystem or not".

Another example:

$ ./script1.py --cache-dir <...> --api-url <...>
$ ./script2.py --cache-dir <...> --api-url <...>

A question: "Can you determine with this two strings (without checking the filesystem for a file), what script is working with default.conf, and what is working without any config?" An answer: "No, I cannot"

Do you see? This is what I mean "undefined behaviour". If we have two separate logics on "--config is not specified" case then one command can do two different things depending on default.conf presence. This is the source of big problems in the future. You will have to constantly check for the file if you have two types of scripts and want to guarantee no mistakes.

And now look for an alternative:

$ ./script1.py --no-config --cache-dir <...> --api-url <...>
$ ./script2.py --cache-dir <...> --api-url <...>

Here we can say with confidence:

This is what I can call "well-defined behaviour" :)

lahwaacz commented 4 years ago

If the user can't remember if their config file exists or not, they can't expect anything from the configuration system. If you want a nicer way to check which values apply for a given script invokation, you can try to reimplement ConfigArgParse's print_values function and bind it to an option such as --print-values. But I don't find this feature necessary.

As for "constantly checking for the file": of course the user will never do this, they will use the defaults first (which will either succeed or fail), then they might pass a cli option or create a config file for the script to change the default. But this is not a problem, because if you configure the script in the proper config file section, it will not affect other scripts in any way. Again, it is the user's responsibility to write the config in such a way that they don't get surprised in the future.

Please remember that this PR is about replacing third-party modules, not about major reimplementation of the configuration system.

agalileev commented 4 years ago

Okaaay, I roll changes back to 08e6df3. Just for the record — this is a mistake :(

I need more time to fix "args.config parsing", but for now all tests are passing.

In our discussion it sounded indistinct, so I will clarify — do we need in --no-config option implemented or this will be a major reimplementation of the configuration system too?

lahwaacz commented 4 years ago

In our discussion it sounded indistinct, so I will clarify — do we need in --no-config option implemented or this will be a major reimplementation of the configuration system too?

I don't consider this to be major, so I'll leave it up to you.

agalileev commented 4 years ago

Hi. I have tried some ideas how to handle args.config parsing without an error. The core problem was that if we use two fully separate parsers than the options only from the first of them gets in --help output — but if we use the single parser, any option with required=True can break the parsing.

I have found that the problem can be solved if to inject config parser into the main parser with the parents= parameter. I have no idea how (and why) this works, but it works... The tradeoff is that getArgParser() has the interface changed. It returns a tuple of two now. The first element is the main parser, and the second is a new "parent" parser with the only --config/--no-config options group.

If the script uses only object_from_argparser() it will work without changes. If the script uses getArgParser() it must be changed. The string

argparser = ws.config.getArgParser()

must be replaced with

argparser, _ = ws.config.getArgParser()

I have changed examples/parse-config.py script for example.

May be I have something missed, but all tests are passed.

lahwaacz commented 4 years ago

Hi, good job. Actually we don't need to use parents=, only add_help=False for the auxiliary parser with the config arguments.

I also moved some things around and split the config file parsing from object_from_argparser into a separate function parse_args, because I realized that there are scripts which use plain getArgParser without object_from_argparser and the config file would not be parsed by those scripts.

There is still (at least) one behavioral difference, though: config files should be allowed to contain unrecognized arguments/variables, but unrecognized arguments specified on the command line should be reported as errors. I've added two test cases for this, one of them fails.

agalileev commented 4 years ago

I have added the check for unrecognized options in cli_args. But there is one case which is not covered by this logic: when the multiple short options are joined together with a single - prefix. argparse manages such cases well, but checking these options manually will result in a big piece of code.

Moreover, such options can produse a separate error message. Consider parser with only defined -a option which is not taken explicit value from the command line (e.g., action="store_true"). Parsing args, remainder = ap.parse_known_args(["-abc"]) produces an error:

error: argument -a: ignored explicit argument 'bc'

It does not perceive b and c as options at all and tries to treat bc as an explicit value for -a option.

agalileev commented 4 years ago

Theoretically, the problem can be solved with two-step parsing. We can parse config_args with the parse_known_args() which do not check for unrecognized options, and the parse cli_args with parse_args() (which do). It will be "a right way". But here we will face with the old problem: any required=True option with high probability breaks parsing and the whole script.

I am not be able to investigate this problem until next Tuesday because I have some jobs I must to do (including monthly syncing of AW articles). After that I will be ready to fight this PR until death, its or mine :).

lahwaacz commented 4 years ago

But there is one case which is not covered by this logic: when the multiple short options are joined together with a single - prefix. argparse manages such cases well, but checking these options manually will result in a big piece of code.

I think it can be handled quite easily, since short options are not allowed in config files, so any short option remaining after the parsing means an error. It is just a matter of figuring out the right error message (and checking that short options in the config file are actually not possible).

I am not be able to investigate this problem until next Tuesday

Take your time, there is no rush - I'm quite busy these days too...

agalileev commented 4 years ago

Hi. It looks like I understand how to work with multioptions. Consider such case:

_, remainder = ap.parse_known_args(["-abc"], namespace=args)
  1. When argparse sees multioption block, he tries to recognize the separate options it consists. If the first of them (-a) was defined earlier argparse makes parsing as usually, with assignments or raising errors. We cannot do something here because this is argparse's internal logic.
  2. If -a was not defined argparse places the whole block to remainder as unrecognized. Here we can catch it with if item.startswith("-") and item in cli_args statement and make some error handling.

I have also added the check for short options in a config file and some tests.

agalileev commented 4 years ago

Hi, Jakub. Do you have any update for this pull request? I will have a lot of free time this month, so if you have any ideas or suggestions I can work on them.

lahwaacz commented 4 years ago

Hi, sorry I did not find time for this in the last month. I'll try to look at it at the weekend.

agalileev commented 4 years ago

Ok, look. The main purpose that I intended with this PR was to get practice in Python programming on real project, and more or less, the goal has been achieved. If you are not very enthusiastic at the thought of merging this PR (for whatever reason), then there is no need to continue. We can simply close it to save our time for other jobs. :)

lahwaacz commented 4 years ago

No, just closing the PR would be a waste of our already spent effort. It's not about enthusiasm, it is a good idea and I actually like your changes and the general direction this PR took, but I simply do not have as much free time these days as I had in the summer.

I think that the code is mostly finished, but before merging we should write some brief documentation and/or a migration guide for other users (they are not many but we should still be nice to them). Or at least make a "beta" release or ask them for testing it or something like that. What I mean to say here is that the last step is more than just clicking the merge button.

agalileev commented 4 years ago

Ok, then I shall wait. Tell me if I can do something to help.

lahwaacz commented 3 years ago

This took much longer than I'd like to admit, but it is finally merged! I apologize for letting you wait so long...

I've released version 1.3 just before the merge, so we can polish the details and documentation for the next version (which will be probably 2.0).

agalileev commented 3 years ago

Oh my!.. This day has finally come. It looks like I shall drink vodka for a month or two...