ggmichael / craterstats

Craterstats - a tool to analyse and plot crater count data for planetary surface dating
BSD 3-Clause "New" or "Revised" License
30 stars 12 forks source link

Package Structure #1

Closed rbeyer closed 3 years ago

rbeyer commented 3 years ago

In order to make it easier to package up for PyPI and conda eventually, and also to make other maintenance things easier, you might consider altering the structure (how the files and directories are laid out).

There are some general guidelines and the PlanetaryPy Project offers some concrete examples. There is a PlanetaryPy Package Template, but that's really to help you start a new, empty repo, and uses some templating tools. However, you can take a look at the build branch to see an example of what a good layout might look like.

However, here are some specific suggestions:

Moving these things around will likely result in needing to adjust some (but not all) of the import statements and other minor changes. It is a chore, but will make maintaining this package easier in the long run.

ggmichael commented 3 years ago

Hi Ross,

Thanks for the suggestions. I've had a try - I've got myself into a bit of a tangle with it. Previously, I saw 'craterstats3' as being package-like, in the sense that it contains the functionality that could be imported into another application. 'craterstats.py' imports this library and implements a command-line interface to it. 'gm' is a personal library that I use across different programs.

I rearranged as you suggested, but I'm not sure I can any longer import 'craterstats3' (now 'craterstats') as a package, since now I am inside it. Or perhaps it's just become ambiguous whether I'm asking for the package or the module 'craterstats.py' if I write 'import craterstats'? I have to admit that figuring out imports has been the most difficult thing for me about python!

Do you have any advice? Does it make sense to import a package from within? Is it better to import a series of modules individually?

rbeyer commented 3 years ago

Thanks for the suggestions. I've had a try - I've got myself into a bit of a tangle with it.

Yes, import statements definitely take some effort to get the hang of.

Previously, I saw 'craterstats3' as being package-like, in the sense that it contains the functionality that could be imported into another application.

Yes, this is exactly how you should think of it.

'craterstats.py' imports this library and implements a command-line interface to it.

Right.

I rearranged as you suggested, but I'm not sure I can any longer import 'craterstats3' (now 'craterstats') as a package, since now I am inside it. Or perhaps it's just become ambiguous whether I'm asking for the package or the module 'craterstats.py' if I write 'import craterstats'? I have to admit that figuring out imports has been the most difficult thing for me about python!

Okay, this is the tricky part: craterstats.py is, itself a "part" of that package (because when you eventually distribute it, you want people to get that command-line program), and so must be "inside" the craterstats/ directory.

Indeed, the problem is that there is a name clash between the "package" name, and the name of the file that contains your command line program. I should have realized that earlier. I think there are two solutions:

  1. Your craterstats.py file cannot import craterstats, but must import things the same way that your __init__.py does (e.g. from .Epochs import Epochs). This can be unsatisfying, and may still result in some problems, I'm not sure without testing.
  2. Rename craterstats.py to a different name, so there is no name collision. The name collision comes from the fact that the Python import mechanism has a conflict between craterstats/__init__.py and craterstats.py. Giving it a different name should allow you to import craterstats from the newly-named file.

This second solution seems counter-intuitive, but let me tell you how it would work. Typically, people would name that file cli.py to clearly indicate that it is a command line interface (CLI) to the package/library. Now, you may wonder how would people that use your package be able to run:

> craterstats

to run your cli.py file when they get the package? The answer is in the magic of the Python package management tools. Eventually your repo will have a setup.py file like the example, and it will have this in it:

entry_points={
        "console_scripts": [
            'craterstats=craterstats.cli:main',
        ],
    },

With the above in your setup.py when someone installs the craterstats package via pip or conda install, they will suddenly magically have a craterstats program that does what they expect.

There are fancy things you can do to create a "development" environment for testing (so you have a "craterstats" command line program available while you work), but the easiest thing to do is move your current craterstats.py to craterstats/cli.py (or whatever name that isn't "craterstats"), and then create a new file called maybe cs at the top level of the repo, make it executable with this in it:

#!/usr/bin/env python

import craterstats.cli as cs

cs.main()

Then you can run /path/to/cs which should "execute" your command line program, and will be the most similar to what users will eventually see. That file doesn't need to be committed to the repo if you don't want to.

I know this sounds like a lot of bending around to do what you think you've already got "working" but this arrangement will help your future self out.

'gm' is a personal library that I use across different programs.

Ah. So you could create a separate gm repository and package it as a Python module separate from craterstats and then craterstats would depend on gm just like it depends on numpy.

However, I think at the moment, you can keep it as a part of craterstats. If you end up authoring a bunch of packages that all really do depend on gm, then you can split it out.

Do you have any advice? Does it make sense to import a package from within? Is it better to import a series of modules individually?

That's my advice. If you change the name of the "program file" then you should be able to import craterstats but it is still good practice to only import the things you directly need, and even then use the local from .Epochs import Epochs pattern.

rbeyer commented 3 years ago

Woah. Okay, so all of the above was me shooting from the hip. I decided to sit down with your code, and see if I could do what I described above. And I think I ran into what you were talking about.

This is a pretty big can of worms, so my apologies.

The problem is related to the Python import mechanism, the way you use it, and the way you have things laid out. Turns out, you can keep it named craterstats.py inside of the craterstats/ directory if you want to, but there may still be some gotchas that I didn't run into.

In your gm/ module's __init__.py and elsewhere you made heavy use of the import * mechanism, which brings all of the namespaces up to the top level and then you also have a few places "inside" of the gm/ structure, like file/read_textstructure.py that itself does an import gm which is kind of circular. All of this makes your current gm/ structure kind of fragile, and which is why moving it inside of craterstats/ leads to all kinds of trouble. In general, while developing a package you should strive to not use any import * patterns and only consider doing that at the very end once everything is working to maybe help out users of your module's API.

You also have a profusion of files in the gm/ structure which makes it even more complicated. My suggestion is to collapse all of the files in each sub-directory down to a single file with all of that code in the one file. So in the gm/ directory, I'd create the file maths.py and put all of the functions from all of the files in the current maths/ directory in that Python module, and then delete the maths/ directory. Right now, without all of your import * shenanigans, to get at the poisson() function, it lives at gm.maths.poisson.poisson() which has some repetition. If you put the poisson() function in the maths.py file in the gm/ directory then it would just be at gm.maths.poisson(). Then, for example in your Craterpdf.py file, the import statements might look like this:

from .gm import maths as gmmaths

# ... further down ...
pdf0=gmmaths.poisson(self.k,lam*10**x)

Or you could do this:

from .gm import maths

# ... further down ...
pdf0=maths.poisson(self.k,lam*10**x)

And then if you do this with all of the directories in gm/ some of those will need to reference each other, for example, the string.py module would need to have:

from .maths import sigfigs

# ... further down ...
    return sigfigs(n,sf)

Again, this may seem like more work, but for modules "inside" the package, they should always be using the from .<module> import <submodule> pattern.

This also allows you to delete the gm/__init__.py file, because now you're always reaching to import only exactly what you need. The import * pattern that you have in __init__.py should be avoided as much as possible.

So there are some other consequences of this change.

One is that having craterstats.py import demo.py and then having demo.py import craterstats.py seems like a bad move. However, the demo() function in that file is pretty small, and you can just pull it right into craterstats.py

Another is that you have some hard-coded paths to things, which means that if you put a cs executable file outside the craterstats/ directory, like I suggested above, then all of those paths will need to start with craterstats/, but more importantly, it means that the user's current working directory must always be at the repo level. That's fine for now, but we'll eventually need to make that more generic so users don't have to cd to a particular directory to run the program.

One more thing I noticed is how you use argparse. You have a function parse_args() that instantiates a parser, and then runs its parse_args() function. However, a more robust pattern is to have that function return the parser object. Then you have the first lines of main() look like this::

def main(args):
    parser = parse_args()
    args = parser.parse_args(args)

You could do that all in one line. Anyway, the value of having that function return a parser rather than the args object is that this will help with an aspect of documentation down the line. This documentation page for pvl shows the two command line programs that we provide, and lists their options. We don't write that out twice, there's a module in the documentation generator that can be pointed to a function in our module that returns a parser object, it then uses that parser object to auto-generate the documentation, so we don't have to update it manually. Using this pattern also makes integrating the demo function easier, so inside it can just be:

        print(f'\nDemo {i}\npython craterstats.py '+c)
        main((f.format(i) + c).split())

That was ... a lot. So I have modified the code as I described above and it is working (the demo runs). I would be happy to share that with you so you can see a working example, but I don't want to just rewrite your code on you. If I were in your shoes I'm not sure if I'd want to "do it myself" or see an example so I'll let you pick, and I want to be clear that I'm just offering suggestions that you can take or leave.

ggmichael commented 3 years ago

Hi Ross,

Thanks for all your suggestions! I just completed your previous set, and that got things working again: I updated the repository.

I shall also have a look to sort out this gm structure. I'm not sure about merging the subfolders into single files, since these are only a subset of a much larger number I intend to port as I continue with Python. I would sooner pull out a subset of files than pieces of files for a new project. But I get what you say about the importing gm within the library - I didn't like doing it that way.

I'd be glad to see your changed code. I'll look at the argparse issue, too.

Thanks again.

rbeyer commented 3 years ago

Now it is my turn to be confused. I pulled down your changes, and when I try and run it, this happens:

> python cli.py -h
Traceback (most recent call last):
  File "cli.py", line 9, in <module>
    import craterstats as cst
ModuleNotFoundError: No module named 'craterstats'

I know why this is happening (another import issue), but I'm just curious how you are getting it to run with this new configuration?

ggmichael commented 3 years ago

Ah, ok. I see that now. I ran the demo from PyCharm, which has craterstats/src on the path, so sees the craterstats package.

ggmichael commented 3 years ago

ok - I moved the library code one further down in the directory structure, not unlike how it was. What do you think of it like that? I changed the argparse code as suggested; moved demo inside now-renamed cli.py. The gm package is not causing any problem now (that I see!), but I'll still look at revising those internal imports.

ggmichael commented 3 years ago

sorry - it's still not working... leave it a bit

ggmichael commented 3 years ago

should work now

rbeyer commented 3 years ago

Ah. I see what you did, this is looking good. I've just got a few suggestions that would tweak this just a tiny bit.

I see that you moved things down into craterstatslib/ but you don't need to. Let me take a different stance on imports than I did above. Admittedly switching everything around to the from .<module> import <submodule> pattern is a bit much, and is the stance you'd use to try and always import things from "inside" the module. But we can instead adopt the import pattern that someone "external" to your module would use.

So you currently have this pattern:

import craterstatslib as cst
import gm

The problem is that this pair of lines implies that there is a craterstatslib module and a gm module present in the namespace, and there isn't unless you are exactly in the directory with those two sub-directories.

However, we can change it to this pattern:

import craterstats as cst
import craterstats.gm as gm

So the idea here is that this is what Python users "outside" your package would use to import your functions and attributes, everything starts with craterstats at the root. Just like when you use numpy you start with import numpy and everything follows from that. The other thing about this is that you only need to change the import statements and don't have to mess with the body of the programs. This is also the import pattern that tests which you will eventually write will need to end up following. Making this conceptual change then allows you to bring all the stuff (__init__.py included) back up from craterstatslib/ and then you can delete that directory.

The trick here is then how to "call" the program so that it runs, and also has access to these modules which are "part of" the craterstats module. Before you were having trouble "calling" the craterstats module from craterstats.py because you were "running" craterstats.py from "inside" the directory, and the import mechanism couldn't resolve the name. The solution is to "run" it from outside (a crutch for now, not particularly different from running it from inside, just a slightly better crutch).

So if you make a little script program, and put it in the src/ directory, and run it from there, then the import pattern above will work (because there's a craterstats/ directory that the Python import mechanism can load). More importantly, when you move on to packaging this up for distribution, and using setuptools this is how the eventual craterstats command line program that will be installed when users pip- or conda-install the craterstats package.

The other piece to get this to work is to amend the hard-coded paths that you have. Again, eventually there are ways to abstract those paths, but for now we can leave them hard-coded. For example, you have template="config/default.plt" that would need to change to template="craterstats/config/default.plt". Also, sadly, all of the paths to the sample/ directory in config/demo_commands.txt are going to need to get prefixed with craterstats/. Again this is temporary, and it'll be switched up later.

So my example of all this can be found in this example branch, including a little cs script in the src/ directory. So to run the program you would:

> cd /path/to/craterstats/src/
> ./cs -h

And you can see the differences from your current set-up by looking at the commit diffs.

I then have one remaining question. In your get_parser() function you had this:

def get_parser():
    if "parser" in get_parser.__dict__:
        return get_parser.parser
    #
    # ... further down ...
    #
    get_parser.parser=parser
    return parser

And I'm not sure what the intent was here? Commenting it out didn't break anything, so maybe this was just leftover from some experiments? Maybe you were trying to save the work of re-building the parser during your successive demo runs, but Python is pretty good about optimizing that under the hood, and I don't really think you'll see a performance hit.

ggmichael commented 3 years ago

Thanks, Ross. That looks good - I'll try it. Thanks for the github demonstrations, too - that's helpful to see. You're right - the parser code was to avoid rebuilding the parser. I have to work on some other stuff for the next few days... I'll come back to this end of next week.

ggmichael commented 3 years ago

Hi Ross, I included all the changes you suggested - that works.