ssec / sift

A visualization application for satellite imagery
http://sift.ssec.wisc.edu/
GNU General Public License v3.0
46 stars 14 forks source link

Replace `AreaDefinitionsManager` class with an instance #346

Open djhoese opened 1 year ago

djhoese commented 1 year ago

Unless there is a good reason, I don't see why the class in uwsift/model/area_definitions_manager.py was made a global class that gets class attributes updated instead of a single instance of that class that gets internally updated.

CC @ameraner @arcanerr if you have any insight on this.

djhoese commented 1 year ago

Bottom line: I'm annoyed at the amount of at-import stuff happening.

arcanerr commented 1 year ago

Hi David, I see your point, that initializing the AreaDefinitionsManager directly at import is not that nice.

Having everything in class methods is a way of avoiding the need to manage a singleton object (including to assure, that the class is not instantiated twice), while still ensuring that there is a common view of the available area definitions. I don't see why this would be a bad decision per se - except for what you point out about where the class is initialised.

An alternative approach to your suggestion could be to make AreaDefinitionsManager.init_available_area_defs() protected/private, add an is_initialized() check at its beginning and call it first in each of the AreaDefinitionsManager query methods. Then you get lazy initialization and still no need to handle a singleton, right? Or AreaDefinitionsManager.init_available_area_defs() could be called once wherever you would otherwise instantiate the single(ton) instance.

The chosen approach should have made the class easy to use anywhere in SIFT, just importing would give a ready initialized interface to the desired queries, no need to worry about initialization order. But when now thinking about it I guess, the class would be re-initialized at every import the way it is done right now. Here I possibly think still to much in C++...

So, I think, you got a point here. At the end of the day, it's your's and EUMETSAT's decision.

djhoese commented 1 year ago

Thanks @arcanerr. In general, things happening at import time makes it harder to test the package. Complex global "constants" that are accessed via import make it hard to test too. When looking at the code out of context I see the singleton-like usage, but ultimately a class that is entirely class methods and never gets initialized is not what I am used to seeing and has a "code smell" to it. In this case, with the same behavior, I think I would prefer a single class instance declared at import time and optionally have area/file loading delayed for the first usage of it. There is no reason to prevent the class from being instantiated multiple times since it is only used from this single global instance/class by the rest of SIFT.

Another "simple" alternative would be to wrap some of these things in an lru_cache or similar decorator and treat it as accessing cached area definitions rather than explicitly managing a container of the areas. Not great, but an option.

I think my long term hope, but admittedly harder to implement design, would be to have a single instance created when SIFT is started and gets passed where it is needed. This of course requires passing the manager or access to the manager to various places which can be difficult to keep clean, but that's the challenge of any good software system. I'm not going to pretend the old SIFT or the timeline of either UW or EUM/ASK development allowed for this type of more complex design, but I think it would benefit the project in the long run.

But when now thinking about it I guess, the class would be re-initialized at every import the way it is done right now.

Python will "cache" imports and realize that it has already imported the module once and not re-run it. This isn't true for multiprocessing where each separate Python interpreter instance will re-import all modules, but that's not a big deal for us right now.

arcanerr commented 1 year ago

Hm, you think, a class that is not instantiated has a code smell :thinking:.

Actually, the fact that the area definitions are cached internally is currently completely transparent to the client of the class. So nothing needs to be passed to any caller of the functionality, the caller just calls a function. To the caller it looks like a library. This is good, I think.

Ok, what I did here is a bit like "abusing" the class for having a namespace AreaDefinitionsManager with library functions that access a protected/hidden cache/registry that should only be instantiated once at best and only accessed via the functions (=class methods).

What would be a better way (not too much extra code, easy to test) to express this in Python?

What currently makes testing this less easy than it could be is that the initialization function does several things in one:

  1. read from the config which area definitions should be made available
  2. cache (register) those from the list from (1) that are available in pyresample
  3. Make sure that at least one area definition is available.

This could be split to omit (1) in a test and include a list of area definitions to be registered in the test code. But (I keep thinking) since the config can easily be filled programmatically, even this does not make much difference.

djhoese commented 1 year ago

What would be a better way (not too much extra code, easy to test) to express this in Python?

I think just creating a regular instance right above the init_ call:

area_manager = AreaDefinitionsManager()
area_manager.init_available_area_defs()  # or move this inside the '__init__'

In most normal runtime uses we can assume this code, even at module level, will only be called once.

But (I keep thinking) since the config can easily be filled programmatically, even this does not make much difference.

The config thing is one of my other difficulties with this code and some of the code in uwsift/__init__.py where uwsift.config.get is called at import time. Although not used heavily or ever in SIFT, the config stuff is supposed to be dynamic so we should load/get these values as late as possible. We want to avoid ignoring runtime changes to the config, but also don't want to suffer a performance penalty by reloading things at every request. In Satpy we have some old code that we are slowly migrating to behave more like this where they call a function like get_preferred_chunk_size() at runtime every time it is needed. We also have parts in Satpy that have a custom version of the @lru_cache decorator which uses some values from satpy.config.get as part of the cache key/ID so we automatically re-compute/load things when the config changes but cache it otherwise.

I could also see something in between this and possibly in pyresample or satpy where YAML strings or filenames to YAML filenames are cached with the resulting area definitions. From the "user" (SIFT) point of view it looks like areas are being cached per-area, but they are actually per-file/string of areas.

arcanerr commented 1 year ago

I still don't see an advantage to create AreaDefinitionsManager instance(s), which you have to manage and pass around between modules in case you want to query for area definitions in different source files. The idea was to have one global source of truth for asking anywhere in SIFT: Which area definitions have been registered?

In particular, if you want to reload configurations or otherwise dynamically change the available area definitions, this becomes more complicated (at least not less code) if you have a (or worse: several) instance(s) of the class.

Or would you leave the "registry" (those two reciprocal mappings) as the two class variables (i.e. not created in __init__)? But what would the instance be good for if it doesn't contain any instance-specific state? After all, in Python a class is already an object.

Regarding the interplay between config and AreaDefinitionsManager: init_available_area_defs() could be changed to be

  @classmethod
  def register_area_definitions(area_definitions_names: List[string]):
      ...

This completely decouples the AreaDefinitionsManager from the config. A convenience classmethod of the AreaDefinitionsManager could implement how to get these area_definitions_names from the/a config object, but such a method could also be placed elsewhere. I'd prefer each logical unit to know about its configuration itself, but I see reasons to manage this more centrally.

Long story short, as you can see, I would still prefer the interface of the current implementation, which I think is easier to use. But of course it is up to you which design you prefer.

I'd like to add: I'm not familiar with the @lru_cache decorator, I've only had a cursory look at it. It may make all my thoughts above obsolete :wink: .

djhoese commented 1 year ago

you have to manage and pass around between modules in case you want to query for area definitions in different source files. The idea was to have one global source of truth for asking anywhere in SIFT

becomes more complicated (at least not less code) if you have a (or worse: several) instance(s) of the class.

There would never be more than one instance of the class, just like there is only ever one Workspace, one Document, one SceneGraphManager, one database session manager (at least in the old UW version of SIFT). The benefits are:

  1. Explicit high-level declaration of what pieces of code use/access this registry. This is done by passing access to the registry or by passing the contents of the registry to the parts that need it. For example passing a list of available areas that can be requested/used later versus passing the entire registry manager/object.
  2. Avoid import-time side effects, generally making testing easier via things like dependency injection.
  3. In my opinion, Python code should avoid global objects as much as possible. At least that is what I would consider standard/common practice. Constants, sure. Complex objects that read from disk, are mutable, and have multiple ways of accessing them, not really.

With your example of wanting to reload or dynamically change any part of the registry, why would we want anyone (any piece of SIFT) to have the ability to do this? That just makes debugging harder as the registry could be modified from anywhere. If the registry/manager object is "owned" by one high level class then it is very clear who should be doing the updating/modifying (read-write) and everything else should be accessing (read-only).

Or would you leave the "registry" (those two reciprocal mappings) as the two class variables (i.e. not created in init)? But what would the instance be good for if it doesn't contain any instance-specific state? After all, in Python a class is already an object.

Why do they need to be class attributes at all? Why can't they be in an instance? I'm not sure I see the benefit of the global state. I can agree that there is an ease of access with just importing a global collection of these things, but I don't agree that that makes it the best long term decision for the rest of the code and the overall design. At that point, why isn't everything a global instance that gets imported where it is used?