thinkingmachines / geowrangler

🌏 A python package for wrangling geospatial datasets
https://geowrangler.thinkingmachin.es/
MIT License
47 stars 14 forks source link

Prevent unexpected bugs from function definitions #237

Closed tm-danna-ang closed 1 month ago

tm-danna-ang commented 1 month ago

Describe the bug

More detail described in this blog post, and here. Recommended approach is to use None in the default value an then define the mutable object (in this case dictionary) within the function

The variable extra_args is local to the function in terms of its scope. However, the mutable default object (the dict) it refers to is created once when the function is defined and persists as part of the function object. This makes the default object behave unlike typical local variables, as it retains its state across multiple calls to the function. Thus, while extra_args is a local reference, it points to a persistently existing mutable object that is tied to the function's definition

Very slight nuances in mutating a variable vs assigning a variable sample here from a reddit thread

def func(a=[]):
a = a + [3]
return a

func()
> [3]
func()
> [3]

def func(a=[]):
a += [3]
return a

func()
> [3]
func()
> [3, 3]

Additional context

butchtm commented 1 month ago

hi @tm-danna-ang this should be something we need to review on a case-to-case basis.

The downside of using None as the replacement for the default value is that it does not provide a clue as to what to override it with especially when you have tools like autocomplete/tab completion as well tools that display the method signature when you hover over the method name.

See example for the raster zonal stats method (on Colab).

image

Replacing the extra_args with None would no longer provide a clue as to what you can replace it (unless you go in and read the source or provide more comments in the doc string).

As for the downside of the possibility of propagating errors -- this only occurs if you are mutating and returning the parameters (which extends the lifecycle of the parameter object beyond the scope of the function call) which might not be the common case. I can see the rationale for the suggested rule and the potential for error, but I don't think it is a general rule that we should apply.

If you look at the source code for create_raster_zonal_stats or even create_exactextract_zonal_stats method, this kind of "caching" never occurs or has no side effect as the extra_args is never part of the returned values. (its only used via **extra_args in the call to the underlying library.

So I would hesitate making this a convention -- I would rather check that the mutable parameter is not being returned so that there are no unintended side effects.

joshuacortez commented 1 month ago

Hi @butchtm thanks for pointing this out! Agree that using a dictionary for extra_args might be better for this case

tm-danna-ang commented 1 month ago

thanks for the insights @butchtm and @joshuacortez! will scan through the code if there are cases when a mutable argument is returned and will close this Issue if otherwise.

tm-danna-ang commented 1 month ago

did a quick check and did not find any since these cases were only for specifying aggregations and extra args which were not returned in the end. closing this Issue! thank you!