mdhaber / scipy

Scipy library main repository
http://scipy.org/scipylib/
BSD 3-Clause "New" or "Revised" License

ENH: pareto.fit system of derivatives #68

Closed swallan closed 2 years ago

swallan commented 2 years ago

This isn't as polished as what I would open against SciPy proper, but it works. There's no seed set on the pareto test; I've left it that way for now because of the instability described below.


swallan commented 2 years ago

edit: this issue is resolved


This "nudging"

                # Pareto support starts at loc + scale, so the fit requires
                # loc + scale <= min(data); nudge the scale down if violated.
                if not ((fscale + floc) <= np.min(data)):
                    fscale = np.nextafter(fscale, 0)

seems to work well for a lot of the previously bad seed/parameter combinations. However, I think the nudge isn't always far enough. For example, with this set of parameters

np.random.seed(2535619)
b, location, scale = 1, 0, 1

I get a familiar problem:

mle_analytical: (0.8022234054216126, 0.5373550503158566, 0.4678392102380698)
numerical_opt: (1.1494928912047517, 0.00014973594798414286, 1.0050445229214544)
ll_mle_analytical: 71148.7782309058
ll_numerical_opt: 173.73860454075663

The resulting fit actually requires four more nudges to be within the support:

>>> shape, location, scale = (0.8022234054216126, 0.5373550503158566, 0.4678392102380698)

>>> (location + scale) == data.min()
True

>>> pareto.pdf(data.min(), shape, location, scale)
0.0

>>> while (location + scale) == data.min():
...     print("+1 nudge")
...     scale = np.nextafter(scale, -np.inf)
+1 nudge
+1 nudge
+1 nudge
+1 nudge

>>> (location + scale) == data.min()
False

>>> func((shape, location, scale), data)
170.72730971692

# and 
>>> pareto.pdf(data.min(), shape, location, scale)
1.7147417058381749

Should I implement it in a loop to check whether the nudge was enough to satisfy the constraint?

swallan commented 2 years ago

Interesting that the Meson CI is running here even though this isn't the main SciPy repo. Either way, the test case I set up to fail did fail!

mdhaber commented 2 years ago

Should I implement it in a loop to check whether the nudge was enough to satisfy the constraint?

Initially, yes. Go ahead and implement this idea to see if it always works.

Then, it would be even better if we could figure out something that works in one shot. Could you look into how rv_continuous decides to return 0 for the Pareto PDF? Pareto doesn't have _argcheck, so I'm wondering how that happens. Maybe if we understand why it fails, we can figure out something that will work.
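
Roughly something like this, I imagine (a sketch only, reusing the fscale/floc names from the snippet above, not the final PR code):

    # Sketch: repeat the nudge until the support constraint actually holds
    while not ((fscale + floc) <= np.min(data)):
        fscale = np.nextafter(fscale, 0)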

swallan commented 2 years ago

In rv_continuous.pdf, the data are first standardized as x = (x - loc)/scale, and then the support (a, b) of the underlying standard distribution is retrieved from the shape parameters. The condition (a <= x) & (x <= b) is used as a mask: the output array is pre-filled with zeros, and elements of x that violate the support are left unchanged. That's how the PDF reports that there is no chance those elements could have been produced by the distribution.
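
Roughly, the logic looks like this (a simplified sketch of the idea, not the actual SciPy source; pdf_sketch and its arguments are illustrative):

    import numpy as np

    def pdf_sketch(dist, x, *shapes, loc=0, scale=1):
        x = (np.asarray(x, dtype=float) - loc) / scale  # standardize the data
        a, b = dist._get_support(*shapes)               # support of the standard dist
        in_support = (a <= x) & (x <= b)                # mask of valid points
        out = np.zeros_like(x)                          # zeros everywhere by default
        out[in_support] = dist._pdf(x[in_support], *shapes) / scale
        return out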

swallan commented 2 years ago

Once I opted to ensure that (fscale + floc) is strictly less than data.min(), hypothesis testing can't find any more of these cases. So while the condition technically may be that fscale + floc <= min(data), the numerically stable version is that fscale + floc < min(data). I'm going to have hypothesis vary the other parameters with this method to see if there are actually any cases where more than one iteration is needed, but so far I haven't found any.
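
Concretely, the stable version amounts to this (a sketch; the only change from the earlier snippet is the strict inequality):

    # Nudge until loc + scale is *strictly* below the smallest observation;
    # equality can still leave min(data) outside the support (see below).
    while not ((fscale + floc) < np.min(data)):
        fscale = np.nextafter(fscale, 0)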

It's a little strange given the behavior I observed above, which I can still reproduce, where more than one nudge was needed to satisfy the constraint. I think it is some sort of floating-point behavior:

In [92]: location + .4678392102380698 == data.min()
Out[92]: True

In [93]: location + np.nextafter(.4678392102380698, 0) == data.min()
Out[93]: True

They are still equal after the nextafter, but the original scale violates support and nextafter(scale, 0) does not:

In [94]: func((0.8022234054216126, 0.5373550503158566, np.nextafter(.4678392102380698, 0)), data)
Out[94]: 170.72730971691993

In [95]: func((0.8022234054216126, 0.5373550503158566, .4678392102380698), data)
Out[95]: 71148.7782309058

So really the fix is applying a nextafter even if scale + loc is already equal to min(data). I suspect the comparison still returns True because the difference between scale and np.nextafter(scale, 0) is so minute that the sum rounds to the same float:

In [106]: scale - np.nextafter(scale, 0)
Out[106]: 5.551115123125783e-17

It's quite interesting what can happen when differences are this small:

In [109]: scale == np.nextafter(scale, 0)
Out[109]: False

In [110]: location + scale == location + np.nextafter(scale, 0)
Out[110]: True
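
A plausible explanation (my own check with np.spacing, assuming the parameter values above): the one-ulp change to scale is about a quarter of the spacing between adjacent doubles near location + scale ≈ 1.005, so the correctly rounded sum is unchanged:

>>> import numpy as np
>>> location, scale = 0.5373550503158566, 0.4678392102380698
>>> scale - np.nextafter(scale, 0)   # one ulp of scale, near 0.47
5.551115123125783e-17
>>> np.spacing(location + scale)     # one ulp of the sum, near 1.005
2.220446049250313e-16

That would also be consistent with the four nudges needed above: four ulps of scale add up to one ulp of the sum.
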
swallan commented 2 years ago

I merged SciPy main into this branch to fix a Meson build issue, so all of those commits appeared here.