glotzerlab / freud

Powerful, efficient particle trajectory analysis in scientific Python.
https://freud.readthedocs.io
BSD 3-Clause "New" or "Revised" License
262 stars 48 forks source link

Document behavior when data includes np.nan? #1251

Open doronbehar opened 2 months ago

doronbehar commented 2 months ago

Description

Document what happens when data includes a np.nan.

Motivation and Context

So I wanted to use this example from the docs:

box = [100, 100, 100]
points = [[-1, 0, 0], [0, 0, 0], [2, 0, 0]]
query_args = dict(mode='nearest', num_neighbors=1, exclude_ii=True)
list(freud.locality.AABBQuery(box, points).query(points, query_args))

Putting a np.nan in one of the points, produced a result as expected result:

>>> points = [[-1, 0, 0], [0, np.nan, 0], [2, 0, 0]]
[(0, 2, 3.0), (2, 0, 3.0)]

I had to run that little test though to find that out, and it would have been nice if it was mentioned even with an example.

tommy-waltmann commented 2 months ago

Hi @doronbehar ,

Generally, freud takes a "garbage in, garbage out" approach, meaning it makes no claims about what will happen when it is given invalid data. This approach is what is generally used throughout the python ecosystem, as python isn't a statically-typed language and therefore provides no type-checking natively.

A good rule of thumb is to not give freud invalid data and avoid this situation entirely. Does your use case prevent you from validating the data before using it as an input to freud?

doronbehar commented 2 months ago

Does your use case prevent you from validating the data before using it as an input to freud?

For sure I can accommodate to your policy using something as simple as xNoNan = x[~numpy.isnan(x)]. I was just thinking though, that it'd be much nicer to have a definite behavior I can rely upon regarding np.nan, which personally I don't consider as garbage. Same with np.inf.

Even if you wish to not guarantee any behavior, i.e the policy is garbage in garbage out, that should be written in the docs IMHO. In comparison, Numpy doesn't specify the np.nan policy of all of it's functions, but for example they do mention np.nanmean in the documentation page of np.mean, and they also provide np.ma for usecases like mine.

joaander commented 1 month ago

@doronbehar You are certainly welcome to perform a full audit of the entire freud source code and determine the NaN behavior of every method. As @tommy-waltmann stated, we are fairly confident that the vast majority of methods will produces NaN output given a single NaN input (e.g. Box, Cluster properties, MSD, PMFT, RDF, local density, correlation functions, order parameters). These methods perform computations on your inputs with no NaN checking whatsoever, so any input NaN will propagate to the output.

You have found a corner case with the neighbor list where the input is used primarily in a comparison. Neighbor lists use values in two ways: 1) Inserting particles into a spatial data structure and 2) comparing the distance between particles. For the 2nd step, the NaN comparisons (<, >) will both return false. I assume this is what leads to the behavior you obtain.

I have no idea how the NaN fit into the spatial data structure in a sensible way in your example. I think using NaN with a larger set of points will give you very different behavior. Your small example fits all particles in 1 node. When building multiple nodes, AABBQuery internally sorts particles based on their coordinates. Due to the NaN comparison rules, it is impossible to sort a sequence containing NaN. Therefore, the underlying assumptions that AABBQuery makes are invalid and could lead to undefined behavior (e.g. infinite loops, extremely slow performance, missing particles that should be neighbors, and many other things are possible). Other types of NeighborQuery methods may even cause a segmentation fault with unchecked NaN input.

I think the solution to this issue is that AABBQuery (and all other NeighborQuery data structures) should raise an error when NaN is provided for an input coordinate. HOOMD-blue does this. As you suggest, we can document this error condition clearly.

doronbehar commented 1 month ago

Thanks for kindly explaining this. I understand now your concerns and indeed I would never have imagined how these data structures are probably influenced in such a non-trivial manner by a few np.nan there. Never the less, personally for my usecase, I got the desired behavior when my data contained a few np.nan - these particles simply were ignored in the final result so it seems. I understand though that they weren't ignored in the AABBQuery class, and it is safer to filter out the nan particles completely before using them to instantiate an AABBQuery.

I agree that raising an error, and writing a word on that in the docs would be great.