Print out info when running

Charlottez112 commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently freud does not provide options for printing out info when it's running. This can be helpful especially if the computation takes a long time to finish. For example, when I use EnvironmentCluster and EnvironmentMotifMatch and set registration = True, it usually takes a very long time, and I will have to wait until it finishes to check if the results make sense.

However, if there's an option to print out things like a progress bar, or the current number of clusters etc, we can better monitor the process and maybe end things early if the results seem wrong.

Describe the solution you'd like

Add an option to generate outputs such as the number of clusters found so far, and the number of particles left. Or even print out an estimate for the time it will need to finish the computation.

tommy-waltmann commented 2 years ago

One solution to the problem here is to just test the calculations on a subset of particles in the system to see if they are giving the desired results. This would allow you to determine if the calculation is giving the desired results without it taking too long.

What your proposed solution is hinting at here is the idea of a verbosity level. HOOMD has implemented this for debugging certain components, but the use case would be a bit different here. I'm not opposed to the idea of adding verbosity levels to the compute methods, because the concept could be widely applied if needed across many of freud's modules in a way that keeps the API consistent. We would need to ensure that this won't introduce a significant performance regression, though.

I'm also interested in exactly what is meant by "the results seem wrong". Are your results indicating there may be a bug in the module's implementation, or do you just mean the results aren't what you would ideally like them to be?

joaander commented 2 years ago

HOOMD has verbosity controls up to and including debug messages on the order of the timestep. HOOMD doesn't provide anything on a per-particle or per-pair level like you propose here to enable progress bars in freud. One needs to be very careful to avoid performance degradation when adding to the innermost loop.

For example: Freud uses TBB to multithread most operations. Python's interpreter is single threaded. Any sort of callback that enables a progress bar that calls the Python interpreter would cause the whole calculation to run in serial.
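One common way to keep reporting overhead out of an inner loop is for each worker to count locally and touch shared state only once per batch. The sketch below illustrates that idea with plain Python threads. `BatchedProgress`, `worker`, and `run` are hypothetical names, and nothing here reflects how freud or TBB is actually implemented; in freud the equivalent logic would have to live at the C++ level.

```python
import threading


class BatchedProgress:
    """Shared progress counter that workers update once per batch (hypothetical helper)."""

    def __init__(self, total, report_every):
        self.total = total
        self.report_every = report_every
        self.done = 0
        self.reports = []  # progress snapshots; a real tool might print these instead
        self._next_report = report_every
        self._lock = threading.Lock()

    def add(self, count):
        # Called once per batch, so the lock is taken rarely.
        with self._lock:
            self.done += count
            if self.done >= self._next_report or self.done == self.total:
                self.reports.append(self.done)
                self._next_report = self.done + self.report_every


def worker(items, progress, batch_size=100):
    """Process items, flushing the local count to the shared counter per batch."""
    local = 0
    for _ in items:
        # ... per-item work would go here ...
        local += 1
        if local == batch_size:
            progress.add(local)
            local = 0
    if local:
        progress.add(local)  # flush the final partial batch


def run(total=8000, n_threads=4, report_every=2000):
    """Split the index range across threads and return the progress object."""
    progress = BatchedProgress(total, report_every)
    chunk = total // n_threads
    threads = []
    for t in range(n_threads):
        start = t * chunk
        stop = total if t == n_threads - 1 else start + chunk
        th = threading.Thread(target=worker, args=(range(start, stop), progress))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    return progress
```

With the default arguments, each worker takes the lock 20 times for 2000 items instead of 2000 times, which is the kind of amortization the performance concern calls for.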

Charlottez112 commented 2 years ago

> One solution to the problem here is to just test the calculations on a subset of particles in the system to see if they are giving the desired results. This would allow you to determine if the calculation is giving the desired results without it taking too long.

I thought about this but using only a subset of particles we won't be providing the correct neighbors, so I feel like this might cause more problems.

> I'm also interested in exactly what is meant by "the results seem wrong". Are your results indicating there may be a bug in the module's implementation, or do you just mean the results aren't what you would ideally like them to be?

For example, we might need to try different numbers of env_neighbors. For different env_neighbors, the total number of clusters would be different. I was using this on a BCC crystal with a little bit of Gaussian noise added (total number of particles: around 8000), and I was expecting 1 or 2 environment clusters, but it found around 4000 different clusters with registration = True. If the search could print the number of clusters found so far when it reaches, e.g., the 4,000th particle (I imagine it would be over 1000 at that point), I would stop it, because that is already far more than I expected.
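The behavior described above can be sketched as a loop that logs the running cluster count at coarse intervals and gives up once it exceeds what the user expects. This is only an illustration of the requested feature: `cluster_with_monitor` and `assign_cluster` are hypothetical names, and whether freud's clustering could expose intermediate counts in this way is not established in this thread.

```python
def cluster_with_monitor(items, assign_cluster, max_clusters, report_every=1000):
    """Label items one by one, stopping early if too many clusters appear.

    assign_cluster is a hypothetical callable returning a cluster label for an item.
    Returns (labels, log, finished), where log holds (items_done, n_clusters) pairs.
    """
    labels = set()
    log = []
    for n_done, item in enumerate(items, start=1):
        labels.add(assign_cluster(item))
        # Check only every `report_every` items to keep the overhead small.
        if n_done % report_every == 0:
            log.append((n_done, len(labels)))
            if len(labels) > max_clusters:
                return labels, log, False  # stopped early
    return labels, log, True


# Toy stand-ins for a well-behaved run (2 clusters) and a runaway one.
good = cluster_with_monitor(range(8000), lambda i: i % 2, max_clusters=10)
bad = cluster_with_monitor(range(8000), lambda i: i // 2, max_clusters=10)
```

In the runaway case the loop stops at the first checkpoint rather than processing all 8000 items.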

Charlottez112 commented 2 years ago

> HOOMD has verbosity controls up to and including debug messages on the order of the timestep. HOOMD doesn't provide anything on a per-particle or per-pair level like you propose here to enable progress bars in freud. One needs to be very careful to avoid performance degradation when adding to the innermost loop.

I didn't mean tracking the computation exactly. I was thinking something like roughly how many particles it has checked vs the total number of particles. Would this slow down the computation significantly?

tommy-waltmann commented 2 years ago

> For example: Freud uses TBB to multithread most operations. Python's interpreter is single threaded. Any sort of callback that enables a progress bar that calls the Python interpreter would cause the whole calculation to run in serial.

I didn't have a progress bar or some kind of callback in mind. I was thinking about the possibility of some amount of information printing at the C++ level, which may be possible without significant performance penalty.

bdice commented 2 years ago

> I thought about this but using only a subset of particles we won't be providing the correct neighbors, so I feel like this might cause more problems.

For many analysis methods in freud, you can test subsets of the system by providing a full system as points and only a part of the system as query_points. Another way to test small subsets is to generate the neighbor list for the whole system (generally this is not expensive), then subset the system and neighbor list to include only some of the particles.
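The second approach (subsetting a neighbor list built for the whole system) can be sketched with plain Python lists standing in for the neighbor list's index arrays. `subset_bonds` is a hypothetical helper, not a freud function, and turning the filtered indices back into a freud neighbor list object is not shown here.

```python
def subset_bonds(query_point_indices, point_indices, keep):
    """Keep only bonds whose query particle is in `keep`.

    The neighbor indices still refer to the full system, so every kept
    particle retains its complete set of neighbors.
    """
    keep = set(keep)
    pairs = [
        (i, j)
        for i, j in zip(query_point_indices, point_indices)
        if i in keep
    ]
    return [i for i, _ in pairs], [j for _, j in pairs]


# Toy neighbor list for 4 particles, 2 neighbors each.
query_idx = [0, 0, 1, 1, 2, 2, 3, 3]
point_idx = [1, 3, 0, 2, 1, 3, 2, 0]

# Analyze only particles 0 and 2, but with their full neighborhoods.
sub_query, sub_point = subset_bonds(query_idx, point_idx, keep=[0, 2])
```

Because the filter is applied to bonds rather than to positions, the concern about a subset having incorrect neighbors does not arise in this sketch.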

The EnvironmentCluster and EnvironmentMotifMatch are probably the slowest features in all of freud, and they are highly sensitive to the input parameters. It is quite difficult to determine what thresholds to use. I suggest reading @erteich's past papers to try and get a sense for the value ranges that are good to use.

You may also find some helpful scripts in this repository of paper figures used in the freud publication (private but accessible by glotzerlab/group-members): https://github.com/glotzerlab/freud-paper-figures/tree/master/figures/matchenv

For the system of twinned icosahedra (fcc and hcp layers) shown in the freud paper, we used these parameters. Note that the API has changed since this time, but the meaning of the parameters is the same, if I recall correctly. The important one is the threshold, I guess. But I will add that the way you find neighbors (ball query vs. k-nearest) is also important for establishing similar sets of vectors. k-nearest is probably the more robust choice for this analysis.

import freud

# Parameters used for the twinned-icosahedra system (older freud API).
rcut = 1.5
kn = 12
threshold = 0.45

# `box` and `positions` are assumed to be loaded from your own data.
# Build the neighbor list explicitly so we can use it later when picking out motifs.
nn = freud.locality.NearestNeighbors(rcut, kn).compute(box, positions, positions)
me = freud.environment.MatchEnv(box, rcut, kn)
me.cluster(positions, threshold, nlist=nn.nlist, env_nlist=nn.nlist)

If necessary, you can always compile freud from source and insert C++ printouts like std::cout << "particle index: " << i << std::endl;. However, as Josh said, the TBB interaction means it's not straightforward to hook into the current progress from Python and retain efficient parallel execution.

Charlottez112 commented 2 years ago

Thank you @bdice! This is very helpful. As you pointed out, the problem I have with the EnvironmentMatch module is that it's highly sensitive to the parameters. I have read the related paper, but still find it hard to tune the parameters for crystals that are more complex, e.g., crystals with larger unit cells. But this is not the point of this issue.

Charlottez112 commented 2 years ago

If you guys think it's better not to add this option, let me know and I can close this issue. Now I feel like this is more of a problem of the Environment module. In most cases we probably don't need to print out anything.

tommy-waltmann commented 2 years ago

The subset approach and compiling from source/adding print statements seem like the way to go with this issue. Let me know if you need help with any of that.

vyasr commented 2 years ago

@Charlottez112 what you're asking for also seems in line with #278. Feel free to reopen that and propose further improvements if you want to try to incorporate better guidance on the use of the environment matching. I closed that issue because I made significant changes to the code that helped, but the underlying issue of understanding how to choose good parameters still remains. Unfortunately I don't know that anyone has real "expertise" with it, just guidelines from trial and error experiences.

glotzerlab / freud

Print out info when running #967