pynbody / tangos

The Agile Numerical Galaxy Organisation System
BSD 3-Clause "New" or "Revised" License
19 stars 13 forks source link

Inconsistent results when using "latest()" for a halo, or across all halos in a timestep #127

Closed Martin-Rey closed 4 years ago

Martin-Rey commented 4 years ago

Hi all,

This is potentially a follow-up to #123, or might be an unrelated issue that I hadn't noticed before.

I am obtaining different results when using latest() for all halos in a timestep with calculate_all, compared to using it for each individual halo in turn. Most of the results are consistent between these two use cases, but I am attaching an example where they are not:

In [1]: import tangos                                                                                                              

In [2]: timestep = tangos.get_timestep("Halo599_DMO/output_00095")                                                                

In [3]: latest_ids = timestep.calculate_all("latest().halo_number()")[0]                                                          

In [4]: halo = tangos.get_halo("Halo599_DMO/output_00095/halo_21")                                                                

In [5]: halo.calculate("latest().halo_number()")                                                                                  
Out[5]: 1

In [6]: latest_ids[20]                                                                                                            
Out[6]: 26

Calculating latest() from Halo 21 at snapshot 95 returns a link to Halo 1 at z=0 (Out[5], which is correct), while it returns a link to Halo 26 when calculating it at the timestep level (Out[6]). I noticed this problem first while browsing the web interface, when clicking onto a halo from the timestep page would change the result from latest().

The issue is not specific to this halo, I can reproduce it with other examples. Most examples that I have found failing are halos that merge with something else before z=0 and "disappear" into another halo. I can't spot anything obviously wrong with the database, and as an example, I checked that there are no links existing at any point of their history between my example Halo 21 and Halo 26 (the result from latest() at the timestep level).

My guess is a difference in behaviour between the MultiHop strategy used for a single start point, and the MultiSource strategy used for multiples halos at once, but I have found it very hard to pinpoint it more precisely.

Any ideas or inputs? I can provide the database for diagnosis if needed.

Martin

apontzen commented 4 years ago

@Martin-Rey I think I have fixed this in PR #128. Please confirm.

For information, there is also a bug in your test. You cannot take for granted that latest_ids[20] refers in fact to halo 21, because if there is no result for latest() then calculate_all will eliminate the row.

For a robust test you should use

halo_nums, latest_nums = timestep.calculate_all("halo_number()", "latest().halo_number()")

then find halo 21 in halo_nums and look up its corresponding value in latest_nums. However in this particular case the bug in your test is theoretical -- it should have behaved as you expected, and in PR #128 I believe it will do.

Martin-Rey commented 4 years ago

Hi @apontzen. I can confirm that #128 fixes the problems I was seeing

Thanks for pointing out the bug in the test. Since all the halos in this database are linked to z=0, they all have a latest() and I didn't think about the case when they wouldn't.

I'll close this issue, thanks for the quick work!

Martin-Rey commented 4 years ago

I am re-opening this issue, because latest() calculated at the timestep level is now correct as far as I can tell, but with extremely degraded performance. It takes ~10min to compute latest() for all halos in the example database I provided, compared to the few seconds it used to take.

I wanted to check whether this a necessary downside of having the new, more robust implementation?

Martin

apontzen commented 4 years ago

I don't see this performance issue (would have noticed if it was taking 10 minutes!) - could you provide details?

Martin-Rey commented 4 years ago

Ok, so the main performance issue is probably that I am not trimming my merger trees at all in my local configuration:

mergertree_min_fractional_NDM = 0.0 # as a fraction of the most massive halo at each timestep - set to zero for no thinning

which probably amplifies any difference in performance.

However, here is a timing of latest() from the latest commit on the current master branch, with the same (non)-trimming setup

In [1]: import timeit 

In [2]: timestep = tangos.get_timestep("Halo599_DMO/output_00050") 

In [3]: start_time = time.time() 
    ...: timestep.calculate_all("latest()") 
    ...: print("--- %s seconds ---" % (time.time() - start_time)) 
--- 604.0164349079132 seconds ---

And at commit b2d9b602164e49346856f34cdfa588e4114d42d1, i.e. before PR #128 but after the beginning of tinkering with earlier() and later():

In [1]: import time                                                                                                             

In [2]: timestep = tangos.get_timestep("Halo599_DMO/output_00050")                                                              

In [3]: start_time = time.time() 
   ...: timestep.calculate_all("latest()") 
   ...: print("--- %s seconds ---" % (time.time() - start_time))                                                                
--- 244.97739696502686 seconds ---

And finally at commit a8672ecb66d8f07bc0b50bea7eb1e3b0d26f9f7f, before PR #128 and PR #126


In [1]: import time                                                                                                             

In [2]: timestep = tangos.get_timestep("Halo599_DMO/output_00050")                                                              

In [3]: start_time = time.time() 
   ...: timestep.calculate_all("latest()") 
   ...: print("--- %s seconds ---" % (time.time() - start_time))                                                                
--- 7.447889804840088 seconds ---
apontzen commented 4 years ago

Thanks. #129 should significantly improve the performance again, through some quick wins. It probably will not return to the performance of https://github.com/pynbody/tangos/commit/a8672ecb66d8f07bc0b50bea7eb1e3b0d26f9f7f since the algorithm is in fact better and searches more of the tree. But it should be much better than the other two point checks you gave.

If you confirm, I will merge and push this, and also upload to pip since it's a very significant enhancement.

Martin-Rey commented 4 years ago

Hi Andrew, thanks for the quick work. I can confirm that the performance on #129 is much better for the same snippet of code.

In [1]: import timeit 

In [2]: timestep = tangos.get_timestep("Halo599_DMO/output_00050") 

In [3]: start_time = time.time() 
    ...: timestep.calculate_all("latest()") 
    ...: print("--- %s seconds ---" % (time.time() - start_time)) 
--- 26.69350028038025 seconds ---

As you mention, the performance is slightly worse than before #128 and #126, but it is back to very acceptable levels. I am very happy for you to merge #129, close this issue and push to pip whenever practical!