lime-rt / lime

Line Modeling Engine
http://www.nbi.dk/~brinch/index.php?page=lime

Can't reproduce benchmark problem 1 results #147

Open smaret opened 8 years ago

smaret commented 8 years ago

The fractional populations computed by LIME for benchmark problem 1 (see PR #146) differ significantly from the values of van Zadelhoff et al. (labeled as "fiducial" in the figure below).

[figure: benchmark1]

For the figure above I used 4,000 grid points, log-sampled as a function of radius. The results are in better agreement for a larger number of grid points (I tried 10,000 points), but there is always a slight offset between LIME and the fiducial model.

smaret commented 8 years ago

CC @christianbrinch @tlunttil @imckstewart @allegroLeiden

tlunttil commented 8 years ago

Well, this is definitely something that needs to be looked into. Which LIME version did you use in these tests? Was there a version in which LIME still agreed with the benchmark results?

allegroLeiden commented 8 years ago

Do you have a comparison with lime-1.3? That was more or less our starting point. One of two situations will be true: (i) the mismatch is the same now as it was for 1.3; (ii) it is different. Only (ii) will be easily addressable.

There is also a question of priorities here. It seems to me that we are still really pre-alpha with Lime, in that we are still sorting out numerous structural issues, traps and seg faults. Code is no good if it doesn't produce the right numbers, but it is even worse if it seg faults every time the user ventures a little way 'off piste'. The other major work being done on Lime at the moment consists of speedups and the introduction of break points with comprehensive file IO - both of which will make the task of debugging much easier. Thus, while such tests of output against fiducial values will of course eventually be essential, just at present I can't help feeling that it would be a little premature to divert much effort in this direction.

tlunttil commented 8 years ago

There are a couple of additional quick checks that would be good for testing:

allegroLeiden commented 8 years ago

Another item which has long been on my wish list for Lime is a suite of test harnesses which test the output of small parts of the code - at the individual function level. Such tests are much easier to debug and, in the ideal world in which funding for software development actually matched the requirements, should precede holistic testing, since if any one of the parts is not right, clearly the whole can't be right. At the really basic level below even that is the kind of line-by-line sanity checking which @tlunttil and I have intermittently been labouring over these past 6 months. So I see value in a multi-level approach to testing in which the holistic test is the very last thing one tries.

tlunttil commented 8 years ago

Well, that benchmark tests only half (or maybe two thirds) of the whole thing, since we're comparing level populations instead of final raytraced spectral lines. So we're really only testing the Jbar computation and the solving of the balance equations. And because LIME correctly recovers the thermalised T_ex=T_kin populations in the inner part and the optically thin T_ex=T_CMB in the outer part, my hunch would be that problems in the Jbar computation are to blame. But even if that's true, there are several possible causes. Some are 'benign': an insufficient number of grid points, ray directions/velocity offsets, or iterations, but obviously there's also the possibility of real bugs. The LIME results look as if they effectively had too high an optical depth; this is what the known issue #125 would do. The discrepancy looks rather large to be caused only by that, though.
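
For reference, the two limits mentioned above follow from the standard two-level statistical equilibrium; a minimal statement in generic notation (not Lime's), added here just to make the reasoning explicit:

```latex
% Two-level statistical equilibrium:
n_l \,(B_{lu}\bar{J} + C_{lu}) \;=\; n_u \,(A_{ul} + B_{ul}\bar{J} + C_{ul})
% Collision-dominated (inner, thermalised) limit:
%   n_u/n_l \to C_{lu}/C_{ul} = (g_u/g_l)\,e^{-h\nu/kT_{\mathrm{kin}}}
%   \Rightarrow T_{\mathrm{ex}} = T_{\mathrm{kin}}
% Optically thin outer limit with negligible collisions and \bar{J} \to B_\nu(T_{\mathrm{CMB}}):
%   n_u/n_l \to (g_u/g_l)\,e^{-h\nu/kT_{\mathrm{CMB}}}
%   \Rightarrow T_{\mathrm{ex}} = T_{\mathrm{CMB}}
```

Both limits can be reproduced even if the \bar{J} estimate is biased at intermediate optical depths, which is consistent with the hunch above.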

smaret commented 8 years ago

Which LIME version did you use in these tests?

I used the HEAD version.

Do you have a comparison with lime-1.3

Yes, I also tried to run the benchmark with v1.31 (8c0a0ad), and it gives the same results as the HEAD version:

[figure: benchmark1_4000pts_v131]

tlunttil commented 8 years ago

Ok, so it's not something our recent extensive changes have caused. That's good, and bad...

I'll probably have a try with this tomorrow. I have some theories about what is causing this that I could check pretty quickly.

smaret commented 8 years ago

I've tried to run the same model (1a) with the LIME HEAD version for different numbers of grid points, and here are the results:

[figure: benchmark1a_pts]

The agreement with the fiducial model is better for larger numbers of grid points. Note the model with 256000 grid points (!) has only 8 ALI iterations, while the others have 16.

smaret commented 8 years ago

Does changing the number of global (ALI) iterations have an effect (NITERATIONS in lime.h, default is 16)?

Indeed this has an effect on the result, but it is not what I would have expected:

[figure: benchmark1a_iter]

After 2 ALI iterations, the LIME result is in pretty good agreement with the fiducial model. But as the number of iterations increases, the LIME results seem to converge towards a different solution. The dependence on the number of grid points is probably a natural consequence of this: for a given number of iterations, a larger number of grid points gives a lower final SNR, and therefore a solution closer to the fiducial model.

allegroLeiden commented 8 years ago

It doesn't directly bear on this, but I'll just remark that it is quite difficult to judge convergence with the current algorithm, because the choice of new photon directions with each iteration shuffles the cards. I wrote a branch fixed_photons to explore the convergence with this shuffling turned off, and found that it was both orderly and relatively fast. I never made a pull request on this because I think we need to fix the structure of the code a bit first, but it is a good example of the changes we should IMO be contemplating before we can really begin to address the performance of the basic algorithm.

smaret commented 8 years ago

Another item which has long been on my wish list for Lime is a suite of test harnesses which test the output of small parts of the code (...) So I see value in a multi-level approach to testing in which the holistic test is the very last thing one tries.

I think that both are needed. Unit tests are extremely useful for testing the code (ideally they should be written at the same time as the main code, though -- writing a test suite afterwards is a real pain). However, a test suite won't allow you to find potential issues with the code's algorithm; only a benchmark against other codes can.

I'll just remark that it is quite difficult to judge convergence with the current algorithm

For this specific case, you can see that the populations "shift left" as the number of iterations increases up to 16, and then stay the same at 32 iterations. So it seems to me that the code has converged, but towards a solution with a higher opacity than the fiducial model, as pointed out by @tlunttil.

tlunttil commented 8 years ago

I had today a closer look at this issue. It seems to me that this is more of a feature of the current LIME algorithm than a real bug, and the problem is greatly amplified in the case of a strongly inhomogeneous density and grid point distribution (especially if the number of points is low). Changes to photon tracking similar to ex-PR #83 reduce the discrepancy, but at least with a smallish number of grid points the issue remains.

The figure shows the results of the base case (4000 points, 16 ALI iterations) after my changes. You can see that it looks a bit better than in @smaret's first figure.

[figure: figure_1]

smaret commented 8 years ago

I had today a closer look at this issue. It seems to me that this is more of a feature of the current LIME algorithm than a real bug, and the problem is greatly amplified in the case of a strongly inhomogeneous density and grid point distribution (especially if the number of points is low).

Thanks for looking at this. Do you know which assumptions in the algorithm cause this apparent overestimation of the opacity? Is it a consequence of the "zig-zagging" that you described in #125? Also why is the agreement better for a small number of ALI iterations?

tlunttil commented 8 years ago

It's partly the zig-zagging, although in this case the "half-interval mismatch" seems to be more important. All in all it's a number of factors, which I'll try to write up in more detail at some point.

The reason fewer iterations give a better match is just a coincidence. Populations are initialised (in iteration 0) according to the optically thin approximation and take ~15 iterations to converge. At some point the populations happen to pass through the correct values on their way to full convergence. If the populations were initialised to LTE, a small number of iterations would (probably; I haven't tested this) give an even worse result.

allegroLeiden commented 8 years ago

I was wondering if this would improve once we get better grids, i.e. grids which follow the density better and/or over which the user has more control.

Apropos of that, as a kind of diagnostic tool I have been thinking about the best way to output the maximum (i.e. line-centre) opacity per grid edge/link. (Perhaps calculate the maximum opacity per grid point, then output a spatial image cube by linear interpolation between points..?) The idea would be to give the user some feedback as to whether their point sampling is adequate. I guess the ideal is to have a typical line-centre opacity of order unity on each edge/link?
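
To make the suggestion concrete, here is a rough sketch (not Lime code; the function name, argument list and constants are assumptions for the example) of how the line-centre optical depth of a single edge could be estimated from quantities at the edge midpoint:

```c
#include <math.h>

#define CLIGHT 2.99792458e8   /* speed of light [m/s] */

/* n_l, n_u : lower/upper level number densities at the edge midpoint [m^-3]
 * g_l, g_u : statistical weights
 * A_ul     : Einstein A coefficient [s^-1]
 * nu0      : line rest frequency [Hz]
 * b        : Doppler (1/e) velocity width [m/s]
 * length   : edge length [m]                                              */
double edgeLineCentreTau(double n_l, double n_u, double g_l, double g_u,
                         double A_ul, double nu0, double b, double length)
{
  /* Gaussian profile at line centre: phi(nu0) = c / (sqrt(pi) * b * nu0) */
  double phi0 = CLIGHT / (sqrt(M_PI) * b * nu0);
  /* absorption coefficient at line centre, corrected for stimulated emission */
  double alpha0 = CLIGHT * CLIGHT / (8.0 * M_PI * nu0 * nu0)
                  * A_ul * (n_l * g_u / g_l - n_u) * phi0;
  return alpha0 * length;   /* dimensionless optical depth along the edge */
}
```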

tlunttil commented 8 years ago

I was wondering if this would improve once we get better grids, i.e. grids which follow the density better and/or over which the user has more control.

Maybe. However, it's a delicate issue. If we have a grid point sitting in a region with a strong density gradient and the number of grid points follows the density, I think we'll have a situation where the number of neighbours tends to be higher towards the dense regions. I believe that in the current algorithm this will lead to a bias such that the neighbours found in the photon propagation loop tend to lie in the higher rather than the lower density regions, which will lead to an overestimation of the optical depth. I think that a variation of the algorithm (which we discussed in some emails a while ago), where instead of choosing the edge with the smallest angle to inidir we choose the neighbour closest to the "true" ray, would reduce but not eliminate the bias. I'll probably try that soon.
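
For what it's worth, a schematic sketch of that variant (all structs and names below are hypothetical, not the actual photon.c code): pick the forward neighbour with the smallest perpendicular distance to the original ray, rather than the edge making the smallest angle with inidir.

```c
#include <math.h>

typedef struct { double x[3]; } point_t;

static double dot3(const double a[3], const double b[3]) {
  return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* x0     : current reference position along the "true" ray
 * inidir : unit vector of the original photon direction
 * nbr    : positions of the current point's Delaunay neighbours
 * nNbr   : number of neighbours
 * Returns the index of the chosen neighbour, or -1 if none lies forward. */
int chooseNeighbour(const double x0[3], const double inidir[3],
                    const point_t *nbr, const int nNbr)
{
  int i, best = -1;
  double bestDist2 = HUGE_VAL;

  for (i = 0; i < nNbr; i++) {
    double d[3] = {nbr[i].x[0]-x0[0], nbr[i].x[1]-x0[1], nbr[i].x[2]-x0[2]};
    double s = dot3(d, inidir);        /* projection onto the ray direction */
    if (s <= 0.0) continue;            /* ignore neighbours behind the current position */
    double dist2 = dot3(d, d) - s*s;   /* squared perpendicular distance to the ray */
    if (dist2 < bestDist2) { bestDist2 = dist2; best = i; }
  }
  return best;
}
```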

rschaaf-aifa commented 8 years ago

The point that @tlunttil describes is discussed under the title "diffuse drift" for the SimpleX algorithm in section 3.3.2 of Chael Kruip's doctoral thesis (https://www.strw.leidenuniv.nl/events/phdtheses/kruip/). He discusses several other effects of what he calls "diffusive transport" and "ballistic transport" and corrections of these effects by weighting schemes. I have not yet followed this reasoning in any detail, but it may be worth the effort...

allegroLeiden commented 8 years ago

@tlunttil is not a huge fan of the simplex algorithm, I believe. :smile:

christianbrinch commented 8 years ago

I haven't really followed this discussion since I am on vacation. I just stumbled on this particular message.

The issue being raised by @tlunttil is well known and we solved it prior to version 1.0! The fact that more grid points tend to lie towards denser regions does skew J-bar. LIME has dealt with this in two different ways. First, weights were introduced which weight the intensity of a photon by the angular area of its Voronoi facet. In the second method (which is still in place unless someone has removed the code), photons are emitted in random directions and propagate along the closest Delaunay connector. Both methods ensure that the intensities are sampled in an unbiased way. This was the feature that made LIME pass the van Zadelhoff benchmark the first time.

Long story short: it is a non-issue (unless it has been tinkered with).

tlunttil commented 8 years ago

The fact that more grid points tend to lie towards denser regions does skew J-bar. LIME has dealt with this in two different ways. First, weights were introduced which weight the intensity of a photon by the angular area of its Voronoi facet. In the second method (which is still in place unless someone has removed the code), photons are emitted in random directions and propagate along the closest Delaunay connector. Both methods ensure that the intensities are sampled in an unbiased way. This was the feature that made LIME pass the van Zadelhoff benchmark the first time.

I remember seeing parts of the weighting method code in one of the first versions that I saw (1.3?), but it wasn't in use then and I think the code has been removed since. Now the algorithm is basically an inverse Monte Carlo (or long characteristics in randomised directions) which uses the underlying Delaunay edges as a sort of look-up structure for finding the neighbouring grid points. But as far as I can see, this lookup introduces a bias in areas where the distribution of Delaunay edges is (strongly) non-isotropic. I had a quick look at the thesis that @rschaaf-aifa mentioned, and their "diffuse drift" indeed seems to be closely related to the issue in the current version of LIME.

In short, no version I've seen avoids this (or some closely related) problem. If someone has an old version that works differently (and passes the benchmark), I'd be very interested in seeing it.

allegroLeiden commented 8 years ago

I remember seeing parts of the weighting method code in one of the first versions that I saw (1.3?), but it wasn't in use then

Me too. I should maybe make the point to @christianbrinch that there have, as yet, been very, very few if any changes made to the central algorithm, because the likely effects of any changes have not been clear. We've been pretty conservative with it. The only change I can think of offhand is the one Tuomas and I are presently discussing in #118, and I am going to revert that. There have been innovations in the gridding and the raytracing, but we have not really begun on the core, although there are plenty of plans in this direction. As proof of that, note that Sebastien says the benchmark failure is exactly the same with 1.3 as with the current master. When I inherited responsibility for Lime, I gathered together all the versions I could find. I wrote to you at that time, if you recall, asking for advice. You replied that you had left the project and had no useful information you could give me. So if there are changes you introduced to Lime in versions not readily available, then I find it difficult to see why you should expect them to have been propagated into the present project.

christianbrinch commented 8 years ago

Uhm, I don't think I said anything about me having added anything to LIME recently. I haven't done anything since version 1.41 and this is the version that Sebastien has.

My only point was that the specific problem (which is indeed a problem) of having more Delaunay connectors toward denser regions is already solved. A long time ago. 2010, I believe.

ParfenovS commented 8 years ago

I've seen the code that weights the photon intensity by the angular area of the Voronoi facet, and I've restored the possibility to weight the intensity. In short, it can be done relatively easily using the getArea function in grid.c and by multiplying the mp[0].jbar[iline] and vsum variables in the getjbar function by a new variable mp[0].weight[iphot]. mp[0].weight[iphot] is initialised in the photon function and equals g[here].w[dir] (I can give more comments on the code changes if someone is interested).
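
For anyone reading along, a schematic of what that weighting amounts to. The loop structure and names below are simplified stand-ins, not the actual getjbar code; only the idea of multiplying both the jbar accumulation and vsum by a per-photon weight mirrors the change described above:

```c
/* Weighted estimate of the mean intensity per line from a set of photons. */
void accumulateJbar(const int nPhot, const int nLines,
                    double **contrib,     /* contrib[iphot][iline]: per-photon line contribution */
                    const double *vfac,   /* per-photon line-profile factor */
                    const double *weight, /* per-photon weight, e.g. angular area of its Voronoi facet */
                    double *jbar)         /* output: weighted mean intensity per line */
{
  int iphot, iline;
  double vsum = 0.0;

  for (iline = 0; iline < nLines; iline++) jbar[iline] = 0.0;

  for (iphot = 0; iphot < nPhot; iphot++) {
    for (iline = 0; iline < nLines; iline++)
      jbar[iline] += weight[iphot] * vfac[iphot] * contrib[iphot][iline];
    vsum += weight[iphot] * vfac[iphot];
  }

  /* Normalise; with equal weights this reduces to the unweighted average. */
  if (vsum > 0.0)
    for (iline = 0; iline < nLines; iline++) jbar[iline] /= vsum;
}
```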

Here are benchmark results without weighting and with 4000 points:

[figure: initial]

And this is what I have with weighting and 4000 points:

[figure: weights]

The following are results with weighting and 20000 points:

[figure: weights20000]

All these results were obtained after 16 iterations.

ParfenovS commented 8 years ago

I also want to mention that the doppler value in the model1a.c and model1b.c files is not correct. The code assumes that doppler is a turbulent width and adds a thermal width to the doppler value in the molinit function. The width a=0.15 km/s given by van Zadelhoff et al. is the total thermal+turbulent linewidth. Thus, the doppler value should be: sqrt(150.*150. - 2.*KBOLTZ/AMU);
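
A sketch of what the corrected model function could look like, given the point above: molinit adds the thermal width in quadrature to whatever doppler() returns, so the model should hand back only the turbulent part. The temperature and molecular mass below are placeholders for illustration, not the benchmark values:

```c
#include <math.h>

/* physical constants (lime.h provides its own definitions) */
#define KBOLTZ 1.380658e-23   /* Boltzmann constant [J/K] */
#define AMU    1.6605402e-27  /* atomic mass unit [kg] */

void doppler(double x, double y, double z, double *doppler) {
  const double aTotal  = 150.0;  /* total (thermal + turbulent) 1/e line width [m/s] */
  const double tkin    = 10.0;   /* placeholder kinetic temperature [K] */
  const double molMass = 29.0;   /* placeholder molecular weight [amu] */

  /* subtract the thermal part in quadrature; molinit() will add it back */
  double thermal2 = 2.0 * KBOLTZ * tkin / (molMass * AMU);
  *doppler = sqrt(aTotal * aTotal - thermal2);
}
```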

tlunttil commented 8 years ago

Thanks to @ParfenovS for noticing the linewidth. Unfortunately the change makes things even a bit worse, because it makes the optical thickness even larger.

I ran a few more tests with 4000 points to see how many iterations it takes to converge. The answer seems to be about 32 iterations for model 1a and 'a lot' for 1b. I ran these with a version including the #83 fixes and something probably similar to the 'fixed_photons' branch that @allegroLeiden mentioned (i.e. resetting the PRNGs for each iteration and related stuff).

It's not really a surprise that 4000 points is nowhere near enough in such optically thick cases for a 3D code that doesn't assume spherical symmetry. I'm also not sure the distribution of points is quite optimal here; they seem to be concentrated very heavily in the inner part.

The important thing is of course whether the solution converges to the correct one when the number of points is increased and, if so, how many points are needed. I'm planning to run some tests, but it takes a while. (As an aside, how long does the grid construction + triangulation take for you? Even after I reduced the number of smoothing iterations to 5, it takes about an hour for 40000 points - much longer than the actual calculation! Is there something fishy in my qhull installation?)

[figures: figure_1a, figure_1b]

ParfenovS commented 8 years ago

Indeed, 4000 points is too few, and the distribution of points is not very optimal. I'm also making some test calculations. The following result was obtained after 16 iterations with 16000 grid points distributed with the weight weight1=pow(totalDensity/normalizer,0.05), with the changes to photon tracking from PR #83, and with the weighting of photons:

[figure: 16000_9_0 05_weight_track_modrand]

The power of 0.05 in weight1, the changes to photon tracking, and the weighting of photons all significantly affect the difference from the fiducial model.
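
In case it helps others reproduce this, a rough sketch of the kind of density-weighted rejection sampling that a weight like weight1 plugs into (this is not the actual grid.c code; the gsl_rng usage, the cube-shaped sampling volume and the totalDensity()/maxDensity names are assumptions for the example):

```c
#include <math.h>
#include <gsl/gsl_rng.h>

/* Model density at a point [m^-3]; assumed to be provided elsewhere. */
extern double totalDensity(double x, double y, double z);

/* Draw one grid point: uniform candidates are kept with probability
 * (density/maxDensity)^exponent, so a small exponent (e.g. 0.05) makes the
 * point distribution follow the density only weakly. */
void samplePoint(const gsl_rng *ran, const double radius,
                 const double maxDensity, const double exponent, double p[3])
{
  double weight1;
  do {
    p[0] = (2.0*gsl_rng_uniform(ran) - 1.0) * radius;
    p[1] = (2.0*gsl_rng_uniform(ran) - 1.0) * radius;
    p[2] = (2.0*gsl_rng_uniform(ran) - 1.0) * radius;
    weight1 = pow(totalDensity(p[0], p[1], p[2]) / maxDensity, exponent);
  } while (gsl_rng_uniform(ran) >= weight1);
}
```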

The grid construction can indeed take a lot of time. In my experience, it doesn't depend significantly on the qhull installation.

smaret commented 8 years ago

I also want to mention that the doppler value in the model1a.c and model1b.c files is not correct.

Thanks @ParfenovS. This is fixed in 0554a74.

smaret commented 8 years ago

@ParfenovS @tlunttil Could you please use the HEAD version for your future tests? Otherwise I can't reproduce the results. If changes are needed in the code (such as the one introduced in #83), please open a PR for them.

tlunttil commented 8 years ago

I've started running some benchmarks comparing LIME against another code I've used/worked on (actually a descendant of one of the codes in van Zadelhoff et al.). The two tests from the paper are quite tough for a 3D code, so real benchmarking with only them becomes extremely time consuming.

I'll be testing various modifications to the code to see if LIME and the other code converge to the same result and what kind of grids are needed in LIME for this to happen. I'll open a PR (or a new issue) once I have something figured out.

As far as this issue goes, I think that the answer is (a) due to the (very) high optical depths, a very large number of grid points would be needed in a 3D code, and (b) there are also some quirks/bugs/features in LIME that seem to make convergence even a bit worse (in terms of the number of points needed, and possibly also whether it would ever converge to the right answer).

aschmiedeke commented 7 years ago

This issue was just pointed out to me, so please excuse my (kind of) delayed reply. I ran a few tests including all the versions of lime that I could find on my computers: very early versions such as v1.03 and v1.21; my version 1.4, which Ian had merged with the general (initial) LIME-rt version; the more recent v1.5; and the latest (as of last Friday) v1.6.1.

For model 1a I obtain the following:

[figure: model1a_results]

For model 1b I obtain:

[figure: model1b_results]

Note: For consistency, I have tried to keep all setup parameters fixed between the different versions. I used 4000 points, 12 iterations, the sampling set to 0 everywhere, and no anti-aliasing. I have no clue what is going on with v1.03 in the optically thick case.

I have also worked on an implementation of the van Zadelhoff+2002 model 2. I fitted their tabulated values for the density, gas temperature, radial velocity and microturbulence (first four plots in the next figure) and have so far plotted the level populations (I have not yet compared the excitation temperatures or calculated intensities, i.e. van Zadelhoff+2002 Figures 7 and 8). For the level populations, I extracted the "Hogerheijde results" from the plots provided on the workshop webpage.

[figure: model2]

tlunttil commented 7 years ago

Thanks a lot @aschmiedeke. As I wrote earlier, these tests are hard for a 3D code and it's no surprise that with only 4000 points there are differences from the reference solutions. Only vZ+ model 2a has such low optical depths that Lime isn't hurt too much by its coarse 3D grid. It's encouraging that in that benchmark the results look quite close to the reference.

That said, Lime had, and still has, some 'features' which cause problems. Many of these were fixed in 1.6 and it's good to see that at least some improvement has happened. The soon-to-be-released 1.7 should bring further improvement.

aschmiedeke commented 7 years ago

@tlunttil, I agree, these tests are hard (but they are necessary). For completeness, here are the results using v1.5 for the model 2a/b benchmark. As can be seen, LIME was hurt by the coarse grid (I am still using 4000 points in these model 2 tests) even in the low-abundance case.

[figure: model2]

I will try to take some time this week to put the setup of the vZ+ model 2 benchmark in a new PR so that you don't have to repeat the effort of setting this up.

christianbrinch commented 7 years ago

Can I ask you guys what molecular data file you are using? If you want to do the van Zadelhoff benchmark, you had better be using the HCO+ data file that was available in 2002 and not the one you can get from LAMDA; otherwise you will not be able to reproduce the benchmark results.

tlunttil commented 7 years ago

The two-level case (benchmark 1) used the file that is included in PR #146. I haven't tested benchmark 2; I think @aschmiedeke is the only one who has looked at that recently in any detail.

aschmiedeke commented 7 years ago

Hi Christian, unfortunately I have not yet had time to upload the vZ benchmark models 2a,b. But I was indeed aware of the change in the molecular data file and produced a LAMDA-style molecular data file that contains the values used in the vZ benchmark.

christianbrinch commented 7 years ago

I just ran van Zadelhoff model 2a/b with version 1.41. Here are my results (which, by the way, are pretty similar to Fig. 10 in the original LIME paper, which was based on LIME 1.03):

[figures: figure_10, figure_11]

Can we please agree that the results are not "all over the place"?