km-git-acc opened this issue 6 years ago
@km-git-acc
Making progress on the matrices: the multiplication (with implicit summation) is now working technically :-)
I could, however, not yet reconcile the data with nsumsapremat and nsumsbpremat. I noted that the dimensions of both are (taylorterms, expterms) (i.e. transposed to what you propose), but that should not influence the matrix elements, right?
The error in the Barrier V1 script still bothered me and I have actually found that the Zarr-vector wasn't properly cleared. This could be the bug you experienced. Attached is the updated code. Could you have a quick look?
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/BarrierV1c.c
@rudolph-git-acc The error message has changed a bit - Error in `./a.out': free(): invalid pointer:
Also, I was able to make the matrix multiplication work. It works well till about X=10^8, but after a point the HxH matrices bmat and amat (which are actually diagonal and hence quite sparse) get really large and I run out of memory. If the Arb library can manage the multiplication with sparse representations, that will solve the bottleneck. As a backup, I have written another storedsums function which does not use matrix multiplication for the main steps at the end.
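A minimal sketch of the sparse idea (dvec and M below are hypothetical stand-ins, not names from the actual script): left-multiplying by a diagonal matrix is just a row scaling, so the HxH matrix never has to be materialized.

H = 4; K = 3;
dvec = vector(H, h, 1/h);                     \\ hypothetical diagonal entries of bmat
M = matrix(H, K, i, j, i + 2*j);              \\ hypothetical H x K matrix it multiplies
full = matdiagonal(dvec) * M;                 \\ memory-hungry: builds the H x H diagonal matrix
lean = matrix(H, K, i, j, dvec[i] * M[i,j]);  \\ same result via row scaling, no H x H matrix
print(full == lean);                          \\ should print 1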
Also, I have simplified things a bit in that storedsums now returns (apart from n0) just two very long polynomials in the formal variables xval and tval. These two polynomials are then continuously fed actual x and t values in abbapproxarr. Attaching the updated script: barrier_multieval_t_agnostic_v2.txt
There has also been an interesting suggestion on the blog to consider Chebyshev polynomials instead of Taylor polynomials for the exponential (which would then reduce the expterms or ttaylorterms needed). Maybe we can explore that once this script is finalized.
@km-git-acc
Thanks for yet another simpler version!
I am now starting to struggle with replicating this code in ARB. Storing xval- or tval-based polynomials in a matrix and then using something like polyeval is not supported in ARB (or at least I don't see a way to accommodate this). Matrix multiplication can quite easily be done, so a version without polya/polyb would be great. There is support in ARB for real and complex polynomials (but not in a matrix), and reading through the ARB documentation I also found a function named TAYLOR-SHIFT (no, not Taylor Swift ;-) ) that apparently can rapidly evaluate a shift from f(x) to f(x+c). Could that be useful for our problem?
I did test whether the diagonal HxH matrices would work in ARB at large heights, however it also runs into a memory error above 10^9.
I found one more matrix that is not being properly cleared in the Barrier V1 script. It is difficult to test whether this fixes it since I can't replicate the error on my machine. I (once again) made the mistake of loading the file into the wrong folder. Sorry for that, maybe it can be moved.
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/BarrierV1c2.c
Let me know if this one works.
@rudolph-git-acc I still got the error: Error in `./a.out': free(): invalid pointer:. Since it is working on your machine, maybe the same script can be tried out on a cloud machine, or you could tell me where the vectors/matrices are cleared and what the general way to do it is. Another option could be that, for testing, we don't clear any pointers at all and then gradually introduce the clears till the error is located. Have also moved the file to the arb folder.
I think the polynomial approach works in parigp since it supports symbolic algebra. Creating a non polynomial version will not be that difficult. I will try creating one later today.
Taylor shift does look like it does what we want in terms of helping calculate abbeff at many points, since if we have estimated it at some coordinate, we want to know the estimate at a nearby coordinate. But I will have to read up more to check how it can be combined with the existing algorithm.
@km-git-acc
A non-polynomial version would be great!
Also studied the Taylor-shift function a bit more and if we'd have the abbeff-function expressed as a polynomial-series, then we might be able to shift the x, y and t parameters quite easily.
I took another deep dive into the Barrier V1 script and found a potential clearing problem when passing the ests-matrix back to the main program. This should be fixed in this version.
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/BarrierV1c3.c
@rudolph-git-acc Tried the latest version but I am getting the same error. The abbeff multieval function in the v2 script attached a few comments back indeed gets expressed as a polynomial of z and t, which can be seen when the polyb or polya polynomial is printed. It should be possible to create this polynomial without resorting to matrices.
While working on the non-polynomial version, I realized that any approach which relies on storing vectors of length H or matrices of HxH or KxH will ultimately run into bottlenecks as N is taken higher. Right now I am trying to modify the calculation such that we instead sum up matrices of (expterms,ttaylorterms) H times.
@km-git-acc
Okay, I have taken another angle to identify the problem. It seems to be related to a variable that is being cleared but not properly initialised. With that lens I went through the code again and found that t was defined and cleared, but not initialised! Here is the updated code:
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/BarrierV1c4.c
On the polynomials: I read quite a bit about the use of ACB-polynomials and believe it is possible to construct one like polya and polyb. I would need to 'vectorise' them in a special way (not as easy as using a matrix, it requires explicit allocation of memory). It seems however that I still end up with H polynomials of order 40 or so (at least that is what polyb looks like when I print it). Is that indeed correct?
@rudolph-git-acc The latest version is partly working. It starts giving output, although something goes wrong along the way. For example:
./a.out 1000 0.2 0.2 10 0 -> Rectangle(140722911784273) : 0.0000000000, 13.877352, 13.933777, -0.000000, 0.31621751574906603599, 52 ... many such lines ... and finally: Error in `./a.out': double free or corruption (!prev): 0x000055756bcad940
Also, on increasing X, less output is produced before the error occurs, and beyond a certain X we directly get the error: ./a.out 100000000 0.2 0.2 10 0 -> Segmentation fault
It seems to be related to memory management or memory exhaustion.
Also attaching the latest script (still polynomial based), which calculated the stored sums at X=10^11 in around 90 seconds (hence much faster than earlier: 5 min for X=6*10^10). The taylor polynomial is calculated recursively (this may be available as a direct function in arb), and is calculated once and shared between polyb and polya instead of twice: barrier_multieval_t_agnostic_v3.txt. Working on a non-polynomial version of the same.
@km-git-acc
Nice work, KM! The speed gain for the initial fill is amazing and if I understand your code correctly this should scale much better towards larger heights of X.
Thanks for the patience on the Barrier V1 script. With the latest error messages I believe I now have found the issue. The rectangle-counter wasn't set to zero explicitly (probably causing it to derail) and I actually introduced the segmentation error by adding the clearing of the ests-matrix that I pass back to the main program. Pretty confident now that this version should work.
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/BarrierV1c5.c
@rudolph-git-acc The latest version is indeed working. Cool!
Yeah, the latest t-agnostic stored sums version doesn't store any long vectors, hence will never run out of memory. Also, as expected, the time required to compute the stored sums increases by around sqrt(10) times for each order of magnitude increase in X (since the number of summands H = N = floor(sqrt(X/(4*Pi))) grows like sqrt(X)).
I have now tested the small matrix approach for the stored sums. Two matrices of dimension (expterms,ttaylorterms) are returned (while going around the rectangle in abbapproxarr, these matrices are indeed turned into polynomials of variable z, although with some additional work even this can be avoided if needed). At X=10^11, the matrix approach took only 67 seconds (from 87 seconds with the polynomial approach), resulting in an approx. 20% gain. I also tried an inbuilt taylor function in parigp with the polynomial approach which resulted in the timings reducing from 87 to 82 seconds, which shows that using inbuilt functions as far as possible will yield an advantage.
Have edited the existing file https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/pari/barrier_multieval_t_agnostic.txt with both the approaches (only the matrix approach is currently activated)
Assuming we are ok spending around 2 hours to compute the stored sums, we could operate at X=10^15. And if we are fine spending a day on it, we could venture into X=10^17.
Of course, once the same is done in Arb, these time estimates should reduce a lot. The workhorse functions in the matrix approach are taylorvec (which computes a vector of the first n taylor terms of the exponential) and matet (which matrix-multiplies the z and t taylor vectors). Optimizing these will keep giving a time advantage. Given what you had mentioned about inbuilt Taylor expansion in Arb, I am feeling quite optimistic about the first.
Also, I noticed there are 5 versions of the current barrier arb script in the arb folder. You may want to give the latest one a more official name to indicate it's final, and move the previous versions to an 'old' folder.
@km-git-acc
Great! I will start working on the ARB version.
To keep the directories clean, grateful if you could just delete all the 5 Barrier V1 versions and I will upload the final one.
@km-git-acc
Trying to reverse engineer your code, but got quite stuck already in the stored sums piece. Grateful for some help on this:
\\faster and accurate taylor vector
taylorvec(xval,nterms) = {
resvec = vector(nterms);
resvec[1]=1; for(ctr=1,nterms-1,resvec[ctr+1] = resvec[ctr]*xval/ctr);
return(resvec);
}
This is the vector with entries xval^(n-1)/(n-1)! for index n=1..nterms, i.e. the Taylor coefficients xval^k/k! for k=0..nterms-1.
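For example, with the definition above:

taylorvec(2, 5)
\\ [1, 2, 2, 4/3, 2/3], i.e. entry n holds 2^(n-1)/(n-1)!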
matet(xval,expterms,ttaylorterms) = mattranspose(taylorvec(xval,expterms))*taylorvec(xval^2/4,ttaylorterms);
gives, for expterms (i) = taylorterms (j) = 5:
1 2 3 4 5
1 1/1*xval^0, 1/4*xval^2, 1/32*xval^4, 1/384*xval^6, 1/6144*xval^8;
2 1/1*xval^1, 1/4*xval^3, 1/32*xval^5, 1/384*xval^7, 1/6144*xval^9;
3 1/2*xval^2, 1/8*xval^4, 1/64*xval^6, 1/768*xval^8, 1/12288*xval^10;
4 1/6*xval^3, 1/24*xval^5, 1/192*xval^7, 1/2304*xval^9, 1/36864*xval^11;
5 1/24*xval^4, 1/96*xval^6, 1/768*xval^8, 1/9216*xval^10, 1/147456*xval^12;
so matrix(i,j) = (1/4)^(j-1)*xval^(i+2j-3) / ((i-1)!*(j-1)!)
matsba(vecab,mat) = [vecab[1]*mat,vecab[2]*mat];
this looks like a trick to create a 3D matrix to combine b and a, where [1] is for sumb and [2] is for suma. Is that correct?
mattopolyt(mat,tval) = {
numrc = matsize(mat);
poly=0; for(i=1,numrc[1],for(j=1,numrc[2],poly = poly + mat[i,j]*tval^(j-1)*zval^(i-1)));
return(poly);
}
I understand that this procedure aims to convert the rows of a matrix into a polynomial. However, it seems that this polynomial has 2 variables, tval and zval, although zval is not given as a parameter to this procedure. Two variables would be tough to put in a polynomial in ARB (so a polynomial-free script would still be welcome!). Could you elaborate on what's happening here?
storedsums(X,y,expterms=50,ttaylorterms=40) = {
N = floor(sqrt(X/(4*Pi)));
H=N; n0 = H/2;
X=X+1/2;y=(1+y)/2;
expob = (-1-y+I*X)/2; expoa = (-1+y-I*X)/2;
finalmats = sum(h=1,H,matsba([h^expob,h^expoa],matet(log(h/n0),expterms,ttaylorterms)));
X=X-1/2;y=2*y-1;
print(finalmats);
return([n0,finalmats[1],finalmats[2]]);
}
What is happening in matsba([h^expob,h^expoa],matet(log(h/n0),expterms,ttaylorterms))?
Do [h^expob,h^expoa] correspond to the vecab[1] and vecab[2] entries, i.e. is for instance h^expob multiplied as a scalar with matet(xval=log(h/n0), expterms, ttaylorterms), and the same for the a-terms?
@rudolph-git-acc I have removed the V1 scripts.
Regarding different parts of the parigp script,
I tried out different ways of calculating the xval^n/n! vector and the one you asked about turned out to be the fastest, but different approaches may shine through in different libraries.
matrix(i,j) = (1/4)^(j-1)*xval^(i+2j-3) / ((i-1)!*(j-1)!) is indeed the same as multiplying the two concerned taylor vectors, although the latter was turning out to be around 2x faster. Not clear why; it could be that passing xval^2/4 as a parameter reduces the number of computations.
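A quick numerical check of that equivalence (a sketch using the definitions above; xval = 0.7 is just a sample value):

xval = 0.7;
m1 = matet(xval, 3, 3);
m2 = matrix(3, 3, i, j, (1/4)^(j-1) * xval^(i+2*j-3) / ((i-1)!*(j-1)!));
print(m1 - m2);   \\ should be (essentially) the zero matrix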
Yeah, once I have calculated matet(log(h/n0),expterms,ttaylorterms), I don't want to do that again, once each for sumb and suma. I searched whether I could multiply a matrix with a list of scalars to get that many matrices (in this case 2), but all I could find was conventional vector-matrix dot product procedures, so matsba(vecab,mat) = [vecab[1]*mat,vecab[2]*mat] was a workaround. I wouldn't call it a 3D matrix, rather a list of matrices, but a 3D matrix may be an alternate way of doing this.
In mattopolyt, the intent is to always pass an actual value of t (within abbapproxarr), so the resulting polynomial should only have zval as the free parameter. If we want to completely remove polynomials, we could have an alternate function
mattoval(mat,tval,zval) = {
numrc = matsize(mat);
val=0; for(i=1,numrc[1],for(j=1,numrc[2],val = val + mat[i,j]*tval^(j-1)*zval^(i-1)));
return(val);
}
where we would now pass actual values of both tval and zval. For eg. the line
bsums = vector(num,v,n0^(bexpo[v]+logtn0/2)*polyevalz(polybt,bexpo[v]+logtn0));
would become
bsums = vector(num,v,n0^(bexpo[v]+logtn0/2)*mattoval(matb,t,bexpo[v]+logtn0));
similarly for asums. I will try putting it all together tomorrow. The reason I was creating a polynomial of zval is to avoid repeated calculations of tval^(j-1), and to reduce the number of terms in the polynomial from expterms*ttaylorterms to just expterms.
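A small consistency check of the two routes (a sketch using the functions above; zval is assumed to be an unassigned formal variable, and the sample values are arbitrary):

mat = matet(0.3, 5, 5);
tv = 0.1; zv = 0.2;
v1 = subst(mattopolyt(mat, tv), zval, zv);   \\ build the zval polynomial first, then evaluate
v2 = mattoval(mat, tv, zv);                  \\ evaluate the double sum directly
print(v1 - v2);                              \\ ~0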
[h^expob,h^expoa] indeed corresponds to vecab[1] and vecab[2] (maybe it would be better to name it vecba). This 'list of scalars' and the matet matrix are passed to matsba, where the individual scalars are independently multiplied with the matrix to get the B and A matrices.
@km-git-acc
Very helpful steers! I have already been able to make some good progress creating the finalmats. Will continue tomorrow.
The latest version of Barrier V1 is now on the GitHub.
@rudolph-git-acc
Attaching a script which relies only on vectors and matrices and not on polynomials. barrier_multieval_t_agnostic_v5.txt While going along a rectangle, the matrix calculations are being done through a two stage process - mattovect and vectovalz, (which is much more efficient than the mattoval approach mentioned in the last comment). Within this, I also tried out two ways of computing the rectangle estimates (one is vectovalz which generates the estimate for one z coordinate at a time, and the other is vectovalvecz, which generates it for all z coordinates together. Interestingly the latter gives slightly worse performance (maybe because it generates somewhat large intermediate matrices)). Per rectangle (the running cost), the matrix approach seems to be taking 25% more time than the polynomial one.
@km-git-acc
Made good progress today and can now generate the finalmat[1], finalmat[2] that exactly match your data. Nice to see how memory usage stays completely flat. Speed is however still slightly slower than with pari/gp:
y=0.2, t=0, expterms=taylorterms=50
X | pari | ARB
---|---|---
10^10 | 22 sec | 25 sec
10^11 | 72 sec | 75 sec
10^12 | 230 sec | 238 sec
10^13 | 720 sec | 743 sec
There must be some more speed to be squeezed out of ARB :-)
Getting the stored sums produced in a reasonable time seems to be the critical factor and I am actually a bit less concerned about the speed of 'looping around the rectangles', so will try the matrix approach first. Although I did find this function:
void acb_mat_charpoly(acb_poly_t cp, const acb_mat_t mat, slong prec): Sets cp to the characteristic polynomial of mat, which must be a square matrix. If the matrix has n rows, the underscore method requires space for n + 1 output coefficients. Employs a division-free algorithm using O(n^4) operations.
Not fully sure yet what it actually does and whether I could potentially use it.
@rudolph-git-acc I was reading about the characteristic polynomial, which has the eigenvalues of a matrix as its roots. It's not clear yet whether such polynomials can be used here in some way. One thing I also realized is that while we call the stored sums (matb and mata) matrices, we are not using them for matrix operations afterwards. Instead we use them for elementwise operations, so they are more like grids of numbers.
As you had hinted several comments back, it seems to be a good idea to let expterms and ttaylorterms be the same value (the higher of the two values that we feel are safe for a given X at the desired accuracy), as that simplifies many of the functions, and we then have one less parameter to worry about.
I was also testing the key function, matet. Computing the two required taylor vectors seems to be a much smaller bottleneck (by 2 to 3 times) than the matrix multiplication between them:
nterms=50;N=10^5;
start=getwalltime();
for(i=1,N,taylorvec(Pi,nterms);mattranspose(taylorvec(Pi,nterms)));
print(getwalltime()-start);
y1=mattranspose(taylorvec(Pi,nterms));y2=taylorvec(Pi,nterms);
start=getwalltime();
for(i=1,N,y1*y2);
print(getwalltime()-start);
Given that the matet matrix has only real values, one potential efficiency gain could come from specifying that the taylor vectors being multiplied are real. matet can also be defined in a different form
matet(xval,expterms) = {
t1=taylorvec(xval,expterms);
t2=taylorvec(xval^2/4,expterms);
resmat = matrix(expterms,expterms,i,j,t1[i]*t2[j]);
return(resmat);
}
This was slower in parigp than the vector multiplication approach, but it's possible it will be faster in arb. I guess the coolest thing would be an inbuilt function for a 'multivariate taylor matrix' M_(i,j) = p^(i-1)*q^(j-1)/((i-1)!*(j-1)!) :)
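For what it's worth, such a helper is a one-liner in pari/gp (a sketch; matet then corresponds to p = xval, q = xval^2/4):

mattaylor(p, q, m, n) = matrix(m, n, i, j, p^(i-1)*q^(j-1)/((i-1)!*(j-1)!));
\\ mattaylor(xval, xval^2/4, expterms, expterms) reproduces matet(xval, expterms)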
@km-git-acc
Was on a trip today and have not been able to spend much time on the script, however what you propose in the last part of your comment is almost exactly how I have currently built it in ARB (except that I multiply the matrices at the end simply by resmat = t1*t2).
I actually started off with matrices for complex numbers (ACB) and then switched back to real numbers (ARB) only, until the moment when we need to multiply with h^expob (or h^expoa). That already gave quite a speed boost. The main bottleneck I now see is the function finalmatb_addmul that I use to first multiply the matrix output of matet with the scalar h^expob and then immediately add it to the previous finalmatb-matrix. This function seems to take most of the time (i.e. multiplying and summing 2500 matrix 'cells' for each h). Maybe, if I had matet as an indexed (by the expterm) polynomial of 50 taylorterms, the multiplication and summing of 50 polynomials of length 50 might actually be faster than this matrix-addmul function. Will experiment more tomorrow.
Fully agree on your 'coolest thing' comment and I have been searching for it. I found there is a function for finite 'exp-ing' an entire matrix, however that does the finite exp for all cells and doesn't store the individual terms per cell.
@rudolph-git-acc Cool. By the way, I was running some tests on the ddx and ddt bound timings, since they seem to become bottlenecks at large X (using https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/barrier_calc_arb.c). Arb seems to be about 16x faster here, although at X=10^13 even it takes around 8 seconds. We may have to do some cost-benefit calculations to determine (or come up with a heuristic for) how often to update these bounds for a given X.
X | arb (abbeff + ddx and ddt) (millisec) | arb (ddx and ddt, estimated) (millisec) | gp (ddx and ddt) (millisec)
---|---|---|---
10^9 | 114 | 82 | 1278
10^10 | 361 | 260 | 4064
10^11 | 1120 | 806 | 12731
10^12 | 3527 | 2539 | 40174
10^13 | 11160 | 8035 |
10^14 | 35406 | 25492 |
10^15 | 114000 | 82080 |
EDIT: I was analyzing a potential heuristic for the above. Suppose that:
1) the machine can generate estimates for 'estrate' z coordinates per second (and that this estrate is mostly invariant or increases very slowly with X),
2) it takes ddxt_time to generate a ddx and ddt bound pair (this is dependent on X),
3) the number of points computed for a rectangle is 4*ddx,
4) the ddx and ddt pair are updated n times (at intervals of tmax/n),
5) the min abbeff estimate stays above 2 in the entire strip (since we will choose a good strip).
If n=1, approx. tottime = ddxt_time + (4*ddx_t0/estrate)*(ddt_t0*tmax) = ddxt_time + (4*tmax/estrate)*[(ddx*ddt)_t0]. Similarly:
n=2 -> 2*ddxt_time + (1/2)(4*tmax/estrate)*[(ddx*ddt)_(t0) + (ddx*ddt)_(tmax/2)]
n=3 -> 3*ddxt_time + (1/3)(4*tmax/estrate)*[(ddx*ddt)_(t0) + (ddx*ddt)_(tmax/3) + (ddx*ddt)_(2*tmax/3)]
n=n -> n*ddxt_time + (1/n)(4*tmax/estrate)*sum_(0_to_(n-1))[(ddx*ddt)_(j*tmax/n)]
n*ddxt_time increases with increasing n, while the remaining part decreases with increasing n, hence there is an optimal n at which tottime should be minimized. Also, ddx and ddt bounds should generally follow a heuristically definable curve for a given X and y, which can be obtained by plotting values at a few t between t=0 and t=tmax. This should hopefully give us a way to compute n. Or maybe a simpler way would be to check the tottime estimate for different n and manually converge to the optimal n. estrate on my machine with parigp is about 5500 approx abbeff estimates per second.
EDIT2: Some sample time estimates at X=10^13, y=0.2,tmax=0.16
n=1 -> 120 + (4*0.16/5500)*(9494*54379) = 60195
n=2 -> 2*120 + (1/2)*(4*0.16/5500)*(9494*54379 + 265*1523) = 30301
n=4 -> 4*120 + (1/4)*(4*0.16/5500)*(9494*54379 + 1560*8864 + 265*1523 + 49*283) = 15913
n=8 -> 8*120 + (1/8)*(4*0.16/5500)*(9494*54379 + 3838*21705 + 1560*8864 + 639*3652 + 265*1523 + 112*648 + 49*283 + 23*129) = 9923
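For reference, the heuristic itself is only a few lines of pari/gp (a sketch; ddxbound(t) and ddtbound(t) are placeholders for whatever routine produces the bound pair at a given t, the sample numbers above being the kind of values those routines return):

\\ total time as a function of n, per the heuristic above
tottime(n, tmax, estrate, ddxt_time) = {
return(n*ddxt_time + (1/n)*(4*tmax/estrate)*sum(j=0, n-1, ddxbound(j*tmax/n)*ddtbound(j*tmax/n)));
}
\\ e.g. vector(8, n, tottime(n, 0.16, 5500, 120)) and pick the n with the smallest value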
@km-git-acc
Nice analysis and we should indeed apply it to be smarter in recalculating the ddt/x bounds!
Today I have successfully built the finalmatb and finalmata, but now purely as indexed polynomials. The previous matet is constructed as follows: for expterm=1, I construct a polynomial of 50 taylorterms by loading the individual coefficients. Then for each incremental exp-index, I apply the simple polynomial arithmetic of multiplying by xval and then dividing by the exp-index to get the next row. So, the expterms define the index and the taylor expansions reside within the polynomial. Now I can just 'evaluate' these final polynomials using standard functions. Maybe even Taylor shifting for each polynomial could be used now. The less good news: it takes 35 secs to fill both 'finalpolys' at x=10^10, so I do need to do further optimisation.
During the testing I did set H to 100 and found to my surprise that the higher indices (e.g. expterms 40 to 50) of the finalmats are already accurate to 30+ digits when H is that low (so the values are equal to those for H=N). I wondered therefore whether we maybe don't need the full summation to H=N for each expterm and could dynamically 'shrink' the expterm-side of the finalmat matrix calculations, thereby reducing the amount of summing required. E.g. expterms 1..10: H=N, 10..20: H=N/100, etc. Or am I overlooking something here?
Edit: Not finding more speed on the polynomial approach. Now thinking about using the matrix approach to build up the finalmats and then turning them into a poly (like you did in an earlier version of your code with mattopolyt). Played a bit with that function and wondered about the following.
Finalmata (x=1000, expterms =4, taylorterms=3):
0.089979 - 0.94363*I, 0.46839 - 0.030684*I, 0.11483 - 0.0027312*I
-1.5345 + 1.0443*I, -0.71922 + 0.10882*I, -0.16380 + 0.0059201*I
0.93677 - 0.061369*I, 0.45932 - 0.010925*I, 0.11069 - 0.00079315*I
-0.47948 + 0.072548*I, -0.21840 + 0.0078935*I, -0.051579 + 0.00044952*I
Output mattopol (x=1000, expterms =4, taylorterms=3):
((-0.051579 + 0.00044952*I)*zval^3 + (0.11069 - 0.00079315*I)*zval^2 + (-0.16380 + 0.0059201*I)*zval + (0.11483 - 0.0027312*I))*t^2 + ((-0.21840 + 0.0078935*I)*zval^3 + (0.45932 - 0.010925*I)*zval^2 + (-0.71922 + 0.10882*I)*zval + (0.46839 - 0.030684*I))*t + ((-0.47948 + 0.072548*I)*zval^3 + (0.93677 - 0.061369*I)*zval^2 + (-1.5345 + 1.0443*I)*zval + (0.089979 - 0.94363*I))
So, could we condense the finalmats during the build-up already into two polynomials like this:
tpoly
(0.089979 - 0.94363*I)*t^0 + (0.46839 - 0.030684*I) *t^1 + (0.11483 - 0.0027312*I)*t^2
zpoly
( -1.5345 + 1.0443*I + -0.71922 + 0.10882*I + -0.16380 + 0.0059201*I)*zval
( 0.93677 - 0.061369*I + 0.45932 - 0.010925*I + 0.11069 - 0.00079315*I)*zval^2
(-0.47948 + 0.072548*I + -0.21840 + 0.0078935*I + -0.051579 + 0.00044952*I)*zval^3
and then add the evaluations together? It would require a bit more summing of the coefficients, but the expterm indices could be dropped (hopefully making the scalar multiplication with h^expob faster).
@rudolph-git-acc I think in the zpoly formula above, zval and t should be entangled with each other.
tpoly
(0.089979 - 0.94363*I)*t^0 + (0.46839 - 0.030684*I) *t^1 + (0.11483 - 0.0027312*I)*t^2
zpoly
((-1.5345 + 1.0443*I)*t^0 + (-0.71922 + 0.10882*I)*t^1 + (-0.16380 + 0.0059201*I)*t^2)*zval
((0.93677 - 0.061369*I)*t^0 + (0.45932 - 0.010925*I)*t^1 + (0.11069 - 0.00079315*I)*t^2)*zval^2
((-0.47948 + 0.072548*I)*t^0 + (-0.21840 + 0.0078935*I)*t^1 + (-0.051579 + 0.00044952*I)*t^2)*zval^3
although it arrives in the desired format once a specific t is passed to it. So zpoly too will have to be updated for every t.
Interesting finding on the finalmats that the terms with higher exp indices converge faster. h^expob and h^expoa do decrease (albeit slowly) with increasing h, due to the -1+-y factor in expob and expoa. Also, xval = log(h/n0) ranges from log(1/n0) = -log(n0) to log(2) as h varies from 1 to N, so the lower h will have a higher impact on the finalmats. Also, values like xval^(i+2j-3)/(i!j!4^j) are expected to be quite small when i and j are close to expterms. All of these point to the behavior you observed. Although it's not clear whether a specific threshold H for each expterm can be found a priori. One way to go about it would be to calculate each cell in the finalmats matrices independently (without resorting to the taylor vector and matrix multiplication techniques which compute the matrices in one go). For a given cell, once the running sum stops changing at the set precision, we could stop the summation for that cell and store the 'converged' value, although this necessitates running a check at each step of the sum.
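Alternatively, a per-cell cutoff could perhaps be set a priori, avoiding the per-step check. A rough sketch (not verified against the scripts) that only uses |h^expob| <= 1 for h >= 1 and |log(h/n0)| <= max(log(n0/h0), log(2)) for h >= h0 (the latter since H = N = 2*n0):

cellcutoff(i, j, n0, H, eps) = {
c = 4^(j-1)*(i-1)!*(j-1)!;
h0 = 1;
\\ crude tail bound: the terms for h = h0..H change cell (i,j) by at most
\\ H * max(log(n0/h0), log(2))^(i+2*j-3) / c
while(h0 < H && H*max(log(n0/h0), log(2))^(i+2*j-3)/c >= eps, h0 = 2*h0);
return(min(h0, H));   \\ sum h = 1..cellcutoff for this cell instead of h = 1..H
}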
@km-git-acc
Aha, I see it now. Interestingly, when I print(mattopolyt(finalmats[2],t)) I do not get the correct output (maybe because t has a value?), but when I use print(mattopolyt(finalmats[2],t1)), it prints as you outlined above.
I believe there still is a trade-off to be made between producing the matrix (matet) for each h and the addmul of the resulting matrix by h^expoa/b over all h. In arb I can calculate (not sum) all matets over all h in just 3.9 secs, however when I switch on the addmul it takes 20 secs more. Hence reducing the matet before the addmul will give a lot of gains (especially since it is still all real numbers here). Based on my wrong assumption of being able to sum matet horizontally, I managed to halve the required time.
@rudolph-git-acc Nice observation. Assuming calculating the finalmats as done now takes x amount of time, it's indeed taking 0.7x to just calculate (without summing) the complex matrices matsba([h^expob,h^expoa],matet(log(h/n0),expterms,ttaylorterms)), and it takes around 0.18x (around 4 sec) to calculate (without summing) all the matets.
I then checked what happens if I sum up real matrices for all h
yb = (-1-y)/2; ya = (-1+y)/2;
finalmats = sum(h=1,H,matsba([h^yb,h^ya],matet(log(h/n0),expterms,ttaylorterms)));
This takes around 0.44x, or less than half the time. I think while summing up so many matrices is a pain, it gets exacerbated by the matrices being complex. However, so far I can't think of a way to avoid the complex ones.
@km-git-acc
Yep, adding and multiplying complex values seems to require twice the amount of time compared to their real equivalents (which seems logical to me).
I also don't see a smart way to optimise the once-off build-up further (maybe except for your suggestion to minimise the summing of each matrix cell and stop any operation on that cell when the desired accuracy has been reached). Also maybe faster hardware could help us a bit.
The matrix method for the stored sums appears to be the fastest option for arb (only a few seconds slower on the fixed part than pari/gp, but I hope to compensate for this in the 'variable' pieces). I can now transform the finalmats into polynomials and have started work on the integration with the sars, afacs, etc. Hope to be able to test the overall speed later tonight :-)
Once this all works, I think we should start planning some runs.
@rudolph-git-acc Nice. We could keep increasing X by powers of 10, and I think we should be good at least till 10^15 or 10^16. The faster hardware comment got me thinking. Summing up [h^expob,h^expoa]*matet is a massively parallel problem. Although presumably beyond our scope, one could use a GPU for this, breaking the H range into much smaller parts.
@rudolph-git-acc I was wondering whether, at very high X (e.g. 10^20) when the ddtbounds and ddxbounds become too expensive to calculate, there is a way to reduce the n*ddxt_time part of the time heuristic.
Due to the n^(-ct) factor in the bound summands, one would expect (everything else remaining constant) the bounds to decrease exponentially as t increases. Indeed this is what we see, for e.g. at 10^13 (see the attached plot).
If instead of following the curves, we computed the ddx and ddt bounds only at t0 and tmax, and then followed the blue line bound_t0 + m*t, where m = (bound_tmax - bound_t0)/tmax, we could keep updating the intermediate bounds at almost no cost, and our overall bound calculation time would become almost 2*ddxt_time. The disadvantage of course is in the initial part of the t range, where the linear bound decreases slower than the curved bound, resulting in more rectangles with a larger number of mesh points getting processed. However, as X is increased, the tradeoff should flip after a point.
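A minimal sketch of that linear stand-in (bound_t0 and bound_tmax are the two bounds computed once at the endpoints of the t range):

linbound(t, tmax, bound_t0, bound_tmax) = bound_t0 + (bound_tmax - bound_t0)/tmax * t;
\\ assuming the true bound decays roughly exponentially (hence convexly) in t, as observed above,
\\ this chord lies above the true curve in the interior, so it stays a valid (if weaker) bound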
@km-git-acc
Nice! I like your thinking on "bounding the bounds" this way.
I have the arb-code for the Barrier v2 working now with exactly the same output as pari/gp. Already ran a few tests that clearly support the need to frequently recalculate the bounds to reduce the number of rectangles:
For 6*10^10+2099, y0=t0=0.2, expterms=tayterms=50, Barrier v2 completed:
Will now do some clean-up and expect to share the script later today (hopefully with fewer errors than the previous one)!
@km-git-acc
Below is the Barrier V2 script. Grateful if you could give it a try. Note that the numbers will be slightly different from your pari/gp script since I replaced the xNp1's by xN's and also pass X (and N) directly to the ddx/ddt procedures (so no more xN calculation within the procedures).
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/BarrierV2.c
There might be a few opportunities to improve speed that I will experiment with later tonight.
@rudolph-git-acc It's working and is quite fast :) I noticed the bounds decrease at each t step, although at a regular pace. For e.g. for X=6*10^10, y=0.2, it reached t=0.01 at rectangle 515. Is this the linear bound in action? (EDIT: 1751 rectangles overall)
I was also wondering whether arg(z1/z2) can be computed directly, but that looks difficult.
@km-git-acc
Nice :-) I also considered calculating the arg immediately during the build-up of the ests, however I ran into a problem with the sequence (esp. the sides that need to be 'crawled' backwards). What could be done though is to establish the minmodabb directly and save a loop up till the rectsize (that is of size 11072 at 6*10^10) at the end. Have already identified quite a few small improvements in the code for the calculations of sarr, afac, etc.
I am planning an x=10^13 run tonight, just to see how it performs.
@rudolph-git-acc By the way, do you know how to resolve complex merge conflicts on GitHub? Prof. Tao opened a pull request about an hour back. I haven't been able to merge it like earlier. https://github.com/km-git-acc/dbn_upper_bound/pull/90 EDIT: The conflict was in a pdf file, which I guess GitHub can't diff like it does text files. I copied that pdf file from his repo and uploaded it here to resolve the conflict.
@rudolph-git-acc I was just testing how the script runs at X=10^14 and since the stored sums will take 30 to 45 minutes to compute, I realized there is a further efficiency gain possible especially at very high X, although in a non-algorithmic way.
If we allow the prt input to have another value (say -1) which tells the script to calculate and print only finalmatb and finalmata (n0 is just N/2, so it can be easily derived from X), we can direct that output to a file, something like: ./barrier 100000000000000 0.2 0.15 50 50 -1 > storedsums_X_10pow14_y_0.2.txt. This will be a small enough file and we can share it among ourselves.
During the eval, the script can then allow an additional optional input, the path of a storedsums file. Then we can divide the t range among ourselves and run the script. (EDIT: Just realized that for division of the t range to be possible, the script will also have to accept two t values.)
EDIT2: As a test, I saved the storedsums at X=10^9,10^10 and 10^11, y =0.2 storedsums_X_10pow9_y_0.2.txt.gz storedsums_X_10pow10_y_0.2.txt.gz storedsums_X_10pow11_y_0.2.txt.gz Calculated it with precision of 100 digits and 100 expterms*100 ttaylorterms (timings increase around 9 times by doing this). finalmatb and finalmata are separated by 3 blank lines. The same file should be usable with even less expterms or taylorterms since we could just extract the top left sub matrices. Saving more than needed should provide flexibility in case more accuracy is required in some future application. (output format between parigp and arb may differ although they should be interchangeable with some find and replace operations) (the arb output format looks much cleaner)
@km-git-acc
Great idea. You are a very creative person, km!
Such a prt-switch can be easily added to the script. Reading a file back into the t-loop will take me a bit longer, since I haven't processed files in arb yet. Adding a range of t's to be processed is easy again. I will also look into C code that allows partitioning the work across different CPU cores. I have seen this work in Python (with a standard package) and the parallel processing gave great speed boosts.
I ran x=10^13, y=t=0.2, terms=50 and it completed in 7 hours (2746 rectangles). I obviously did not pick an optimal location (like the +2099) yet, so I do expect the time can be reduced. For your pre-calc idea above, we should also do the stored sum runs for 'good' barrier spots. Maybe this should now be our priority: to find some promising Barrier locations for all the 10^x's we want to process. Wasn't there a routine to help find these spots?
Last night I finetuned the V2 script and optimised a large number of calculations like (1-y)/2, t/4 etc. Code got much simpler, but guess what: the overall processing time at 6*10^10 became a second slower...
@rudolph-git-acc Thanks. The feeling is mutual. This is a function I had used earlier to find promising offset candidates near X.
barrier_strip_search(bfile,X,numx=10^5,nprimes=5) = {
p = primes(nprimes);
offset = [-0.5,0,0.5];
for(tht=0,numx,x=X+tht+0.5;\
aty0 = vector(3,i,abs(prod(v=1,nprimes,1/(1-1/p[v]^(1/2-I*(x+offset[i])/2)))));\
aty1 = vector(3,i,abs(prod(v=1,nprimes,1/(1-1/p[v]^(1-I*(x+offset[i])/2)))));\
row = concat([tht+0.5],aty0); row = concat(row,aty1);\
print(tht); write(bfile,row);\
);
}
filename = "~/large_abbeff_X_candidates.txt";
X=10^13; numx=10^4; nprimes=4;
barrier_strip_search(filename,X,numx,nprimes)
It computes the Euler product over the first few primes at X+offset-0.5, X+offset, X+offset+0.5, and at y=0 and y=1. I have set offset to be an integer+0.5 so the strip endpoints are integers. (This is different from earlier, when offset was chosen as an integer, resulting in the endpoints being [6*10^10+83951.5, 6*10^10+83952.5], but the decimals tend to cause confusion.) So for each offset value, 6 Euler products are calculated, of which the last 3 (which assume y=1) are more important, since abbeff has a general tendency to decrease as y (or t) increases. Maximizing the last 3 generally ensures the first three also stay high. I generally use this output in a spreadsheet and apply several filters (for e.g. the last 3 columns should each be greater than 4, or if that doesn't work then 3.5, or the product of the last 3 columns should be large, etc.) till I am left with a few candidates. This slicing and dicing I found to be easier in a spreadsheet, so I haven't created a script for it. Then the candidates can be checked for the actual estimates, for e.g. for 10^13+[3795,3796]:
abs(abbeff(10^13+3795,0.2,0))
abs(abbeff(10^13+3795.5,0.2,0))
abs(abbeff(10^13+3796,0.2,0))
abs(abbeff(10^13+3795,1,0.16))
abs(abbeff(10^13+3795.5,1,0.16))
abs(abbeff(10^13+3796,1,0.16))
Also, importantly, I realized there is a nuance in how we run the barrier scripts. Since we set the num_mesh_points of each side as ceiling(ddxbound), the gap between two consecutive points is about 1/ddxbound. The gap should ideally be at most minmodabb/ddxbound, so the barrier strip has to be chosen where abs(abbeff), even at y=1, t=tmax, is expected to stay above 1. We could edit the script and increase num_mesh_points if we do not find such an offset near some X, although such a scenario seems unlikely. We could, for example, keep increasing the numx parameter till we find an offset which has the desired behavior.
Interesting that the script got slightly slower on optimization. Not sure when that happens, but I have noticed that sometimes assigning values to intermediate variables can be slower than having a single formula (for e.g. b2=b1^2; b3=b2*b1; vs b3=b1^3), although the former generally helps if the intermediate variable approach eliminates expensive calculations like powers and exponentials.
@km-git-acc
Great. Will try that out!
I have prepared a separate script to generate the stored sums (rather than using the prt-switch). I have set the precision at such a value that it always achieves at least 100 digits of accuracy (calibrated against pari/gp). This will make the computation time a bit longer, but since we only have to build these stored sums once, that shouldn't matter too much. Will now commence work on reading these files back into the t-loop script.
https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/Barrierstoredsums.c
The output is both finalmatb and finalmata, separated by 3 lines. Have set t0=0 (t is only required for the calculation of N).
Let's now try to find a few promising Barrier-spots and get these rolling :-)
EDIT: Prof Tao made a comment earlier that we do need to show that we have selected sufficient Taylor expansion terms to achieve the desired accuracy. I was thinking about the following automated test we could do in our script: pick 4 'expanded' mesh points from the rectangle output at t=0 and compare these with abbeff(x,y,t) at those points. This way we could automatically ensure we have achieved the (pre-set) accuracy level for that Barrier calculation. Could that work?
@rudolph-git-acc The stored sums script is working too. While comparing the exact and approximate estimates at a few points should work, I was also thinking of whether a more formal analysis (but short of a proof) can be done so that we know the approximate number of taylorterms required for a given accuracy.
Firstly, it would be great if the tvals and zvals passed to the finalmatb and finalmata matrices have absolute values less than 1, as then they do not complicate matters when we take their powers. With tval we know this already. [a,b]expo[v] + (t/2)log(n0) is approximately (some thtarr component) + (some zarr component) - (t/2)log(N) + (t/2)log(N/2) +- i*t*(Pi/8) = approx (some thtarr comp) + (some zarr comp) - (t/2)log(2) +- 0.4it, which does seem to be less than 1 in magnitude at the t we care about. In fact, with t <= 0.2, abs([a,b]expo[v] + (t/2)log(n0)) should not cross sqrt((0.25+0.07)^2 + (0.25+0.08)^2) = approx 0.46 approx 0.5. So an additional power of zval would have contributed at most about 1/2 of the last zval power in the taylor expansion. An additional tval power would contribute at most 1/5 correspondingly.
Secondly, the taylor sequence x^n/n! for a given x peaks near n=floor(x) and then starts decreasing (this can be seen, for e.g., by analyzing when f(n) = x^n/n! - x^(n+1)/(n+1)! > 0). Beyond that, the addition of more taylor terms contributes less and less to exp(x). For matet, the largest xval by absolute magnitude is log(1/n0) = -log(n0) (at h=1), so (log(n0)^i/i!)*((log(n0)^2/4)^j/j!) should start decreasing (probably earlier) beyond [i,j] = [floor(log(n0)), floor(log(n0)^2/4)]. Of course a decrease alone is not enough; we want the last coefficient in the matrices to be smaller than our desired accuracy. Also, given that h^expob and h^expoa should decrease slowly with increasing h, if we assume they are of unit magnitude, we are adding N matet matrices, hence it is safer to have the last coefficient smaller by a further factor of N.
Thirdly, in the n0^([a,b]expo[v]+logtn0/2) factor, the real part of the exponent should be greatest at t=0 (with value at most 0.25).
So it seems that when (n0^0.25)*N*(log(n0)^imax/imax!)*((log(n0)^2/4)^jmax/jmax!)*(0.5^imax)*(0.2^jmax) (for t<=0.2) is less than the desired accuracy for some (imax,jmax) pair (there could be multiple such pairs), we should be fine.
ttermmagnitude(X,imax,jmax) = {
N=floor(sqrt(X/4/Pi)); n0=N/2;
return((n0^0.25)*N*(log(n0)^imax/factorial(imax))*((log(n0)^2/4)^jmax/factorial(jmax))*(0.5^imax)*(0.2^jmax));
}
Does this seem right? Assuming we are targeting an accuracy of 10^-30, the (50,50) matrices should work according to this till about X=10^14. Also, when t is not a well rounded fraction like 0.05 or 0.1 or 0.2, but a random looking number, it is possible we may see accuracy loss in the printed output from what we expected, due to rounding off errors. So it would be better to measure the abbeff accuracy at well rounded t.
I will also look for a better X candidate compared to the one in the last comment by searching among more offsets.
EDIT: Actually, it seems the below version is a better function to determine the number of taylorterms required. This looks for the largest coefficient along the imax and jmax borders in the matet matrix, instead of only the last coefficient. At X=10^11, it recommends a (63,63) matrix (which is conservative and somewhat lower may also work) for 10^-30 accuracy. A (50,50) matrix leads to the 30th digit in the approximation being different from the actual, hence is not enough.
ttermmagnitude(X,imax,jmax) = {
N=floor(sqrt(X/4/Pi)); n0=N/2; x1 = log(n0); x2 = x1^2/4;
print([floor(log(n0)),floor(log(n0)^2/4)]);
preterm = (n0^0.25)*N;
borderimax = vecmax(vector(jmax,j,(x1^(imax-1)/factorial(imax-1))*(x2^(j-1)/factorial(j-1))*(0.5^(imax-1))*(0.2^(j-1))));
borderjmax = vecmax(vector(imax,i,(x1^(i-1)/factorial(i-1))*(x2^(jmax-1)/factorial(jmax-1))*(0.5^(i-1))*(0.2^(jmax-1))));
bordermax = max(borderimax,borderjmax);
print(preterm*bordermax);
}
EDIT2: Evaluated the first 10^5 barrier candidates near X=10^13. Chose 5 after ad hoc filtering, and among those the best one was 10^13+[23447,23448]. minmodabb should start near 4 and end near 1.55.
./abbeff 10000000023447 1.0 0.16 10 -> 1.544543052
./abbeff 10000000023447.5 1.0 0.16 10 -> 1.579327953
./abbeff 10000000023448 1.0 0.16 10 -> 1.5429656
./abbeff 10000000023447 1.0 0.00 10 -> 4.012530414
./abbeff 10000000023447.5 1.0 0.00 10 -> 5.153041387
./abbeff 10000000023448 1.0 0.00 10 -> 4.938758078
./abbeff 10000000023447 0.2 0.00 10 -> 10.94343241
./abbeff 10000000023447.5 0.2 0.00 10 -> 22.54487384
./abbeff 10000000023448 0.2 0.00 10 -> 29.18534561
@km-git-acc
This looks sound to me!
Making good progress on the t-loop file input process, however I encountered an interesting issue in the output of the standard acb_mat_printd function:
[(1.658494482 - 0.331818635j) +/- (2.52e-116, 2.52e-116j), (1.681861494 - 0.1661787685j) +/- (3.3e-117, 3.28e-117j)]
Spotted too late that using a ',' delimiter won't work with this type of output. Have now decided to first print a proper csv of the matrices and also store the X, y0 that were used for this output. This keeps the input process much simpler and automatically guarantees the correct X, y0 will be used in the t-loop. The input parameters for the t-loop will therefore now only be:
t-start, t-end, expterms, taylorterms, inputfile, prt
The exp/taylorterms parameters will allow selecting any subset from the 100x100 stored sums.
Grateful if you could remove the current version of the Stored Sums maker from GitHub (it also still incorporated a print of y that I had forgotten to remove).
@rudolph-git-acc Removed the file as you suggested. Also, I am running barrierv2.c with X=10^13+23447, y=0.2, t=0.16 with 60 terms each (which the argument above indicates should give above 15 digits of accuracy). I have been wondering what kind of accuracy we should target (since it now has a direct bearing on both the stored sums and the eval computation time). Given that abbeff is expected to be of the order of magnitude 1, 30 digits seems a bit high.
Also it seems 100x100 matrices can handle X=10^18, with around 20 digits accuracy.
@km-git-acc
Great. Agree that we shouldn't go 'over the top' on the accuracy achieved. Please note that I found that in the printing of the detailed output in Barrier V2, the conversion to double before printing appears to cause inaccuracies in the output after 20 digits. I know how to fix this and will post an updated version of Barrier V2 later tonight (maybe you could already remove the current version from GitHub as well). Also, I have the file-out and back-in process working now and am currently testing it.
@km-git-acc
I have uploaded three new versions at GitHub in the Arb-section.
The first is an updated version of BarrierV2 (improved the detailed printing)
The other two are a script to print (to a file) stored sums of any size. Sample parameters:
./Barrierstoredsums 60000002099 0.2 100 100 100 >testfile.txt
X y0 expterms taylorterms digits >output file (contains: X,y0,finalmatb,finalmata)
and a script to process the rectangles over a choice of t-ranges taking the above files as input:
./BarrierTloop 0.0 0.2 50 50 0 testfile.txt
t-start t-end expterms taylorterms prt inputfile
The expterms and taylorterms can be chosen smaller than the storedsums resulting in smaller matrices to be filled. I opted for ranges of t, so we could split the computations amongst ourselves (or CPU-cores or clusters).
I think your idea to split the functionality this way works out really great! The second part is now even faster than the combined version: i.e. 7 mins (BarrierTloop) instead of 8 mins (BarrierV2). The build-up of the stored sums in the combined version took roughly 25 secs for x=6*10^10+2099, exp=taylor=50; not sure where the other 35 secs came from. I guess a lot of additional initialisation of matrices etc. no longer has to be done once you simply read from a file.
Let me know if it works at your end and also if you see further optimizations/requirements to be built in (I could of course add plenty of extra validations to make it all more robust, but haven't spent too much time on that).
@rudolph-git-acc I have verified the updated V2 script gives more accurate output. Removed the older file and renamed the new file to the older file's name.
The stored sums and t loop processor scripts are also working. I first saved 30x30 storedsums for X=10^8,y=0.2 with 50 digit accuracy.
Then checked whether the t loop loads the file and picks submatrices and the increase in accuracy as we increase the submatrix size.
(10,10) -> 0, 0.2, 100000000, 3.3129750876883001210 - 3.8905227423994721409*I
(20,20) -> 0, 0.2, 100000000, 3.3107042792322744586 - 3.8845193178991254385*I
(30,30) -> 0, 0.2, 100000000, 3.3107042792584674844 - 3.8845193179258757020*I
(A+B)/B0 eff actual : 3.3107042792584674844 - 3.8845193179258757020*I
Also, as I had mentioned earlier, it's better to check the accuracy at well-rounded X, y, t. For example, at intermediate values things are more complicated due to the difference between what is computed and what is printed.
(10,10) -> 0, 0.2, 100000000.00183150183, 3.3340106512289288842 - 3.9009662219673327936*I
(20,20) -> 0, 0.2, 100000000.00183150183, 3.3316492163887202356 - 3.8951502601946303872*I
(30,30) -> 0, 0.2, 100000000.00183150183, 3.3316492164148254952 - 3.8951502602191327091*I
(A+B)/B0 eff actual : 3.3316492163975638339 - 3.8951502602104860461*I
Also, as you mentioned, we can now distribute the t-loop part between multiple machines and between ourselves. It is also quite amenable to grid computing, where the central machine would send the same storedsums file to each participant, assign it a t range and get back the results (of course, in that framework even the storedsums calculation can be parallelized, resulting in a chain of central+distributed tasks).
Also, using the earlier v2 script I was able to complete the barrier X=10^13+23447, y=0.2, t=0.16 with 60x60 stored sums in 140 minutes or about 2.5 hours (744 rectangles). So choosing a good barrier spot does provide a nice speed boost. I think about 2 hours of this was spent on processing the rectangles (or about 10 seconds per rectangle), of which the majority (around 8 seconds per rectangle) must have been on the ddx and ddt bounds. barrier_summary_X_10pow13_23447_y_0.2_t_0.16.txt
In terms of further optimization, one thing which comes to mind is that, now that we can parallelize the t loop, we could do a much better approximation of the ddx/ddt bounds using multiple linear segments, for e.g. the green segments instead of the blue line in the plot.
EDIT: Have also asked about the taylorterms argument on the blog, https://terrytao.wordpress.com/2018/05/04/polymath15-ninth-thread-going-below-0-22/#comment-501072
@km-git-acc
Just back behind my machine.
Your results in the 10^13 domain look very promising and seem to bring us close to a successful Barrier computation (conditional on RH verification up to that height).
I actually hadn't appreciated the influence of more random-like test numbers on the accuracy of the output. Your examples illustrate this clearly.
I will try to incorporate the multiple linear segment approach for the ddt/ddx-bounds tomorrow. That looks very feasible and will definitely make things faster (and not only at 10^20). Which formula did you use to establish the recalc-points? I guess it is something that follows the shape of the curve as closely as possible. Maybe we could always split the chosen t-range into 6 line segments and make the recalc points tend/32, tend/16, tend/8, tend/4, tend/2, tend/1. Would that effectively 'meter' the real curve?
Will also aim to build the barrier_location_search routine as you explained it a few posts back.
P.S.: today I ran as a test: ./Barrierstoredsums 1000000000000000 0.2 100 100 100 >test10to15.txt and it successfully completed in 15 hours and 30 minutes (i.e. X=10^15).
@rudolph-git-acc Actually, I had originally thought that once the t-loop script is given a t range, it would use only one line segment for that range. So if two people/machines are covering the entire range, there would be two line segments. But what you described is a further optimization where, even for individual runs, the t range can be divided internally by the script. tend/32, tend/16, tend/8, tend/4, tend/2, tend/1 looks good, as it will follow the curve more closely at smaller t, where it is needed more. Also, maybe we should have the line segment version as a separate version (e.g. v4). I think the current t-loop script (v3), which uses exact ddt/ddx bounds, should be retained for anyone interested in knowing the minimum number of t steps or rectangles required, irrespective of the computation time.
10^15 is impressive. I too will start searching for a good barrier location near this. Also, since the 10^13+23447 barrier strip was validated, I will run the bounds script again and share the output of that as well (since the last run had been with a non-conservative version). Once the error bounds are also computed, the 10^13 exercise should be complete.
One more optimization I was able to think of (it is a speculation-based approach, so it may border on over-optimization). Right now the number of mesh points for a rectangle side is taken as ceil(ddxbound), since we expect minmodabb to be above 1 (which we would have almost certainly ensured through the barrier location search). But near t=0, minmodabb often starts higher (near 3 or 4 or more), and decreases slowly as t increases. At the next t step (arrived at in the usual fashion as a step of minmodabb/(2*ddtbound)), we don't know in advance what the new minmodabb will be, but we could just discount the latest minmodabb by a certain amount (e.g. minmodabb-0.5, or max(1,minmodabb-0.5)) and calculate the number of mesh points required as ceil(ddxbound/discounted minmodabb). If, after generating the abbeff estimates, the new minmodabb turns out to be higher than the discounted value, our assumption holds.
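A tiny sketch of that discounting (hypothetical helper, not in the current scripts; minmodabb here is the value found on the previous rectangle):

meshpoints(ddxbound, minmodabb) = ceil(ddxbound / max(1, minmodabb - 0.5));
\\ if the new rectangle's minmodabb then stays above max(1, minmodabb - 0.5) the mesh was fine;
\\ otherwise that rectangle would presumably have to be redone with more points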
@km-git-acc
The linear segment approach has now been incorporated in a new version called BarrierTloop-linearbounds, which has been uploaded to GitHub. I have added a parameter to choose the number of line segments desired, hence your original idea is equivalent to selecting #line segments = 1. If the line segment parameter is set to 0, then the bounds are calculated at each t-step. If line segments is set to e.g. 6, I establish 7 t-values (via the div-by-powers-of-2 method) and for each threshold value pre-calculate and store the ddx/ddt bounds (so a bit more time is required upfront; and thinking about it, since the t-values are now 'clean' recalculation thresholds fully determined by ourselves, all these computations could be done once-off as well, right?). Based on the values at the beginning and the end of a segment, I establish and use the linear equation dtabb = a*t + b until the next threshold.
I had actually understood that this was the optimisation you hinted at in your previous comment; at least the graph you attached clearly suggests such an approach. Maybe it was a subliminal message :-)
Splitting the task amongst multiple people should remain perfectly feasible with the optimised model and actually would assign 'more equally sized' computation tasks to each contributor compared to the single line approach (that will put the biggest computing burden on the user with the chunk closest to t=0, which of course could be optimised by splitting the single line in differently sized chunks, but hey, that is the optimised model again).
Already did some runs and the speed boost from this trade-off looks great: 6*10^10+2099, exp=tay=50, 6 line segments completed in just 2.5 mins (481 instead of 440 rectangles from the original 7-minute "always recalculate" scenario). With only 1 line segment selected, it actually took 11 minutes (and 1450 rectangles), so that trade-off works out unfavourably at this height. Played a bit with the parameters to find the sweet spot and found that at this height 6 or 7 segments look to be the optimal trade-off. Wondered if we can do even better, e.g. by changing the div-by-powers-of-2 approach to something that follows the exponentially shaped defining curve even better. In the output, I have marked the rectangle at which a bound recalculation occurs with a little arrow (after which you'll immediately see a speed acceleration).
P.S. 1:
In pari/gp there exists a fast function to evaluate a finite Euler product called prodeuler, so you might consider using this in your Barrier search script (if speed is an issue for that routine, of course).
P.S. 2:
Here is the StoredSums output for 6*10^10+2099 that I used for testing. Always the same size, whatever the height :-)
test.txt
@rudolph-git-acc 2.5 min is amazing. Given that when the 6*10^10 magnitude was first targeted, it took days, this is a story in itself. I got the corresponding run time as 2 min 50 sec.
Something other than powers of 2 may be possible, although the ddx and ddtbound expressions look quite difficult to analyze. Maybe differentiating them wrt t could give some insight.
During experimentation with the script, I noticed that the tend/2^k approach works only when ts=0. If ts is set to a higher value (e.g. 0.1), some of the recalc points probably fall below ts, causing an abort message. This should be fixable by using ts + (te-ts)/2^k as the recalc points.
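For illustration, the adjusted recalculation thresholds could be generated like this in pari/gp (nseg and the function name are placeholders, not the script's actual parameters):
\\ nseg+1 recalculation t-values that respect a nonzero ts: ts + (te - ts)/2^k for k = 0..nseg
recalc_points(ts, te, nseg) = vector(nseg + 1, j, ts + (te - ts)/2^(j - 1));
\\ e.g. recalc_points(0.1, 0.2, 6) = [0.2, 0.15, 0.125, ...], all of which stay >= ts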
prodeuler somehow did not give a speed boost, although the expression now looks more intuitive.
barrier_strip_search2(bfile,X,numx=10^5,nprimes=5) = {
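\\ x runs over the half-integers X+tht+0.5; for offsets -0.5, 0, +0.5 around x,
\\ aty0 and aty1 hold |prod_{p <= nthprime} 1/(1 - p^(-s))|, a truncated Euler product,
\\ at s = (1 + y - I*(x+offset))/2 for y = 0 and y = 1;
\\ the row [tht+0.5, aty0, aty1] is appended to bfile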
nthprime = prime(nprimes);
offset = [-0.5,0,0.5];
for(tht=0,numx,x=X+tht+0.5;\
aty0 = vector(3,i,abs(prodeuler(p=2,nthprime,1/(1-1/p^(1/2-I*(x+offset[i])/2)))));\
aty1 = vector(3,i,abs(prodeuler(p=2,nthprime,1/(1-1/p^(1-I*(x+offset[i])/2)))));\
row = concat([tht+0.5],aty0); row = concat(row,aty1);\
print(tht); write(bfile,row);\
);
}
Yeah, the storedsums output is like a signature of its corresponding barrier region. I am planning to run the 10^13+23447 case with the latest script again to check the time gains. With V2, the t-loop part had taken about 2 hours.
@km-git-acc
Thanks for spotting the error. I have fixed it now and uploaded a c1 version on GitHub.
I have now also created a Barrier_search script and placed it on GitHub. It works as follows:
./Barrier_strip_search 1000000 12 4 2 0.2 0.16 0
The inputs are, respectively: x, numx, nprimes, threshold, y0, t0, switch.
The script only outputs a line when the last three (y=1 case) fields are all above the threshold.
When the switch is set to 1, the script also calculates and outputs the corresponding 6 abbeff values for y=(y0, 1) and t=(0, t0). The script approximates the last three values quite reasonably with only 30 primes (use for instance: ./Barrier_strip_search 1000000 12 30 2 0.2 0 1).
The idea is to identify a handful of promising locations by reducing the threshold, and only then to switch on the abbeff data for that limited set.
Edit 1: just spotted that the number of digits to print X at larger heights was set too small. Please use the c1 version (and just remove the previous one).
Edit 2: spotted and fixed an error in the threshold (it takes the floor of the parameter) and also moved the switch-parameter to the end of the input sequence (for ease of use). Script uploaded as c2.
Tried it for the 10^13 domain and it also finds the +23447 :-)
10000000023447.5, 13.35640586, 16.94953454, 12.39350419, 3.698714942, 4.031050259, 3.682832972,
-> 10000000023447.0000000000000000, 0.2000000000, 0, 10.94343241
-> 10000000023447.5000000000000000, 0.2000000000, 0, 22.54487384
-> 10000000023448.0000000000000000, 0.2000000000, 0, 29.18534561
-> 10000000023447.0000000000000000, 1.000000000, 0, 4.012530414
-> 10000000023447.5000000000000000, 1.000000000, 0, 5.153041387
-> 10000000023448.0000000000000000, 1.000000000, 0, 4.938758078
And for the 10^15 domain, I believe we should opt for this one:
1000000000021104.5, 15.93593154, 23.21281313, 14.61689083, 3.979260323, 4.364420778, 3.901149810,
-> 1000000000021104.00000000000000, 0.2000000000, 0, 8.800734191
-> 1000000000021104.50000000000000, 0.2000000000, 0, 20.36479545
-> 1000000000021105.00000000000000, 0.2000000000, 0, 28.46237039
-> 1000000000021104.00000000000000, 1.000000000, 0, 4.253271506
-> 1000000000021104.50000000000000, 1.000000000, 0, 5.171327824
-> 1000000000021105.00000000000000, 1.000000000, 0, 4.874236176
Do you get the same data?
@rudolph-git-acc The barrier search script is quite productive :) much better than the ad hoc spreadsheet approach. Following the 2nd example, I used 30 primes for the 10^13 and 10^15 exercises. Also, I first tried a high threshold (e.g. 5), didn't get any results, and kept reducing the threshold by 0.5 until I got some results. Near 10^13, the 9995 offset seems to be even better. Near 10^15, 21104 and 43294 compete closely with each other.
./Barrier_strip_search 10000000000000 50000 30 5.0 0.2 0 0
./Barrier_strip_search 10000000000000 50000 30 4.5 0.2 0 0
10000000009995.5000000000000000, 31.22384913, 102.1781932, 58.12198448, 4.735985233, 5.792124833, 4.799261598,
./Barrier_strip_search 10000000000000 50000 30 4.0 0.2 0 0
10000000009995.5000000000000000, 31.22384913, 102.1781932, 58.12198448, 4.735985233, 5.792124833, 4.799261598,
10000000017247.5000000000000000, 19.50353402, 41.48351114, 25.24284057, 4.007949872, 4.804129870, 4.359622380,
10000000023447.5000000000000000, 19.19100255, 75.41612090, 54.92067967, 4.073843011, 5.308319187, 4.864758202,
./Barrier_strip_search 10000000009995 1 30 4.5 0.2 0 1
10000000009995.5000000000000000, 31.22384913, 102.1781932, 58.12198448, 4.735985233, 5.792124833, 4.799261598,
-> 10000000009995.0000000000000000, 0.2000000000, 0, 37.19386243
-> 10000000009995.5000000000000000, 0.2000000000, 0, 55.19403281
-> 10000000009996.0000000000000000, 0.2000000000, 0, 12.15480737
-> 10000000009995.0000000000000000, 1.000000000, 0, 5.032672517
-> 10000000009995.5000000000000000, 1.000000000, 0, 5.922356219
-> 10000000009996.0000000000000000, 1.000000000, 0, 4.514604323
./Barrier_strip_search 10000000009995 1 30 4.5 0.2 0.16 1
10000000009995.5000000000000000, 31.22384913, 102.1781932, 58.12198448, 4.735985233, 5.792124833, 4.799261598,
-> 10000000009995.0000000000000000, 0.2000000000, 0, 37.19386243
-> 10000000009995.5000000000000000, 0.2000000000, 0, 55.19403281
-> 10000000009996.0000000000000000, 0.2000000000, 0, 12.15480737
-> 10000000009995.0000000000000000, 1.000000000, 0.1600000000, 1.564782511
-> 10000000009995.5000000000000000, 1.000000000, 0.1600000000, 1.582115609
-> 10000000009996.0000000000000000, 1.000000000, 0.1600000000, 1.524421713
./Barrier_strip_search 1000000000000000 50000 30 5.0 0.2 0 0
./Barrier_strip_search 1000000000000000 50000 30 4.5 0.2 0 0
./Barrier_strip_search 1000000000000000 50000 30 4.0 0.2 0 0
1000000000021104.50000000000000, 25.19556325, 47.89777820, 29.70001840, 4.412177355, 5.287588683, 4.750221849,
1000000000043294.50000000000000, 33.02914324, 60.12526624, 29.05851743, 4.622190818, 5.316723815, 4.420978007,
1000000000049712.50000000000000, 19.74153936, 50.68331711, 38.30126601, 4.171285282, 5.166144572, 4.733247879,
./Barrier_strip_search 1000000000021104 1 30 4.0 0.2 0 1
1000000000021104.50000000000000, 25.19556325, 47.89777820, 29.70001840, 4.412177355, 5.287588683, 4.750221849,
-> 1000000000021104.00000000000000, 0.2000000000, 0, 8.800734191
-> 1000000000021104.50000000000000, 0.2000000000, 0, 20.36479545
-> 1000000000021105.00000000000000, 0.2000000000, 0, 28.46237039
-> 1000000000021104.00000000000000, 1.000000000, 0, 4.253271506
-> 1000000000021104.50000000000000, 1.000000000, 0, 5.171327824
-> 1000000000021105.00000000000000, 1.000000000, 0, 4.874236176
./Barrier_strip_search 1000000000043294 1 30 4.0 0.2 0 1
1000000000043294.50000000000000, 33.02914324, 60.12526624, 29.05851743, 4.622190818, 5.316723815, 4.420978007,
-> 1000000000043294.00000000000000, 0.2000000000, 0, 18.24042319
-> 1000000000043294.50000000000000, 0.2000000000, 0, 44.16859928
-> 1000000000043295.00000000000000, 0.2000000000, 0, 11.14111455
-> 1000000000043294.00000000000000, 1.000000000, 0, 4.718792030
-> 1000000000043294.50000000000000, 1.000000000, 0, 5.416209081
-> 1000000000043295.00000000000000, 1.000000000, 0, 4.286729295
One thing I have realized after many experiments is that the y=1 abbeff estimates almost certainly decide minmodabb; I have yet to see a case where the y=0 estimates are lower than the y=1 estimates. In the detailed output when switch=1, we may therefore be able to combine the last two queries (run, for example, for 10^13+9995 above), giving three values each for y=1, t=0 and y=1, t=tmax.
Also, the 6 abbeff estimates at 10^15 take more time, as expected. Since high accuracy of the approximation is not necessary during the location search, I was wondering whether we can just sum the first 100k or 200k summands and still get almost the same abs(abbeff) estimate.
With 100k summands, at t=0, y=1, x=10^13 + 23447 + [0, 0.5, 1], the approximate abs(abbeff) comes out as [4.01, 5.16, 4.93], which is pretty close to the original.
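A minimal pari/gp sketch of this truncation idea; abbeff_summand is a hypothetical stand-in for the n-th term of the (A+B)/B0 sum in the multieval script, and N = floor(sqrt(x/(4*Pi))) is assumed to be the full number of summands:
\\ scouting estimate: cap the summation at nmax terms instead of the full N
abbeff_scout(x, y, t, nmax) = {
  my(N = floor(sqrt(x/(4*Pi))));
  sum(n = 1, min(N, nmax), abbeff_summand(n, x, y, t));  \\ abbeff_summand: hypothetical helper
}
\\ usage: abs(abbeff_scout(10^13 + 23447, 1, 0, 100000))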
@rudolph-git-acc Also, it seems the speculative approach to leverage the high minmodabb values near t=0 can be tweaked and made deterministic. The below took me some hours to sort out, as I had to go back to some of the basics. Does this look sound?
Firstly, going back to how the Lagrange bounds work:
Case 1: At the beginning of an interval the function value is F1, and we search next at F1/ddxb, since the function could be zero there. If the value there instead turns out to be F2, then the minimum over the interval is attained at the x where F1 - ddxb*x = F2 - ddxb*(F1/ddxb - x); solving gives ddxb*x = F1 - F2/2, hence the minimum over the interval is F2/2.
Case 2: At the beginning of an interval the function value is F1, and we instead search next at 1/ddxb. If the value there turns out to be F2, the minimum is attained at the x where F1 - ddxb*x = F2 - ddxb*(1/ddxb - x); solving gives ddxb*x = (F1 - F2 + 1)/2, hence the minimum over the interval is (F1 + F2 - 1)/2.
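The same two cases in LaTeX, writing d for ddxb, L for the step length and x* for the crossing point (just a restatement of the above):
$$\text{Case 1 } (L = F_1/d):\quad F_1 - d x^{*} = F_2 - d(L - x^{*}) \;\Rightarrow\; d x^{*} = F_1 - F_2/2,\qquad \min = F_1 - d x^{*} = F_2/2$$
$$\text{Case 2 } (L = 1/d):\quad d x^{*} = (F_1 - F_2 + 1)/2,\qquad \min = F_1 - d x^{*} = (F_1 + F_2 - 1)/2$$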
Although with the multieval approach we are following Case 2, the minimum we currently assume for the interval is that of Case 1. Hence, while choosing the next t-step we take t_next = t_curr + (minmodabb/2)/ddtb, whereas we could take t_next = t_curr + ((minmodabb + minmodabb - 1)/2)/ddtb = t_curr + (minmodabb - 0.5)/ddtb. Rectangle traversal undergoes no change.
Near t=0, where minmodabb is high, this should result in almost double the earlier t-step size. For example, if minmodabb is 4 and ddtb is 100, the next t would be t_curr + 0.035 instead of t_curr + 0.02. Essentially, the extra work we do within a rectangle can be leveraged to choose a farther rectangle.
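A one-line pari/gp comparison of the two step rules, to make the example concrete (variable names as used in the T-loop discussion):
t_next_old(t_curr, minmodabb, ddtb) = t_curr + (minmodabb/2)/ddtb;      \\ current rule
t_next_new(t_curr, minmodabb, ddtb) = t_curr + (minmodabb - 0.5)/ddtb;  \\ proposed Case 2 rule
\\ with minmodabb = 4 and ddtb = 100: step of 0.02 vs 0.035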
@km-git-acc
Great idea on using just a limited number of terms for abbeff. I have changed the Barrier_script such that if switch > 1, then the number of terms N in abbeff becomes (sw-1)*100000. I called it V1 now, and all the old versions can be deleted.
The new script allowed me to now very quickly compute the following optimised barrier locations (details in the document attached):
X,        offset,   y0,   t0,   Lambda, mollifier
6*10^10,  +155019,  0.20, 0.20, 0.22,   euler3
10^11,    +78031,   0.20, 0.18, 0.20,   euler3
10^12,    +46880,   0.20, 0.17, 0.19,   euler3
10^13,    +9995,    0.20, 0.16, 0.18,   euler3
10^14,    +2659,    0.20, 0.15, 0.17,   euler3
10^15,    +21104,   0.20, 0.14, 0.16,   euler3
10^16,    +172302,  0.20, 0.13, 0.15,   euler3
10^17,    +31656,   0.20, 0.12, 0.14,   euler2
10^18,    +44592,   0.20, 0.11, 0.13,   tbd
10^19,    +12010,   0.20, 0.10, 0.12,   tbd
10^20,    +37221,   0.20, 0.09, 0.11,   tbd
Detailed data supporting Barrier location choice.txt
I haven't been able to verify yet whether the abbeff lower bound is positive for x > 10^17 at the proposed y0, t0 combinations; memory limitations in the bound-script at these heights still prevent this.
Note that even for 6*10^10 there appears to be a better location. To prove this is indeed the case, I first ran the stored sums and then BarrierTloop for x = 60000155019, y0=t0=0.2, terms=50, 6 line segments, and the latter completed in just 89 secs (295 rectangles) compared to the 150 secs earlier at 60000002099.
If you could confirm the above locations, I think we are in a position to start computing the 100x100 stored sums for these locations and store them on GitHub in a dedicated directory.
Once prof. Tao confirms your proposed algorithm for assuring that sufficient taylorterms have been selected for the desired accuracy, I wonder whether we could further simplify the process by having the prescribed number of taylorterms produced as automatic 'stored' output from the stored-sums script, and no longer allow the terms to be re-entered as input to the T-loop script. This would of course require an extra parameter 'desired T-loop accuracy' for the stored-sums process, and a change to that script to compute the required number of terms, but it would then guarantee that sufficient terms are being used (for a now fixed level of accuracy). I am also still wondering about pre-computing the now limited set of ddx/ddt bounds already in the stored-sums process. The number of line segments to be applied would then be an additional parameter of that script, and the t-values at which 'recalculation' has to take place, together with the ddx/ddt bounds, would be transferred into the T-loop process, simplifying it further, although it would also make it less flexible for experimenting with different t-ranges or grid sharing. Grateful for your thoughts on this.
Your idea on the minmodabb looks sound to me, but it is probably also good to verify it with prof. Tao on the blog. I have already tested the new function to establish the next t and did the same run for x=6*10^10+155019; it now completed in just... ...drumroll... ...53 secs (179 rectangles). If all correct, we appear to have improved speed by a factor of 3 in the last two days!
The adjusted T-loop script has been uploaded to GitHub (extension 'new minmodabb'), and the stored sums output for x=6*10^10+155019 is attached so you can replicate the run. test155019.txt
@km-git-acc
Also just completed x=10^13+9995, y0=0.2, t0=0.16. The stored sums (attached) obviously take a while to compute, but once they are done the T-loop takes just 509 secs (468 rectangles).
Will now compute the 10^15+21104 stored sums (expect it to take approx 16 hours).
@rudolph-git-acc I have added the sawtooth and approximate bounds to the repo here https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/pari/abbeff_largex_bounds.txt
Also, the full mollifier has been changed back to mollifier, and the mollifier with prime terms has been changed to mollifier_prime. There are also multiple functions now for the bounds:
ep_tbound and ep_lbound, which are exact and use the full mollifier;
ep_tbound_mollp and ep_lbound_mollp, which are exact and use the prime mollifier;
ep_tbound_approx and ep_lbound_approx, which currently use the prime mollifier but can use the full one too.
Both mollifier functions also needed the t parameter, which I have added. It worked earlier if t was assigned a value at the very beginning, but that could have been prone to errors.
Added the ball script for the sawtooth bounds to the repo here: https://github.com/km-git-acc/dbn_upper_bound/blob/master/dbn_upper_bound/arb/Mollifier.3.and.5.sawtooth.bound.script.txt (you may want to update it with a generalized mollifier).
Also, I added an approximate function for (A+B)/B0 using N0 terms to the barrier multieval script. I have tried it at x=10^100 and it works (though the precision \p may have to be increased first to around log(x)). Of course, when y and t are small the quality of the approximation is not clear. The approximation gets better the larger the B exponents are, and should be good enough when real(exponents) >> 1 (similar to how the zeta function behaves). Of course, when real(exponents) >> 1, (A+B)/B0 moves close to 1, so the area of interest is small t and y, where the approximation may not be that great.