m3g / CellListMap.jl

Flexible implementation of cell lists to map the calculations of particle-pair dependent functions, such as forces, energies, neighbor lists, etc.
https://m3g.github.io/CellListMap.jl/

Is it possible to store some historical information of pairs? #81

Closed maphdze closed 1 year ago

maphdze commented 1 year ago

I want to calculate a damage variable, which is somewhat similar to the force in an MD simulation, with CellListMap.jl. In my problem, the pairwise damage is related to the deformation history of the pair, and I would like to store a historical value for each pair during the calculation. Is there a way to do this with CellListMap.jl? I will explain it with code. Attached is my code.

using CellListMap
using StaticArrays
using BenchmarkTools
@inline function my_norm(v::AbstractVector{T}) where {T}
    n = zero(T)
    @simd for x in v
        n += x ^ 2
    end
    n = sqrt(n)
    return n
end
function cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage)
    d_new = my_norm(x + u_points[i] - y -u_points[j])
    λ = d_new - sqrt(d2)
    pair_dam = 0.0
    if λ > λᵦ
        pair_dam = 1.0
    end
    damage[i] += pair_dam * points_vol[j]
    damage[j] += pair_dam * points_vol[i]
    return damage
end
function map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ)
    damage_map .= 0.0
    map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map),damage_map,box,cl)
end
N = 300000
ℓ = 0.05
γ = 1.0e4
λᵦ = 1.0e-6
points = rand(SVector{2,Float64},N)
box = Box(limits(points),ℓ)
cl = CellList(points,box)
u_points = 1e-4 * rand(SVector{2,Float64},N)
points_vol = rand(N)
damage_map = zeros(Float64,N)
map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map),damage_map,box,cl)
@btime map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ)

In the function cal_damage!,

if λ > λᵦ
        pair_dam = 1.0
end

I would like to replace the λᵦ with a historical quantity, e.g., the maximum of λ over the previous calculations, like

if λ > maximum of λ in [0,t]
        pair_dam = 1.0 
end

Is this possible with CellListMap.jl? Thanks for your help.

lmiq commented 1 year ago

This is something that should be handled as a parameter passed by the closure, that is, in the command:

(x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map)

You would need to pass a vector of previous lambdas as well to the cal_damage! function, as you do with the other parameters. And then you can do whatever you want with it inside.

The only concern is that, if the lambdas are updated within that function, you would need to take care of it being thread-safe, which complicates everything quite a bit. But if the lambdas are not updated within the function, then they are a parameter like the others.
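For instance (just a sketch: λ_history, num_pairs, and the extra argument to cal_damage! are hypothetical here, and how you index the history per pair is up to you), the extra array simply rides along in the closure:

λ_history = zeros(num_pairs)   # num_pairs and the pair-indexing scheme are yours to define

map_pairwise!(
    (x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λ_history,damage_map),
    damage_map, box, cl,
)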

Something unrelated, but that caught my attention. In this line:

    d_new = my_norm(x + u_points[i] - y -u_points[j])

you are using the internal representation of the coordinates x and y, and something else, u_points, which I don't know what it is. The problem here is that x and y are transformed relative to the originally provided coordinates when the cell lists are constructed, which means that their absolute positions in space are not necessarily the ones you expect relative to what you have in u_points. In general you should only use x and y to compute properties that depend on the relative positions of x and y.

If you need to compute something that is dependent on the original positions of the particles, you again will have to pass the original coordinates in the closure that is mapped. For example:

First, the function to be mapped would not need the internal x and y coordinates at all, but would require the points array to be given. Thus, your function would become:

# changed following line: x,y => points
function cal_damage!(points,i,j,d2,u_points,points_vol,γ,λᵦ,damage)
    # changed following line: x => points[i]  ;  y => points[j]
    d_new = my_norm(points[i] + u_points[i] - points[j] - u_points[j])
    λ = d_new - sqrt(d2)
    pair_dam = 0.0
    if λ > λᵦ
        pair_dam = 1.0
    end
    damage[i] += pair_dam * points_vol[j]
    damage[j] += pair_dam * points_vol[i]
    return damage
end

And then you can close-over the coordinates as well:

# changed following line: add points as a parameter 
function map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ,points)
    damage_map .= 0.0
    # changed following line   x,y => points
    map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(points,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map),damage_map,box,cl)
end
maphdze commented 1 year ago

Thanks for your reply!

The only concern is that, if the lambdas are updated within that function, you would need to take care of it being thread-safe, which complicates everything quite a bit.

That's actually what I am concerned about. The lambda is calculated inside the map_pairwise! function by the following lines:

d_new = my_norm(points[i]+ u_points[i] - points[j] -u_points[j])
λ = d_new - sqrt(d2)

So the historical maximum quantity of lambda should also be updated inside the 'map_pairwise!' function. I think it is something like the d2 variable. But I have no idea how the pairwise quantities are stored for the 'map_pairwise!' function. Could you please give me some suggestions about how to modify the code for this purpose? In my previous serial code, the point pairs are stored in a Vector, the historical quantities of the pairs are also stored in a Vector, and I can update the corresponding historical quantities when looping through all pairs.

you are using the internal representation of the coordinates x and y, and something else, u_points, which I don't know what it is. The problem here is that x and y are transformed relative to the originally provided coordinates when the cell lists are constructed, which means that their absolute positions in space are not necessarily the ones you expect relative to what you have in u_points. In general you should only use x and y to compute properties that depend on the relative positions of x and y.

Thanks for pointing that out. The u_points here denotes the displacement of the points, and d_new here is the new length of a pair after deformation. So the relative positions can be used here as

x - y = points[i] - points[j]
lmiq commented 1 year ago

"So the historical maximum quantity of lambda should also be updated inside the 'map_pairwise!' function. "

But you need to update the maximum lambda for each iteration, I guess? Because there you are computing a new lambda for each pair. Is there a different lambda for each pair, or is it a global one? If there is only one global lambda for all pairs, does it have to be updated every time a new greater lambda is found?

If it has to be updated every time a new greater lambda is found (not only at each iteration), then it is impossible to write a parallel version of the algorithm. Or, actually, you could use a lock, but that will greatly reduce the performance of parallel runs. However, the greatest problem would be that the result would depend on the order in which the pairs are evaluated, in such a way that every parallel run would give a different result, and it is also unclear what a serial version of the algorithm would be.
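For reference, a sketch of the lock option (the names λ_global and lk are illustrative, and the order-dependence caveat above still applies):

lk = ReentrantLock()
λ_global = Ref(0.0)

function update_global_λ!(λ_global, λ, lk)
    # serialize the read-modify-write so that concurrent updates do not race
    lock(lk) do
        λ_global[] = max(λ_global[], λ)
    end
    return λ_global
end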

Depending on those answers, the optimal solution is one or the other. In the simplest case (there is one global lambda at each iteration, and you update it at the end of the iteration with the greatest lambda found so far), one option is to use as output not only the damage, but a tuple with the damage and the maximum lambda (or, even easier though somewhat ugly, a damage array with N+1 positions, storing the lambda in the last position).
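A sketch of that last idea, assuming the output array has N+1 positions (damage in 1:N, the running maximum λ of the current iteration in the last slot) and that a custom reducer is passed through the reduce keyword; the function names below are illustrative:

# λ_max_prev is the maximum λ of the *previous* iteration and is read-only here
function pair_contribution!(i, j, λ, λ_max_prev, points_vol, output)
    pair_dam = λ > λ_max_prev ? 1.0 : 0.0
    output[i] += pair_dam * points_vol[j]
    output[j] += pair_dam * points_vol[i]
    output[end] = max(output[end], λ)   # track this iteration's maximum λ
    return output
end

# sum the damage part of the threaded copies, take the maximum of the last slot
function reduce_damage_and_λ(output, output_threaded)
    output .= 0.0
    for ot in output_threaded
        @views output[1:end-1] .+= ot[1:end-1]
        output[end] = max(output[end], ot[end])
    end
    return output
end

# schematically (compute_λ stands for whatever computes λ for the pair):
# map_pairwise!((x,y,i,j,d2,out) -> pair_contribution!(i, j, compute_λ(x,y,i,j,d2), λ_max_prev, points_vol, out),
#               output, box, cl, reduce = reduce_damage_and_λ)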

Anyway, we need more details to discuss the optimal way to deal with that.

"Thanks for pointing out it. The u_points here denotes the displacement of points, and d_new here is the new length of a pair after deformation"

You're right, since you are using the difference of coordinates in that operation, what you are doing there is correct. Sorry for the noise.

lmiq commented 1 year ago

One way to visualize all this is to have an implementation of the same thing with the naive loop:

# data preparation, etc
# The following loop is replaced by the CellListMap machinery:
for i in 1:N-1
    for j in i+1:N
        d = norm(x[i] - x[j])
        if d < cutoff
            # do stuff
        end
    end
end

It is actually good to have such a naive implementation, not only to make everything clear, but to compare the results after the optimizations. With a correct naive implementation in hand it will be much easier to understand the complete calculation and to help.

maphdze commented 1 year ago

Thanks for your reply!

But you need to update the maximum lambda for each iteration, I guess? Because there you are computing a new lambda for each pair. Is there a different lambda for each pair, or is it a global one? If there is only one global lambda for all pairs, does it have to be updated every time a new greater lambda is found?

The historical quantity is defined for every pair. You are right, the maximum lambda should be updated at each iteration.

Is it possible to do it like this? Define a lambda_history Vector whose length is equal to the number of pairs, then define damage_threaded as in CellListMap.jl. The code would be something like:

Threads.@threads for i_pair in eachindex(all_pairs)
    i_thread = Threads.threadid()
    i_point = all_pairs[i_pair][1]
    j_point = all_pairs[i_pair][2]
    # calculate the current lambda
    if current_lambda > lambda_history[i_pair]
        lambda_history[i_pair] = current_lambda
    end
    # calculate pair_dam
    damage_threaded[i_thread][i_point] += pair_dam
    damage_threaded[i_thread][j_point] += pair_dam
end
## reduce
for i_thread in 1:Threads.nthreads()
    damage .+= damage_threaded[i_thread]
    damage_threaded[i_thread] .= 0
end

As lambda_history has a length equal to the number of pairs, reads and writes to lambda_history from different threads will not conflict with each other. I think it is possible, and I would like to look into CoreComputing.jl to find out how to parallelize it efficiently.

maphdze commented 1 year ago

One way to visualize all this is to have an implementation of the same thing with the naive loop: [...] With a correct naive implementation in hand it will be much easier to understand the complete calculation and to help.

Yes, this is clear and easy to implement, and the result is certainly correct. We should always compare the optimized result with this one.

lmiq commented 1 year ago

Ok, so if the lambda is specific to each pair, that is fairly easy as well, because its update will be thread-safe anyway. Then you only need to pass the lambda_pairs array to map_pairwise! as well:

# changed following line: add lambda_pairs as a parameter
function map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ,lambda_pairs)
    damage_map .= 0.0
    # changed following line: pass lambda_pairs through the closure as well
    map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map,lambda_pairs),damage_map,box,cl)
end

and your cal_damage! has to update the lambda_pair for the specific pair accordingly:

function cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage,lambda_pairs)
    d_new = my_norm(x + u_points[i] - y -u_points[j])
    λ = d_new - sqrt(d2)
    pair_dam = 0.0
    # update lambda_pair here somewhere and use it
    if λ > λᵦ
        pair_dam = 1.0
    end
    damage[i] += pair_dam * points_vol[j]
    damage[j] += pair_dam * points_vol[i]
    return damage
end

The tricky part is to extract the index of the pair in the lambda_pairs array. The easiest alternative is to just store a matrix of size NxN and ignore half of it:

if i > j
    lambda_pair[i,j] = ...
else
    lambda_pair[j,i] = ...
end

If you want to save half of the memory (if that is relevant to you at all; it may not be), you need to linear-index the list of pairs somehow. Considering that the storage includes the cases where i == j, which are not useful, you can do:

julia> function index_in_lambda_pairs(N,i,j)
           # order the indices so that i >= j, then use the packed
           # lower-triangular (diagonal included) linear index;
           # N is kept in the signature only for convenience
           if i < j
               i, j = j, i
           end
           return div(i*(i-1),2) + j
       end
index_in_lambda_pairs (generic function with 1 method)

julia> N = 3
3

julia> lambda_pairs = zeros(div(N*(N-1),2) + N); # triangular, including the diagonal

julia> length(lambda_pairs)
6

julia> index_in_lambda_pairs(3,1,2)
2

julia> index_in_lambda_pairs(3,2,1)
2

julia> index_in_lambda_pairs(3,3,2)
5

julia> index_in_lambda_pairs(3,2,3)
5

(indexing excluding the diagonal is much more cumbersome).
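If one does want to exclude the diagonal, a sketch of such a packing (the function name is illustrative): pairs with i > j map contiguously into a vector of length N*(N-1)/2.

function index_excluding_diagonal(i, j)
    i, j = i > j ? (i, j) : (j, i)        # order the indices so that i > j
    return div((i - 1) * (i - 2), 2) + j  # packed strict lower triangle
end

# e.g. for N = 3 the pairs (2,1), (3,1), (3,2) map to 1, 2, 3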

lmiq commented 1 year ago

I think it is possible, and I would like to look into CoreComputing.jl to find out how to parallelize it efficiently.

I don't think you need to go to any of the internals of CellListMap for this. You can handle all that from the interface. In the specific case where you want to update many different properties at the same time, you need to store all output variables inside a struct, as in this example:

https://m3g.github.io/CellListMap.jl/stable/PeriodicSystems/#Energy-and-forces

You could use something similar to have a struct storing the damage and the lambda_pairs, but this would be overkill in your case, IMO, particularly because your lambda_pairs is already thread-safe.

maphdze commented 1 year ago

Thanks for your reply!

I think it is possible

Actually I have written a simple code with @threads. The code is shown here:

using CellListMap
using StaticArrays
using BenchmarkTools
using Base.Threads
using Revise
@inline function my_norm(v::AbstractVector{T}) where {T}
    n = zero(T)
    @simd for x in v
        n += x ^ 2
    end
    n = sqrt(n)
    return n
end
function thread_damage!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
    Threads.@threads for i_pair in eachindex(point_pairs)
        i_thread::Int64 = Threads.threadid()
        i_point = point_pairs[i_pair][1]
        j_point = point_pairs[i_pair][2]
        d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
        λ = d_new - point_pairs[i_pair][3]
        if λ > pairhis[i_pair]
            pairhis[i_pair] = λ
        end
        pair_dam = 0.0
        if pairhis[i_pair] > λᵦ
            pair_dam = 1.0
        end
        damage_threaded[i_thread][i_point] += pair_dam * points_vol[j_point]
        damage_threaded[i_thread][j_point] += pair_dam * points_vol[i_point]
    end
end
function reduce!(output::T,output_threaded::Vector{T}) where {T <: AbstractArray}
    output .= 0.0
    for i_thread in eachindex(output_threaded)
        @. output += output_threaded[i_thread]
        output_threaded[i_thread] .= 0.0
    end
    return output
end   
function cal_damage!(damage,damage_threaded,points,u_points,points_vol,λᵦ,pairhis,pairs)  
    thread_damage!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,pairs)
    reduce!(damage,damage_threaded)
end
@inline function loop_damage!(damage,points,u_points,points_vol,point_pairs,λᵦ,pairhis_loop)
    damage .= 0.0
    pair_dam = 0.0
    i_pair = 0
    @inbounds for pair in point_pairs
        i_pair += 1
        d_new = my_norm(points[pair[1]] + u_points[pair[1]] - points[pair[2]] - u_points[pair[2]])
        λ = d_new - pair[3]
        if λ > pairhis_loop[i_pair]
            pairhis_loop[i_pair] = λ
        end
        pair_dam = 0.0
        if pairhis_loop[i_pair] > λᵦ
            pair_dam = 1.0
        end
        damage[pair[1]] += pair_dam * points_vol[pair[2]]
        damage[pair[2]] += pair_dam * points_vol[pair[1]]
    end
    return damage
end
N = 100000
ℓ = 0.05
γ = 1.0e4
λᵦ = 1.0e-6
points = rand(SVector{2,Float64},N)
box = Box(limits(points),ℓ)
cl = CellList(points,box)
u_points = 1e-4 * rand(SVector{2,Float64},N)
points_vol = rand(N)
damage_para = zeros(Float64,N)
damage_threaded = [deepcopy(damage_para) for i in 1:Threads.nthreads()]
damage_loop = zeros(Float64,N)
point_pairs = neighborlist(points,ℓ)
num_pairs = length(point_pairs)
pairhis_para = zeros(Float64,num_pairs)
pairhis_loop = zeros(Float64,num_pairs)
@btime loop_damage!($damage_loop,$points,$u_points,$points_vol,$point_pairs,$λᵦ,$pairhis_loop)
@btime cal_damage!($damage_para,$damage_threaded,$points,$u_points,$points_vol,$λᵦ,$pairhis_para,$point_pairs)
damage_para ≈ damage_loop

The above code works well when the number of threads is less than 8. When the number of threads is 10 or larger, it runs slower. I noticed that CellListMap.jl uses the @spawn macro, and it works fine when the number of threads is larger than 10. I will look more into the parallel section of the documentation to figure out why.

Another thing that is somewhat strange to me is that both the @spawn version in CellListMap.jl and the @threads version above run slower than the single-threaded for-loop version when there is only 1 thread. I don't know much about the @spawn macro, but the @threads macro should be equivalent to a single-threaded for-loop when there is only 1 thread.

The tricky part is to extract the index of the pair in the lambda_pairs array

Yes, this is what I am confused about. In the past, I stored the lambda_pairs in a Vector so that it could be indexed with the id of the pair. As the id of the pair is not provided in map_pairwise!, I am a little bit confused.

The easiest alternative is to just store a matrix of size NxN and ignore half of it:

If you want to save half of the memory (if that is relevant to you at all; it may not be), you need to linear-index the list of pairs somehow. Considering that the storage includes the cases where i == j, which are not useful, you can do:

And these are really good suggestions. As there is a cutoff, this matrix should be sparse, I think.

lmiq commented 1 year ago

I see. What you did is mostly ok (there is a small caveat related to using Threads.threadid(), because the threads can in principle migrate), but most likely you won't experience that (see https://juliafolds.github.io/FLoops.jl/dev/explanation/faq/#faq-state-threadid and https://github.com/m3g/ChunkSplitters.jl).

However, the main point is that you do not need (as far as I understand) to compute the neighbor list at all. You can compute the damage and update lambda_pairs directly, without the intermediate step. Otherwise you will be paying the price of running over all pairs twice every time.

All threaded versions have some overhead, which should be small if the number of particles is large. For small numbers of particles, simple serial loops will certainly be faster than any other machinery.

Concerning the indexes: if you cannot store a list of lambdas with size equal to the number of all possible pairs (not considering the cutoff), then your situation is more complicated, because you need to find the index of the corresponding lambda in a list that is not necessarily ordered, nor even contains the pair in question. That will slow down these calculations a lot. If you can store a vector of lambdas for all pairs, go for it, no doubt.

Note that if you compute the neighbor list beforehand, you have a list of lambdas. In the next iteration, if you compute a new neighbor list for new positions, the list will not correspond anymore to the same lambda list. You need to store, for each lambda, the indexes of the particles involved, and find in that list the correct lambda for each pair of the new neighbor list. This is very costly.
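A sketch of that bookkeeping, in case it is ever needed (the Dict and the names are illustrative; note that a Dict is not safe for concurrent writes, so this part would have to be serial or protected by a lock):

λ_history = Dict{Tuple{Int,Int},Float64}()   # keyed by the particle pair, survives list rebuilds

pairkey(i, j) = i < j ? (i, j) : (j, i)      # canonical ordering of the pair

# when processing a pair (i, j) of the *new* neighbor list:
# λ_old = get(λ_history, pairkey(i, j), 0.0)
# λ_history[pairkey(i, j)] = max(λ_old, λ_new)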

maphdze commented 1 year ago

Thanks for your reply!

However, the main point is that you do not need (as far as I understand) to compute the neighbor list at all.

I don't quite understand it. As long as there is a cutoff, if I want to loop through all pairs, I have to build up a neighbor list before the computation starts.

You can compute the damage and update lambda_pairs directly, without the intermediate step.

I don't get this point either.

Otherwise you will be paying the price of running over all pairs twice every time.

For now I just run over all pairs once each time.

All threaded versions have some overhead, which should be small if the number of particles is large. For small numbers of particles, simple serial loops will certainly be faster than any other machinery.

Thanks for pointing it out. Currently I have a laptop with 32 GB of memory, and the largest number of particles I can run with it is 300,000, with a cutoff of 0.05.

Does this mean the number of particles is too small?

Concerning the indexes: if you cannot store a list of lambdas with size equal to the number of all possible pairs (not considering the cutoff), then your situation is more complicated, because you need to find the index of the corresponding lambda in a list that is not necessarily ordered, nor even contains the pair in question. That will slow down these calculations a lot. If you can store a vector of lambdas for all pairs, go for it, no doubt.

That's right. The time for finding the index of a pair can become large when there are many pairs.

Note that if you compute the neighbor list beforehand, you have a list of lambdas. In the next iteration, if you compute a new neighbor list for new positions, the list will not correspond anymore to the same lambda list. You need to store, for each lambda, the indexes of the particles involved, and find in that list the correct lambda for each pair of the new neighbor list. This is very costly.

You are right. Currently only small-deformation cases are considered, which means that the neighbor list is not updated during the simulation.

By the way, is there something I can do to improve the performance of the multi-threaded version when the number of threads is greater than 10?

lmiq commented 1 year ago

There are two possible strategies here:

  1. Compute the neighborlist in advance, and then run over the pairs of the list. This is what you've done in the threaded code above. It has two steps:
# 1
nb = neighborlist(points, cutoff)
# 2 
for pairs in nb
    # do stuff
end

Here you are essentially running over the pairs (within the cutoff) twice: first to compute the neighbor list, second to "compute stuff". If you have to compute many different properties from the same neighbor list, that's fine. This is also common when doing multiple time stepping, in which the neighbor list is computed up to a cutoff greater than the actual cutoff, and only updated once in a while in the "simulation". I'm not sure if that is your case.

  2. Compute stuff directly, without actually computing and storing the neighborlist.

This is what is mapped to the simple naive loop:

for i in 1:N
    for j in i+1:N
        # compute stuff for pair (i,j)
    end
end

Note that for doing this we don't need the explicit neighbor list, nor to store the (perhaps very large) list of neighbors. Your damage property can be computed like this. This is what is mapped, in CellListMap, to:

box = Box(limits(x), cutoff)
cl = CellList(x, box)
map_pairwise!(
    (x,y,i,j,d2,output) -> output, # compute stuff for pair (i,j), updating output
    output, box, cl,
)

Here we also do not build the explicit neighbor list.

Does this mean the number of particles is too small?

No, 300k is large. But you are presenting here a different problem, which is parallelizing only the computation for pairs, given the pair list.

One possible problem with a loop like

    Threads.@threads for i_pair in eachindex(point_pairs)
        i_thread::Int64 = Threads.threadid()

is that you are distributing the pairs over individual threads. This can be costly if the inner calculation is cheap. A better strategy is to split the workload into chunks, and launch a number of threads corresponding to the number of chunks. This is what the https://github.com/m3g/ChunkSplitters.jl package helps you to do. You would write that loop as:

using ChunkSplitters
...
nchunks = Threads.nthreads() # it can be larger, or smaller, and sometimes better performance can be obtained
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
    for i_pair in pair_range
         # do stuff for i_pair, identical as before
         ....
         # note that now damage_threaded is indexed by ichunk (and has to have length nchunks)
        damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
        damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
...

This is usually faster than simply threading the eachindex loop. It also solves the problem of indexing the damage_threaded array with threadid(), which can occasionally cause problems.

lmiq commented 1 year ago

I rewrote your loop as:

using ChunkSplitters
...
function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
    nchunks = Threads.nthreads()
    Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        for i_pair in pair_range
            i_point = point_pairs[i_pair][1]
            j_point = point_pairs[i_pair][2]
            d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
            λ = d_new - point_pairs[i_pair][3]
            if λ > pairhis[i_pair]
                pairhis[i_pair] = λ
            end
            pair_dam = 0.0
            if pairhis[i_pair] > λᵦ
                pair_dam = 1.0
            end
            damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
            damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
        end
    end
end

But that didn't bring any additional speedup (I get a ~2x speedup here relative to the serial version, with 12 threads).

It is possible that you are memory-limited.

I could get a small speedup by using the lower-level Polyester.@batch parallelization of the inner loop:

using Polyester
function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
    nchunks = Threads.nthreads()
    Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        Polyester.@batch for i_pair in pair_range
            i_point = point_pairs[i_pair][1]
            j_point = point_pairs[i_pair][2]
            d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
            λ = d_new - point_pairs[i_pair][3]
            pairhis[i_pair] = ifelse(λ > pairhis[i_pair], λ, pairhis[i_pair])
            pair_dam = ifelse(pairhis[i_pair] > λᵦ, 1.0, 0.0)
            damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
            damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
        end
    end
end

but it went from 60 to 55 ms, so I'm not sure if it is worth the trouble.

maphdze commented 1 year ago

I got it. The main reason I chose to loop through pairs is that there is a cutoff and the neighbor list will be used throughout the simulation, maybe thousands of times. Also, with the naive double for-loop there would be too much unnecessary searching in my case, because of the cutoff. With cell lists there is less unnecessary searching, but there is still some. But for a really big number of pairs, where memory is the main problem, what you mentioned should be applied.

No, 300k is large. But you are presenting here a different problem, which is parallelizing only the computation for pairs, given the pair list.

The same issue also arises when I use map_pairwise! from CellListMap.jl on my laptop. So I think parallelizing the computation over the pairs is not the main reason.

A better strategy is to split the workload into chunks, and launch a number of threads corresponding to the number of chunks. This is what the https://github.com/m3g/ChunkSplitters.jl package helps you to do. You would write that loop as:

Thanks for telling me that. I tried the version with ChunkSplitters.jl for 300k particles and a 0.05 cutoff with 10 threads. The computing time is 365.466 ms with ChunkSplitters.jl and 379.414 ms without ChunkSplitters. But when I tried it with 12 threads, it got slower, i.e., 406.017 ms with ChunkSplitters.jl and 392.637 ms without ChunkSplitters.

I tried the version with Polyester.@batch, but

damage_para ≈ damage_loop  ## it gives false

The hardware of my laptop is

Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 12 on 20 virtual cores
Environment:
  JULIA_PKG_SERVER = https://mirrors.ustc.edu.cn/julia

with a 32GB RAM.

Maybe it is due to the memory problem. I will try it on some workstations later.

Thanks again for your kind help! Actually you have helped me several times on the Julia Discourse :).

lmiq commented 1 year ago

I'm curious: if the neighbors are fixed, what do you compute thousands of times for the same coordinates?

In all cases you are probably bound by the speed of memory access in your computer.

maphdze commented 1 year ago

I'm curious: if the neighbors are fixed, what do you compute thousands of times for the same coordinates?

No, the points will move, so there is the Vector u_points, which denotes the displacements of the points. Actually, u_points is updated every iteration, but they are very small deformations compared with the cutoff. So we usually do not update the neighbor list.

In all cases you are probably bound by the speed of memory access in your computer.

Do you mean the memory frequency or the amount of memory? Currently the laptop has DDR5-4800 memory.

lmiq commented 1 year ago

very small deformations compared with the cutoff.

Be careful: generally in that case one builds the list using a cutoff larger than the actual cutoff, and tests against the true cutoff again for every pair. Otherwise you can have boundary problems.
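A minimal sketch of that pattern, reusing the names from the code above (the skin value here is illustrative): build the list with cutoff + skin, rebuild it only occasionally, and re-test the true cutoff for every pair.

cutoff = 0.05
skin = 0.01
point_pairs = neighborlist(points, cutoff + skin)   # rebuilt only once in a while

for (i, j, d) in point_pairs
    d_current = my_norm(points[i] + u_points[i] - points[j] - u_points[j])
    d_current > cutoff && continue   # this pair is currently outside the true cutoff
    # ... damage update as before ...
end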

Concerning the memory, it is just that the property you compute for each pair depends on getindex operations on many different arrays. This can be slower than everything else, and in that case there is not much to be done without changing the data structure a lot, in non-obvious ways.

maphdze commented 1 year ago

Thanks for your reply.

Otherwise you can have boundary problems.

That's right for MD and peridynamics, as in those models the pairwise force is calculated. In our model only the damage variable is calculated, and the force is calculated within continuum mechanics, so the surface effect will not occur.

This can be slower than everything else, and in that case there is not much to be done without changing the data structure a lot, in non-obvious ways.

Thanks for telling me this. I just tried the same code on an old workstation with 56 GB of memory and an Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz (16 threads). The same issue arises: when the number of threads is larger than 10, the performance is worse than with 8 threads. I guess it is the splitting of the tasks into different threads that costs more time? Maybe we can post it on Discourse to see if there are some improvements.

lmiq commented 1 year ago

For reference, this is a computation and its scaling (I have 6 cores / 12 threads), without any memory access (just summing up the squared distances):

julia> points = rand(SVector{2,Float64},100_000)
       box = Box(limits(points), 0.05)
       for nb in (1,2,4,6,8,10,12)
           cl = CellList(points,box; nbatches=(0,nb))
           @btime map_pairwise!((x,y,i,j,d2,u) -> u += d2, 0.0, $box, $cl)
       end
  112.736 ms (20 allocations: 1.72 KiB)
  57.675 ms (28 allocations: 2.91 KiB)
  31.245 ms (44 allocations: 5.27 KiB)
  26.422 ms (60 allocations: 7.62 KiB)
  23.321 ms (76 allocations: 9.98 KiB)
  19.235 ms (93 allocations: 12.67 KiB)
  17.088 ms (109 allocations: 15.03 KiB)

Thus, here, I get a ~6.5x speedup with 12 threads, which is ok given the fact that I have 6 real cores.

I was playing a little with your code. I think there is no escape from the fact that you have a hard problem in terms of memory access. Since the list of pairs is "random" (at each position of the list you can have the indexes of essentially any two particles), the accesses to all the properties of those particles (u_points, lambdas, points_vol, etc.) are all arbitrary, and fetching all that data from memory is costly.

One thing that may be worth trying is to pack all the point properties in a struct, like this:

struct PointProperties{T}
    coor::SVector{2,T}
    u::SVector{2,T}
    vol::T
end

and build an array of point properties.

Then the function would become something like this:

function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
    nchunks = Threads.nthreads()
    Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        @inbounds for i_pair in pair_range
            i_point = point_pairs[i_pair][1]
            j_point = point_pairs[i_pair][2]

            x = point_properties[i_point] # 1 memory access
            y = point_properties[j_point] # 1 memory access

            d_new = my_norm(x.coor + x.u - y.coor - y.u)

            λ = d_new - point_pairs[i_pair][3]
            if λ > pairhis[i_pair]
                pairhis[i_pair] = λ
            end

            pair_dam = pairhis[i_pair] > λᵦ
            damage_threaded[ichunk][i_point] += pair_dam * y.vol
            damage_threaded[ichunk][j_point] += pair_dam * x.vol
        end
    end
end

in which we have minimized the number of fetches from memory a bit. That kind of thing can help.

(You can of course ask on Discourse, I think you'll get similar answers).

lmiq commented 1 year ago

I tested the above idea, but even then the access to memory is the limiting factor:

[screenshot of the @profview profile]

As you can see there (or not... the image is not great), the profile says that getindex and setindex! operations take most of the time. These are accesses to memory. There is also a >, which is the test for lambda, and a -, which is the line λ = d_new - d.

You can get that by running

@profview cal_damage_chunks!(damage_para,damage_threaded,points,λᵦ,pairhis_para,point_pairs)

from within VSCode. The code is below.

If you need large speedups on that, you probably have to rethink how to access the data. Maybe just copy all data for each thread, if that is possible. Another (not easy) possibility is trying to port that calculation to the GPU. That may be feasible, although it will also require some major data structure rearrangements.

using CellListMap, ChunkSplitters, StaticArrays
struct Points{T}
    coor::SVector{2,T}
    u::SVector{2,T}
    vol::T
end
function thread_damage_chunks!(damage_threaded,points,λᵦ,pairhis,point_pairs)
    nchunks = length(damage_threaded)
    Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        @inbounds for i_pair in pair_range
            i_point, j_point, d = point_pairs[i_pair]
            ph = pairhis[i_pair]
            x = points[i_point]
            y = points[j_point]
            d_new = my_norm(x.coor + x.u - y.coor - y.u)
            λ = d_new - d
            if λ > ph
                ph = λ
                pairhis[i_pair] = λ
            end
            pair_dam = ph > λᵦ
            damage_threaded[ichunk][i_point] += pair_dam * y.vol
            damage_threaded[ichunk][j_point] += pair_dam * x.vol
        end
    end
end
function cal_damage_chunks!(damage,damage_threaded,points,λᵦ,pairhis,pairs)  
    thread_damage_chunks!(damage_threaded,points,λᵦ,pairhis,pairs)
    reduce!(damage,damage_threaded)
end
function run3()
    N = 100000
    ℓ = 0.05
    γ = 1.0e4
    λᵦ = 1.0e-6
    points = [ Points(rand(SVector{2,Float64}), 1e-4*rand(SVector{2,Float64}), rand()) for _ in 1:N ]
    coor = [p.coor for p in points]
    damage_para = zeros(Float64,N)
    damage_threaded = [deepcopy(damage_para) for i in 1:Threads.nthreads()]
    point_pairs = neighborlist(coor,ℓ)
    num_pairs = length(point_pairs)
    pairhis_para = zeros(Float64,num_pairs)
    nchunks = 1
    damage_threaded = [deepcopy(damage_para) for i in 1:nchunks]
    cal_damage_chunks!(damage_para,damage_threaded,points,λᵦ,pairhis_para,point_pairs)
end
maphdze commented 1 year ago

Thanks for your reply! I remember your post on Discourse about accelerating the pairwise L-J potential calculation; the solution seemed to be sorting the pairs first to optimize memory access, after which the performance improved. But later I found that it depends largely on the code: if you comment out one particular line, the sorting does not matter. I commented on that post but no one replied. Actually, I implemented what you said last night, but it was too late to post it here. Here is my implementation:

using CellListMap
using StaticArrays
using BenchmarkTools
using Base.Threads
using Revise
using ChunkSplitters
struct PointInf{T}
    coo::SVector{2,T}
    u::SVector{2,T}
    vol::T
end
@inline function my_norm(v::AbstractVector{T}) where {T}
    n = zero(T)
    @simd for x in v
        n += x ^ 2
    end
    n = sqrt(n)
    return n
end
@inline function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
    nchunks = Threads.nthreads()
    @fastmath @inbounds Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        for i_pair in pair_range
            i_point = point_pairs[i_pair][1]
            j_point = point_pairs[i_pair][2]
            d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
            λ = d_new - point_pairs[i_pair][3]
            if λ > pairhis[i_pair]
                pairhis[i_pair] = λ
            end
            pair_dam = 0.0
            if pairhis[i_pair] > λᵦ
                pair_dam = 1.0
            end
            damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
            damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
        end
    end
end
@inline function thread_damage_chunks!(damage_threaded,pointsinf,λᵦ,pairhis,point_pairs)
    nchunks = Threads.nthreads()
    @fastmath @inbounds Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
        for i_pair in pair_range
            i_point = point_pairs[i_pair][1]
            j_point = point_pairs[i_pair][2]
            x = pointsinf[i_point]
            y = pointsinf[j_point]
            d_new = my_norm(x.coo + x.u - y.coo - y.u)
            λ = d_new - point_pairs[i_pair][3]
            if λ > pairhis[i_pair]
                pairhis[i_pair] = λ
            end
            pair_dam = 0.0
            if pairhis[i_pair] > λᵦ
                pair_dam = 1.0
            end
            damage_threaded[ichunk][i_point] += pair_dam * y.vol
            damage_threaded[ichunk][j_point] += pair_dam * x.vol
        end
    end
end
function reduce!(output::T,output_threaded::Vector{T}) where {T <: AbstractArray}
    output .= 0.0
    for i_thread in eachindex(output_threaded)
        @. output += output_threaded[i_thread]
        output_threaded[i_thread] .= 0.0
    end
    return output
end
function cal_damage!(damage,damage_threaded,points,u_points,points_vol,λᵦ,pairhis,pairs)  
    thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,pairs)
    reduce!(damage,damage_threaded)
end
function cal_damage!(damage,damage_threaded,pointsinf,λᵦ,pairhis,pairs)  
    thread_damage_chunks!(damage_threaded,pointsinf,λᵦ,pairhis,pairs)
    reduce!(damage,damage_threaded)
end
@inline function loop_damage!(damage,pointsinf,point_pairs,pairhis_loop,λᵦ)
    damage .= 0.0
    pair_dam = 0.0
    i_pair = 0
    @fastmath @inbounds for pair in point_pairs
        i_pair += 1
        i_poi = pair[1]
        j_poi = pair[2]
        x = pointsinf[i_poi]
        y = pointsinf[j_poi]
        d_new = my_norm(x.coo + x.u - y.coo - y.u)
        λ = d_new - pair[3]
        if λ > pairhis_loop[i_pair]
            pairhis_loop[i_pair] = λ
        end
        pair_dam = 0.0
        if pairhis_loop[i_pair] > λᵦ
            pair_dam = 1.0
        end
        damage[i_poi] += pair_dam * y.vol
        damage[j_poi] += pair_dam * x.vol
    end
    return damage
end
@inline function loop_damage!(damage,points,u_points,points_vol,point_pairs,λᵦ,pairhis_loop)
    damage .= 0.0
    pair_dam = 0.0
    i_pair = 0
    @fastmath @inbounds for pair in point_pairs
        i_pair += 1
        d_new = my_norm(points[pair[1]] + u_points[pair[1]] - points[pair[2]] - u_points[pair[2]])
        λ = d_new - pair[3]
        if λ > pairhis_loop[i_pair]
            pairhis_loop[i_pair] = λ
        end
        pair_dam = 0.0
        if pairhis_loop[i_pair] > λᵦ
            pair_dam = 1.0
        end
        damage[pair[1]] += pair_dam * points_vol[pair[2]]
        damage[pair[2]] += pair_dam * points_vol[pair[1]]
    end
    return damage
end
N = 100000
ℓ = 0.05
γ = 1.0e4
λᵦ = 1.0e-6
points = rand(SVector{2,Float64},N)
box = Box(limits(points),ℓ)
cl = CellList(points,box)
u_points = 1e-4 * rand(SVector{2,Float64},N)
points_vol = rand(N)
pointsinf = [PointInf(points[i],u_points[i],points_vol[i]) for i = 1 : N]
damage_para = zeros(Float64,N)
damage_threaded = [deepcopy(damage_para) for i in 1:Threads.nthreads()]
damage_para_1 = zeros(Float64,N)
damage_threaded_1 = [deepcopy(damage_para_1) for i in 1:Threads.nthreads()]
damage_loop = zeros(Float64,N)
damage_loop_1 = zeros(Float64,N)
point_pairs = neighborlist(points,ℓ)
num_pairs = length(point_pairs)
pairhis_para = zeros(Float64,num_pairs)
pairhis_para_1 = zeros(Float64,num_pairs)
pairhis_loop = zeros(Float64,num_pairs)
pairhis_loop_1 = zeros(Float64,num_pairs)
@btime cal_damage!(damage_para,damage_threaded,points,u_points,points_vol,λᵦ,pairhis_para,point_pairs) ## para without memory access optimization
@btime cal_damage!(damage_para_1,damage_threaded_1,pointsinf,λᵦ,pairhis_para_1,point_pairs) ## para with memory access optimization
# @btime loop_damage!(damage_loop_1,points,u_points,points_vol,point_pairs,λᵦ,pairhis_loop_1) ## serial without access opt
# @btime loop_damage!(damage_loop,pointsinf,point_pairs,pairhis_loop,λᵦ) ## serial with memory access optimization 

After optimizing the memory access, there is a 10 ms improvement for the parallel version on my laptop, and a 2-3 ms improvement for the serial version. That really helps a lot.

One strange thing here is that, if you run the above code with julia -t 1, the result is

158.293 ms (6 allocations: 720 bytes)  # parallel code with 1 thread
107.003 ms (0 allocations: 0 bytes)   # serial code

Is this caused by the so-called "overhead"?

The other strange thing is that I supposed these lines would allocate new memory:

x = pointsinf[i_point]
y = pointsinf[j_point]

But actually they don't. Is it because the struct is too small?

If you need large speedups on that, you probably have to rethink how to access the data.

That's really a question for me. Yesterday you told me that random memory access slows down the program. So I also tried to modify the storage of the pairs, from a Vector of Tuples to a Vector of structs, like

mutable struct PairInf
    i_poi::Int64
    j_poi::Int64
    d::Float64
    his::Float64
end

But when I implemented it, the program got slower. Attached is my code:

struct PointInf{T}
    coo::SVector{2,T}
    u::SVector{2,T}
    vol::T
end
mutable struct PairInf
    i_poi::Int64
    j_poi::Int64
    d::Float64
    his::Float64
end
function push_pair!(i,j,d2,pairs,cutoff)
    d = sqrt(d2)
    if d < cutoff
        push!(pairs,PairInf(i,j,d,0.0))
    end
    return pairs
end
function reduce_pairs(pairs, pairs_threaded)
    for i in eachindex(pairs_threaded)
        append!(pairs, pairs_threaded[i])
    end
    return pairs
end
function my_neighborlist!(pairs,box,cl,cutoff)
    map_pairwise!((x,y,i,j,d2,pairs) -> push_pair!(i,j,d2,pairs,cutoff),pairs,box,cl,reduce=reduce_pairs,parallel = true)
end
@inline function my_norm(v::AbstractVector{T}) where {T}
    n = zero(T)
    @simd for x in v
        n += x ^ 2
    end
    n = sqrt(n)
    return n
end
@inline function thread_damage!(damage_threaded,pointsinf,λᵦ,pairsinf)
    @fastmath @inbounds Threads.@threads for pair in pairsinf
            i_thread = Threads.threadid()
            i_point = pair.i_poi
            j_point = pair.j_poi
            x = pointsinf[i_point]
            y = pointsinf[j_point]
            d_new = my_norm(x.coo + x.u - y.coo - y.u)
            λ = d_new - pair.d
            if λ > pair.his
                pair.his = λ
            end
            pair_dam = 0.0
            if pair.his > λᵦ
                pair_dam = 1.0
            end
            damage_threaded[i_thread][i_point] += pair_dam * y.vol
            damage_threaded[i_thread][j_point] += pair_dam * x.vol
    end
end
function cal_damage_2!(damage,damage_threaded,pointsinf,λᵦ,pairsinf)  
    # thread_damage!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,pairs)
    thread_damage!(damage_threaded,pointsinf,λᵦ,pairsinf)
    reduce!(damage,damage_threaded)
end

Generally it is 10 ms slower than just optimizing the access to the point properties. As there are actually fewer getindex operations in this version, I don't quite understand what happened. Maybe the access to the pairs is not so random?

Another strange thing is that usually if the if is replaced with ifelse there will be an improvement. But when I tried it, the program got slower too, in both the serial and parallel versions.

lmiq commented 1 year ago

Is this caused by the so-called "overhead"?

Well, hard to say. It may be the call to reduce!, which is not present in the serial version. A parallel version will always have some overhead, although ~50 ms seems too much.

The other strange thing is that I supposed these lines would allocate new memory

x = pointsinf[i_point]
y = pointsinf[j_point]

But actually they don't. Is it because the struct is too small?

No, they do not allocate, because the objects are immutable. "Allocation" means allocating a position in the heap memory, with an address. There you are only creating an object that lives in the stack memory, or in a register, or may even be optimized out. You would only allocate if the object were mutable and you were copying it explicitly (see, for instance: https://m3g.github.io/JuliaNotes.jl/stable/immutable/).
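A minimal way to check this, using a struct like the PointInf above (the function names are illustrative):

using StaticArrays

struct PointInf{T}
    coo::SVector{2,T}
    u::SVector{2,T}
    vol::T
end

pointsinf = [PointInf(rand(SVector{2,Float64}), rand(SVector{2,Float64}), rand()) for _ in 1:10]

read_point(p, i) = p[i].vol + p[i].coo[1]   # loads the fields of one element

function allocs_of_read(p)
    read_point(p, 1)                  # warm up / compile first
    return @allocated read_point(p, 1)
end

allocs_of_read(pointsinf)   # 0 bytes: the element lives on the stack or in registers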

Another strange thing is that usually if the if is replaced with ifelse there will be an improvement.

That happens if the loop is loop-vectorized (in the LoopVectorization.jl sense). But that can require, again, some restructuring of the code. I tried it and failed; maybe Chris Elrod finds a way around it. But that will be hard, because for that to work all memory accesses have to be aligned, and your update of damage_threaded, which depends on more or less random particle indexes, won't allow that to happen easily. (To tell the truth, I was never able to use LoopVectorization except for very simple repetitive tasks.)

But when I implemented it, the program got slower.

I also played with the idea, and the performance didn't improve either. I don't really have any other suggestion; I also tried a bunch of things and failed to accelerate that. You have correct code there, from most of what I can tell, and the performance bottleneck is already the speed of the getindex operations and the floating-point operations inside your loop, so there is no low-hanging fruit anymore.

If you need that to be much faster, or scale much better, you will probably have to find a way to parallelize it at a higher level, to depend less on data shared between threads, something like that. One trivial way to do that is, as I mentioned, to copy the data for each thread (like you do with the output), and let the threads work completely independently. That, of course, may be complicated in your case if you have to store the complete neighbor list.

lmiq commented 1 year ago

ps: I tried copying all the variables, but it didn't help either. I'm out of ideas; I don't know how to scale that better.

#run6
struct Points6{T}
    coor::SVector{2,T}
    u::SVector{2,T}
    vol::T
end
function thread_damage_chunks6!(damage_threaded,points_threaded,λᵦ,pairhis_threaded,point_pairs_threaded)
    nchunks = length(damage_threaded)
    Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs_threaded[begin]), nchunks)
        @inbounds for i_pair in pair_range
            i_point, j_point, d = point_pairs_threaded[ichunk][i_pair]
            ph = pairhis_threaded[ichunk][i_pair]
            x = points_threaded[ichunk][i_point]
            y = points_threaded[ichunk][j_point]
            d_new = my_norm(x.coor + x.u - y.coor - y.u)
            λ = d_new - d
            if λ > ph
                ph = λ
                pairhis_threaded[ichunk][i_pair] = λ
            end
            pair_dam = ph > λᵦ
            damage_threaded[ichunk][i_point] += pair_dam * y.vol
            damage_threaded[ichunk][j_point] += pair_dam * x.vol
        end
    end
end
function cal_damage_chunks6!(damage,damage_threaded,points_threaded,λᵦ,pairhis_threaded,pairs_threaded)  
    thread_damage_chunks6!(damage_threaded,points_threaded,λᵦ,pairhis_threaded,pairs_threaded)
    reduce!(damage,damage_threaded)
    return nothing
end
function run6(;nchunks=Threads.nthreads())
    N = 100_000
    ℓ = 0.05
    γ = 1.0e4
    λᵦ = 1.0e-6
    points = [ Points6(rand(SVector{2,Float64}), 1e-4*rand(SVector{2,Float64}), rand()) for _ in 1:N ]
    coor = [p.coor for p in points]
    damage_para = zeros(Float64,N)

    point_pairs = neighborlist(coor,ℓ)
    num_pairs = length(point_pairs)
    pairhis_para = zeros(Float64,num_pairs)

    damage_threaded = [deepcopy(damage_para) for _ in 1:nchunks]
    points_threaded = [copy(points) for _ in 1:nchunks]
    pairhis_para_threaded = [ copy(pairhis_para) for _ in 1:nchunks ]
    point_pairs_threaded = [ copy(point_pairs) for _ in 1:nchunks ]

#    cal_damage_chunks6!(damage_para,damage_threaded,points_threaded,λᵦ,pairhis_para_threaded,point_pairs_threaded)
    @btime cal_damage_chunks6!($damage_para,$damage_threaded,$points_threaded,$λᵦ,$pairhis_para_threaded,$point_pairs_threaded)
end
maphdze commented 1 year ago

Thanks for your reply and kind help! Your notes really explain something I used to be confused about. When I learned that memory allocation is sometimes the bottleneck for performance, I tried to allocate every variable beforehand, like in Fortran, even scalars. But sometimes it did not work and it made the code ugly. Then one day I found that if you define a new scalar inside your function, it does not allocate memory. I was really confused, because I was a Fortran user before. Now I am starting to understand the reason.

To tell the truth, I was never able to use LoopVectorization except for very simple repetitive tasks

The same situation for me! As soon as there is an if, <, or > in the loop, LoopVectorization just stops working.

I also tried a bunch of things and failed to accelerate that. You have correct code there, from most of what I can tell,

A very big thanks to you. I really learned something from the communication with you.

If you need that to be much faster, or scale much better

Not too much, haha. I used to think the speedup was related to the number of threads, so I was confused that I have 20 threads but only a 6-7x speedup. Today I learned that the speedup relates just to the number of physical cores, not the threads. As I have only 8 big physical cores (plus 4 small cores, which do not help here), I think the speedup is satisfactory :).

A parallel version will always have some overhead, although ~50 ms seems too much.

That is what confuses me now. I will post it on Discourse later.

Anyway, thanks a loooooot for your help and discussions.

lmiq commented 1 year ago

Today I learned that the speedup relates just to the number of physical cores, not the threads.

Sort of. Most computers have N cores and 2N threads, because of hyperthreading. But hyperthreading is a trick to take advantage of idle processing capacity of the cores. But if your calculation already fills up the cores at their maximum, hyperthreading is not useful at all. That's very much dependent on each application and the load on each core. For codes that saturate the cores with floating-point operations, hyperthreading usually does not help.

maphdze commented 1 year ago

But if your calculation already fills up the cores at their maximum, hyperthreading is not useful at all. That's very much dependent on each application and the load on each core. For codes that saturate the cores with floating-point operations, hyperthreading usually does not help.

Yes, I think it is the reason why the code performs worse with 10 threads than with 8 threads.