Difference in pointings computed with serial and parallel runs

sgiardie commented 12 months ago

I was testing if there was any difference in running a code using litebird_sim serially or in a parallel way. Using exactly the same code (which was producing a map for 1 year of observation), I noticed a difference in the maps produced at the end. This is the difference between the temperature map in Kelvin produced with the serial run and the same map produced with a parallel run with 8 ranks (on the left) and 4 ranks (on the right). You can notice a ~10^(-12) difference especially on the galactic plane, which corresponds to a ~10^(-6) difference in muK. The difference increases when increasing the number of ranks. Screenshot from 2023-09-08 13-44-26

I tried to see if this difference in the maps was due to a difference in the pointings produced in the parallel and serial case. I run a more simple code with a lower observation time (10 days) in the serial case and in the parallel one (with just two ranks) and compared the corresponding pointings. These are the differences between the colatitude, longitude and polarization angles (the three pointing fields) for the same detector:

Screenshot from 2023-09-12 15-07-02

Screenshot from 2023-09-12 15-02-43

Screenshot from 2023-09-12 15-04-11

You can see that the result of the rank 0 is always the same as the serial result for the corresponding time chunk, while there are differences for the following rank.

I asked @marcobortolami to test this too. We noticed that the difference in the pointing is bigger if one initializes the simulation using start_time=astropy.time.Time(date) instead of start_time=0 (I was using astropy.time in my tests). We also saw that the difference decreases when computing the pointing with double precision instead of single precision (the default), and that the difference in the pointing increases for the ranks corresponding to successive time chunks. @marcobortolami can put an example code and plots showing this.

We guess this is due to an (inevitable?) error in the start time of each time chunk. Is there a way to reduce this error?

mreineck commented 12 months ago

OK ... the huge jumps in phi and polarization angle are wraparounds between 0 and 2pi, so the actual jitter will most likely be much, much smaller. So the errors in theta are probably the best indication of how much the pointings jitter.

The interpretation that this is caused by inaccuracies when computing times sounds reasonable. I don't know how times are added, subtracted etc. in the code, but it's quite possible that there are some conversions going on inside astropy which degrade the accuracy of the results.

Could you please show a plot of the colatitude error when using double precision pointings? That might give an indication how bad the problem really is.

mreineck commented 12 months ago

It might also be helpful to look at the statistics of the deviations: e.g. which fraction of the pointings is actually different between serial and parallel cases. Despite the quite busy plots above, it could still be a very small fraction.

marcobortolami commented 12 months ago

Hi. I confirm what @sgiardie is saying and I put here my code that I run with 1 (serial), 2, 4 and 8 mpi tasks:

import litebird_sim as lbs
import numpy as np
import healpy as hp
import matplotlib.pylab as plt
from astropy.time import Time
import time
import pathlib
import scipy
import pickle
import os
import sys

#for parallelization
comm = lbs.MPI_COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

#initializing the IMO
imo = lbs.Imo()

#initializing the simulation
sim = lbs.Simulation(
    base_path='/marconi_scratch/userexternal/mbortola/litebird/test_lbs/test_pointing/',
    start_time=0.,#Time("2029-01-01T00:00:00"),
    duration_s=1e6,
    mpi_comm=comm,
    random_seed=12345,
)

imo_version = 'v1.3'
telescope   = 'LFT'
base_path   = sim.base_path
start_time  = sim.start_time
duration_s  = sim.duration_s
channels    = ['L2-050']
detnames    = ['000_000_000_QA_050_T']

ndet = np.size(detnames)

comm.barrier()

#loading the instrument metadata
inst_info = sim.imo.query("/releases/"+imo_version+"/satellite/"+telescope+"/instrument_info")

#generating the quaternions of the instrument 
sim.generate_spin2ecl_quaternions(
    imo_url="/releases/"+imo_version+"/satellite/scanning_parameters/",
    delta_time_s=60.,
)

#loading instrument info
inst = lbs.InstrumentInfo(
    name=telescope,
    boresight_rotangle_rad=np.deg2rad(inst_info.metadata["boresight_rotangle_deg"]),
    spin_boresight_angle_rad=np.deg2rad(inst_info.metadata["spin_boresight_angle_deg"]),
    spin_rotangle_rad=np.deg2rad(inst_info.metadata["spin_rotangle_deg"]),
)

#loading channel info #marco : should ch_info have the same lenght as dets?
ch_info = lbs.FreqChannelInfo.from_imo(url="/releases/"+imo_version+"/satellite/"+telescope+"/"+channels[0]+"/channel_info",
                                       imo=imo)

comm.barrier()

dets = []

for i_det in range(ndet):
    #get info from IMO
    det = lbs.DetectorInfo.from_imo(
        url="/releases/"+imo_version+"/satellite/"+telescope+"/"+channels[i_det]+"/"+detnames[i_det]+"/detector_info",
        imo=imo,
    )

    dets.append(det)

comm.barrier()

#create Observation object
(obs,) = sim.create_observations(
    detectors=dets,
    n_blocks_det=1,
    n_blocks_time=size,
    split_list_over_processes=False,
)

#hwp specification
hwp_radpsec = inst_info.metadata["hwp_rpm"]*2*np.pi/60

comm.barrier()

#get pointings and store them in obs
quaternion_buffer = np.zeros((obs.n_samples, 1, 4))

pointings = lbs.pointings.get_pointings(
    obs,
    spin2ecliptic_quats=sim.spin2ecliptic_quats,
    bore2spin_quat=inst.bore2spin_quat,
    hwp=lbs.IdealHWP(hwp_radpsec),   #applies hwp rotation angle to the polarization angle
    quaternion_buffer=quaternion_buffer,
    store_pointings_in_obs=False, #if True, stores colatitude and longitude in obs.pointings,
                                  #and the polarization angle in obs.psi
    dtype_pointing=np.float64,
)

del quaternion_buffer

np.save(
    base_path.absolute().as_posix()+'/pointings_size'+str(size)+'_rank'+str(rank)+'.npy',
    pointings,
)

Below some plots for the colatitude $\theta$ with pointings in double precision and without astropy.

Plot of the colatitude for different tasks (blue 1, orange 2, green 4, red 8) vs number of sample. You can see only red as they agree theta

Plot of first half of samples with 1 task minus first half of samples with 2 tasks (all zero)

Plot of second half of samples with 1 task minus second half of samples with 2 tasks (now you see the difference)

Plot of first 1/8 of samples with 1 task minus first 1/8 of samples with 8 tasks (all zero)

Plot of second 1/8 of samples with 1 task minus second 1/8 of samples with 8 tasks (now you see the difference)

Plot of last 1/8 of samples with 1 task minus last 1/8 of samples with 8 tasks (the difference is larger)

marcobortolami commented 12 months ago

With astropy instead and again in double precision (for the first bunch of data the difference is always zero).

Plot of second half of samples with 1 task minus second half of samples with 2 tasks astropy1

Plot of second 1/8 of samples with 1 task minus second 1/8 of samples with 8 tasks astropy2

Plot of last 1/8 of samples with 1 task minus last 1/8 of samples with 8 tasks (bigger difference) astropy3

sgiardie commented 12 months ago

It might also be helpful to look at the statistics of the deviations: e.g. which fraction of the pointings is actually different between serial and parallel cases. Despite the quite busy plots above, it could still be a very small fraction.

This is the plot (colatitude(parallel) - colatitude(serial)) / colatitude(serial) in single precision Screenshot from 2023-09-13 16-55-01

Basically the same plot for double precision Screenshot from 2023-09-13 17-00-45

The same plot as in my first comment, just in double precision, shows a smaller absolute difference (very crowded plot again) Screenshot from 2023-09-13 16-58-26

mreineck commented 12 months ago

Judging from the plots, this looks very much like an issue with dynamic range of double precision numbers...

Let's assume double precision pointings for the moment,and let's also assume that time is stored in seconds.

If time starts at 0, and mission duration is 1e6 seconds, then the smallest time interval that can be represented by a double precision number is about 1e-10 seconds near the end of the simulated time span. So absolute times may vary (due to numerical precision) by about that much if they are computed in different (but mathematically equivalent) ways, That makes pointing errors around 1e-12 look reasonable.

Now, astropy probably stores times as something like "seconds since Jan 1, 1970", so the simulation time starts at something in the 1e9 range. In this case the smallest representable time interval is around 1e-5 seconds, pushing up the pointing errors to 1e-7.

So the errors come from the dilemma of

storing time in form of huge numbers, but still
trying to represent tiny time differences accurately

marcobortolami commented 11 months ago

I think we understood where the difference was coming from, thanks @mreineck. Should we close this or more discussions / tests are needed?

mreineck commented 11 months ago

It depends on whether the accuracy you get for the actual mission times is "good enough". If not, perhaps we can convince astropy to use something like np.float128 for times internally. But I'm not sure if that can be done.

sgiardie commented 10 months ago

I computed a map for 3 yrs of observation, nside = 256 (the maps in the first comment have nside=64), one in serial mode and one using 2 tasks. I made the difference of the maps (in muK) and it is pretty similar to the difference shown in the maps at the beginning (they were computed for 1 yr of observation and are shown in K). Screenshot from 2023-10-23 11-43-05 I guess a difference of 1e-5 muK on the galactic plane is not a huge concern if we are going to mask anyway. With a 70% mask it seems to go down to 1e-6 muK What do you think?

ziotom78 commented 10 months ago

What do you think?

Thanks all for the detailed analysis and the nice plots! My opinion is that pointing inaccuracies of 10⁻⁶ should not be too worrisome, as the actual instrument will have far larger pointing errors…

Of course, these differences might bite us back once we compare results of different simulations made with different MPI distributions, but unless we find a trivial fix for this, I am not sure it's worth spending too much time to find a solution.

marcobortolami commented 10 months ago

Understood. How do we proceed? We close this Issue as "not planned" or "completed"?

ziotom78 commented 10 months ago

Mmm, I believe the best tag for this would be “WONTFIX” if it's available: we acknowledge that the discrepancy is real, but we accept to live with it.

litebird / litebird_sim

Difference in pointings computed with serial and parallel runs #268