pace-neutrons / Horace

Horace is a suite of programs for the visualization and analysis of large datasets from time-of-flight neutron inelastic scattering spectrometers.
https://pace-neutrons.github.io/Horace/stable/
GNU General Public License v3.0

User reports symmetrisation does not work on gen_sqw parallel execution #287

Closed: abuts closed 4 years ago

abuts commented 4 years ago

Dear Alex,

I believe the parallel running of gen_sqw with @transform_sqw is broken, at least for my account, but it may also be broken for everyone.

I have tested gen_sqw with @transform_sqw with a Horace internal function (so no issues with paths etc.)

gen_sqw(spefile, par_file, sym_name(sqw_file,n), efix, emode, alatt, angdeg,...
    u, v, psi, omega, dpsi, gl, gs, [50,50,50,50], urange_in, 'transform_sqw', @(x)(symmetrise_sqw(x,[1/2 -1 0],[0 0 1],[0 0 0])), 'tmp_only', true); % make the sqw file

and the result is the same as when we use our own symmetrisation routine: the code works correctly in sequential mode but not in parallel mode. In the figures below, Figure 1 (sequential run) shows all pixels with y<0 mapped to y>0, whereas Figure 2 (parallel run) shows no symmetrisation at all. I tested this on both isiscompute and IDAaaS, with the same behaviour. It looks like the @transform_sqw option is simply ignored under parallel running, so this needs to be fixed first; then I think our code will probably run ok. This functionality broke recently, as it worked for us a few weeks ago.

Can you run the code below from your account and see whether you get Figure 1 (correctly symmetrised) or Figure 2 (not symmetrised)? That is, is this a problem with my account, or a fundamental issue with parallel operation that affects everyone?

Radu

[image: Figure 1, sequential run]

[image: Figure 2, parallel run]

cd /instrument/MERLIN/RBNumber/RB1820500/

addpath('/instrument/LET/RBNumber/RB1810670/mslice');

addpath('/instrument/LET/RBNumber/RB1810670/mslice/horace_addons');

MakeSQWfile_CoTiO3_18meV_feb19(7,[],'_test',[1 250]);

ext='_feb19_test_sym7';[s,s1]=cut_sqw2(sprintf('%s%s%s','/instrument/MERLIN/RBNumber/RB1820500/CoTiO3_18meV',ext,'.sqw'),struct('u',[0.5,-1,0],'v',[1,0,0],'uoffset',[0,0,0,0],'type','ppr'),[-3.3,0.0075,3.3],[-1,0.0075,3.3],[-5,5],[8.25-5,8.75+5]); cax=[10 600];plot_slice(smooth_slice(s,1),cax(1),cax(2),'flat',ms_get_colormap(1),'linear');brillouin_zones_tr(1);drawnow;

From: Radu Coldea
Sent: 11 July 2020 14:32
To: Alex Buts <Alex.Buts@stfc.ac.uk>
Subject: RE: problem running gen_sqw with @transform_sqw for symmetrisation

Dear Alex,

I have already tried option 3, running the code serially in the most up-to-date version on isiscompute. It took 4.2 hours and the final sqw file is correctly symmetrised; as far as I can tell, gen_sqw works correctly in sequential operation. It is the parallel part that does not work. What is surprising is that no errors come up when running in the most up-to-date version; the symmetrisation bit is simply ignored.

I suspect, as you inferred, that the issue is with the workers or paths. In the gen_sqw code the @transform_sqw refers to the function symmetrisation_routine, which is located in the experimental folder /instrument/MERLIN/RBNumber/RB1820500, so when I start the MakeSQW job from this folder the function is there. My question is: when individual jobs are passed to workers, in which folders do those processes start, the same folder as the parent or a different one, and will the function symmetrisation_routine be known to them?
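One way to answer the question of where a worker starts, and whether it can resolve the function, is to have the worker record its own environment. The lines below are a minimal sketch using only built-in MATLAB calls; the log file name is hypothetical, and on a cluster a folder shared between nodes may be a better choice than `tempdir`. Run from the worker side (for instance from a small wrapper around the transform), they would show the worker's folder and whether the function is visible there. If the wrapper is never called, no line is written, which is itself informative.

```matlab
% Probe intended to run inside a worker process.
fname   = 'symmetrisation_routine';                   % function name from the thread
logfile = fullfile(tempdir, 'transform_probe.log');   % hypothetical log location

fid = fopen(logfile, 'a');                            % append, so several workers can add lines
if fid > 0
    % exist(...,'file') returns 2 when a matching file is on the path, 0 when nothing is found
    fprintf(fid, 'cwd=%s  %s visible=%d\n', pwd, fname, exist(fname, 'file'));
    fclose(fid);
end
```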

When NoMachine access was stopped a few months ago and we ported everything to be able to work on IDAaaS, we edited symmetrisation_routine to add the correct paths

addpath('/instrument/LET/RBNumber/RB1810670/mslice');

addpath('/instrument/LET/RBNumber/RB1810670/mslice/horace_addons');

addpath('/instrument/MERLIN/RBNumber/RB1820500/');

and the parallel code then worked fine on both isiscompute and IDAaaS with the “default” workers. As you advised in the past, I did not want to make any changes to the default worker files, but to make any changes required in our own code. We then tested everything on both IDAaaS and isiscompute and packaging/symmetrisation all worked fine. The same worked fine a few weeks ago, but when I tried it yesterday the symmetrisation part did not work in parallel mode.

I suspect the issue with parallel processing using gen_sqw with custom transform_sqw functions is not just related to us but may be a more general issue that could be affecting others as well. If possible it would be best to try and solve this using the default worker files, i.e. without interfering with those. Can you run gen_sqw in parallel mode ok? Could you test the @transform_sqw option with any custom operation that is in a function that is not part of the default Horace package? It could be anything, for example setting all intensities to a constant. Can you get that simple case working for one single nxspe file, or two packaged nxspe files, in your account? Keep that file first in your default starting folder and see if that works, then move to a different folder and get things working there, or identify why it is not working. If this simple packaging works for you, we will try to replicate what you did in our case; if it does not work for you, then it is a deeper problem that will affect everyone who wants to use custom functions.
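The constant-intensity test has the advantage that its effect is unambiguous: either every value equals the constant or the transform did not run. The sketch below illustrates the idea on a plain struct with a hypothetical `signal` field. It is only a stand-in; a real Horace sqw object stores its data differently, so the field access would need adapting before passing such a function to 'transform_sqw'.

```matlab
% Stand-in for a dataset: a plain struct, NOT a real Horace sqw object.
w = struct('signal', [1.5 0 3.2 7]);                 % hypothetical field name

% Trivial custom transform: set every intensity to one constant.
const_value  = 1;
set_constant = @(obj) setfield(obj, 'signal', const_value * ones(size(obj.signal)));

% Check that reveals whether the transform ran.
was_applied  = @(obj) all(obj.signal(:) == const_value);

before = was_applied(w);                 % false for the untouched data
after  = was_applied(set_constant(w));   % true once the transform has run
```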

Radu

From: Alex Buts <Alex.Buts@stfc.ac.uk>
Sent: 11 July 2020 13:00
To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
Subject: Re: problem running gen_sqw with @transform_sqw for symmetrisation

Dear Radu,

That's the issue with parallel jobs. My main task during the last 3 months has been to address this issue. I cannot say I have fully resolved it, but I have made significant progress.

It is fully separate from symmetrisation, as parallel execution just gives a boost in performance.

The changes to the file system seem to have broken the old parallelisation. That's known, but I do not remember what works in old Horace and what does not. We need to investigate.

Three steps to check that:

1) It seems I made a typo yesterday when enabling old Horace for you. I remember you using your own worker_v1, which enables your custom code -- this one needs to be modified.

Modify your worker_v1 to work as the standard one and try to run your job again. If the job has not reported a change in status within 5 min (or the time needed to symmetrise a single nxspe file), it is broken.

2) Changing the parallel framework -- this may work.

After your worker is modified, try

pc = parallel_config;

pc.parallel_framework = 'parpool';

Start gen_sqw; if the job does not change its status within 5 min, it is dead. The fix is in the new code, which is incompatible with old Horace.

3) Run the symmetrisation code serially. This would allow you to see whether old Horace produces correct symmetrisation and the new one does not. I have no idea what has changed there -- nothing should have.

But at least I will know that there is a problem in Horace to address.

Regards,

Alex

On 11/07/2020 09:31, Radu Coldea wrote:

Dear Alex,

The MakeSQW job I started after old_horace_init is still in the same state after many hours, it has not progressed beyond the “state unknown” stage see below, I suspect some error was encountered and the process is stalled, could you see the output of the workers to help identify the problem?   

Radu

+++++++++++++++++++++++++

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.......

From: Radu Coldea
Sent: 11 July 2020 01:45
To: Alex Buts <Alex.Buts@stfc.ac.uk>
Subject: RE: problem running gen_sqw with @transform_sqw for symmetrisation

Dear Alex,

I started a new matlab from a different folder so that it does not execute my startup file, used cd ~ ; old_horace_init, then updated the path to use my mslice code etc. and started the process for 2 files. It takes a very long time; I suspect it is not actually doing anything, as it should complete 2 files pretty quickly, see below. Am I doing something wrong?

MakeSQWfile_CoTiO3_18meV_feb19(7,[],[],[1 250]);

(empty) temporay working directory already exists

/instrument/MERLIN/RBNumber/RB1820500/pid929269

start preparing 2 .tmp files between 16 worker(s)

:herbert configured: *** Starting Herbert (poor-man-MPI) cluster with 2 workers ***

*** Herbert cluster started                                 ***

**** starting parallel job: gen_sqw_CoTiO3_1_CGUAXSMOFL

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

.............................................

***Job : gen_sqw_CoTiO3_1_CGUAXSMOFL : state: unknown |

In contrast, with the current version the commands run very quickly, but the end file is not symmetrized. Is there any way of getting hold of the output of the various workers, to see if there are any errors? Also, if symmetrisation is performed there should be messages like

calling symmetrisation_routine with flag=7

Transformed coordinates for 67352/15512400 (0.434182%) pixels with (k<0)&((h+k)<0)

via rotation by 120 deg // [0 0 1] in reciprocal (a*,b*,c*) axes

equivalent (h,k,l) transformation matrix =[-1 -1 0;1 0 0;0 0 1]

Transformed coordinates for 4107146/15512400 (26.4765%) pixels with (k>0)&(h<0)

via rotation by -120 deg // [0 0 1] in reciprocal (a*,b*,c*) axes

equivalent (h,k,l) transformation matrix =[0 1 0;-1 -1 0;0 0 1]

Transformed coordinates for 5824880/15512400 (37.5498%) pixels with (h>0)&(k>0)&(l>0)

via (h,k,l) transformation matrix =[1 1 -0;-1 -0 -0;-0 -0 -1]

equivalent to inversion + rotation by 120 deg // [0 0 1] in reciprocal (a*,b*,c*) axes

Transformed coordinates for 1956880/15512400 (12.6149%) pixels with ((h+k)>0)&(k<0)&(l>0)

via (h,k,l) transformation matrix =[-0 -1 -0;1 1 -0;-0 -0 -1]

equivalent to inversion + rotation by 120 deg // [0 0 -1] in reciprocal (a*,b*,c*) axes  
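The matrices printed in these messages can be sanity-checked independently of Horace. The following is a minimal sketch that assumes only the four (h,k,l) matrices quoted above, with the `-0` entries written as 0; the variable names are illustrative. A threefold rotation should cube to the identity, the +120 and -120 degree rotations should be mutual inverses, and each inversion-combined operation should be the negative of the corresponding rotation.

```matlab
% Matrices copied from the symmetrisation messages above (variable names are illustrative).
R_plus  = [-1 -1 0;  1  0 0; 0 0  1];   % rotation by 120 deg about [0 0 1]
R_minus = [ 0  1 0; -1 -1 0; 0 0  1];   % rotation by -120 deg about [0 0 1]
S1      = [ 1  1 0; -1  0 0; 0 0 -1];   % inversion + rotation, third operation listed
S2      = [ 0 -1 0;  1  1 0; 0 0 -1];   % inversion + rotation, fourth operation listed

% The entries are small integers, so exact comparison is safe here.
is_threefold = isequal(R_plus^3, eye(3)) && isequal(R_minus^3, eye(3));
are_inverses = isequal(R_plus * R_minus, eye(3));
are_improper = isequal(S1, -R_plus) && isequal(S2, -R_minus);
```

All three flags being true indicates that the four operations are mutually consistent; it does not by itself show that they were applied to the data.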

Radu  

From: Alex Buts <Alex.Buts@stfc.ac.uk>
Sent: 11 July 2020 01:15
To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
Subject: Re: problem running gen_sqw with @transform_sqw for symmetrisation

Dear Radu,

The last changes to the Horace you are using on isiscompute by default are from May 26.

Nobody touched symmetrisation, so all should be fine, but who knows. For your use and as a reference, there is very old code installed on ISISCOMPUTE (from January 13, 2020).

I've placed the code which initializes this Horace in your root folder (in Matlab: cd ~ ; old_horace_init). Please try this one.

If there is a difference, we should dig deeper, though I have no idea where it could be. If not -- something has changed on your side.

Regards,

Alex

On 10/07/2020 18:41, Radu Coldea wrote:

    Dear Alex,

    Has anything changed recently with the way gen_sqw calls transform_sqw functions under parallel mode? For some reason our MakeSQW code with symmetrisation does not symmetrise anymore, the code runs with no error messages but no symmetrisation operations are performed whatsoever on the files, with exactly the same problem on IDAaaS and isiscompute. I am sure this worked fine a few weeks ago. For example:

    cd /instrument/MERLIN/RBNumber/RB1820500

    MakeSQWfile_CoTiO3_18meV_feb19(7);

    should run the sqw packaging code and execute the  symmetrisation_routine(w,7) on every nxspe data set, but the final sqw data set is not symmetrised, see below, we should get Figure 1, instead we get Figure 2, exactly the same as for the completely bare, unsymmetrised data set in Figure 3. Is there some incompatibility problem with the calling syntax or the paths?

    Sequential running without hpc on seems to run through the symmetrisation ok (I tested for 2 files but it would take forever to do the full 250 files), so the problem is with the running of the code in parallel mode. For a quick test of packaging 2 files only, you can run the code below from the above location once you have executed my startup file, Figure 4 is with the correct (sequential) run, Figure 5 is with the parallel run when no symmetrisation/folding occurs.

    MakeSQWfile_CoTiO3_18meV_feb19(7,[],[],[1 250]);

    ext='_feb19_sym7';[s,s1]=cut_sqw2(sprintf('%s%s%s','/instrument/MERLIN/RBNumber/RB1820500/CoTiO3_18meV',ext,'.sqw'),struct('u',[0.5,-1,0],'v',[1,0,0],'uoffset',[0,0,0,0],'type','ppr'),[-3.3,0.0075,3.3],[-1,0.0075,3.3],[-5,5],[8.25-5,8.75+5]); cax=[10 600];plot_slice(smooth_slice(s,1),cax(1),cax(2),'flat',ms_get_colormap(1),'linear');brillouin_zones_tr(1);drawnow;

    Radu  

    From: Alex Buts <Alex.Buts@stfc.ac.uk>
    Sent: 09 July 2020 23:32
    To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
    Subject: Re: hpc does not work on isiscompute

    You had your own versions of the parallel_config.m and gen_sqw.m files in your startup folder.

    As such, they override any standard versions present in Herbert and Horace and do not work.

    I have renamed them to parallel_config.bak and gen_sqw.bak and it works now. At least horace_on() does;

    I did not try startup itself (it is my modified startup that you are sending to me -- I've just commented ISIS mslice out of it).

    hpc on works too.
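Shadowing of this kind can be spotted by asking MATLAB which file each name resolves to. A minimal sketch, assuming only the two file names mentioned in the message above; `which` is a built-in, and the `report` variable is illustrative:

```matlab
% Names taken from the message above; extend the list as needed.
names  = {'gen_sqw', 'parallel_config'};
report = cell(size(names));
for i = 1:numel(names)
    resolved = which(names{i});        % empty if the name is not on the path
    if isempty(resolved)
        resolved = '(not found)';
    end
    report{i} = sprintf('%s -> %s', names{i}, resolved);
    disp(report{i});
end
```

A path pointing into a home or startup folder rather than the Herbert or Horace installation would suggest a local copy is taking precedence. In MATLAB, `which(name, '-all')` additionally lists every copy on the path, including shadowed ones.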

    Regards,

    Alex

    On 09/07/2020 23:15, Radu Coldea wrote:

        This is the file /home/rc67/startup.m

        if 1<0

            mslice_on();

            herbert_on();

            horace_on();

            %addpath('/usr/local/mprogs/Mslice')

            %mslice_init

        else

            %horace_on;

            %herbert_on;

        end

        format short g;

        local='/instrument/LET/RBNumber/RB1810670/';

        addpath([local 'mfit4']);

        addpath([local 'mview4']);

        addpath([local 'load']);

        addpath([local 'nllsq']);

        addpath([local 'funcs']);

        addpath([local 'mslice']);

        addpath([local 'mslice/horace_addons']);

        addpath([local 'spinwave']);

        %addpath([local 'spinw']);

        set(0,'DefaultAxesFontSize',12);

        set(0,'DefaultAxesFontName','times');

        set(0,'DefaultFigurePaperType','a4');

        fprintf('Startup directory is:\n%s%s\n',pwd,filesep);

        %###SW_UPDATE

        % Path to the SpinW installation

        %addpath(genpath('/instrument/LET/RBNumber/RB1810670/spinw'));

        %###SW_UPDATE

        I am not initializing any isis mslice code and

        which mslice gives

        the correct path to my code /instrument/LET/RBNumber/RB1810670/mslice/mslice.m

        Radu

        From: Alex Buts <Alex.Buts@stfc.ac.uk>
        Sent: 09 July 2020 23:07
        To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
        Subject: Re: hpc does not work on isiscompute

        Yes, I've edited this one.

        You are still initializing the ISIS version of Mslice in startup. It is probably better to remove it too, to avoid any conflicts with Horace and your code.

        Let me see what I can do -- it works for me, but if I log in as you through a terminal, it indeed fails.

        Something is wrong with your config -- I need to play with it a bit more.

        Regards,

        Alex

        On 09/07/2020 23:00, Radu Coldea wrote:

            Dear Alex,

            I started a new matlab session from the terminal and I get the same error as before. Which startup file did you edit, /home/rc67/startup.m?

            I am not using the mslice version provided by isis, I have my own mslice code.

            Radu

            From: Alex Buts <Alex.Buts@stfc.ac.uk>
            Sent: 09 July 2020 22:53
            To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
            Subject: Re: hpc does not work on isiscompute

            Dear Radu,

            The startup.m in your folder is no longer correct.

            I have modified it to a correct version, but I have to say that I've been told not to support mslice any more, so it may conflict with Horace/Herbert in the near future.

            Within PACE, the mslice compatibility routines have also been removed. Mslice should still work with Horace/Herbert, but may break (by conflicting with Horace/Herbert) at any time.

            It will still work separately.

            Regards,

            Alex

            On 09/07/2020 22:31, Radu Coldea wrote:

                Dear Alex,

                I just tried packaging some files on isiscompute and hpc does not seem to be working, I get error messages

                Undefined function 'hpc' for input arguments of type 'char'.

                Undefined function or variable 'write_nsqw_to_sqw'.

                Could you look into this, do I need to update something in my startup file or is the problem on the hpc setup on isiscompute?  

                Thank you,

                Radu

                -----Original Message-----
                From: Alex Buts - UKRI STFC <alex.buts@stfc.ac.uk>
                Sent: 24 May 2020 09:28
                To: Radu Coldea <radu.coldea@physics.ox.ac.uk>
                Subject: Re: hpc does not work on isiscompute or IDAaaS

                Hi, Radu.

                Thanks.

                I fixed it recently (yesterday), but with the new workflow it takes some time to put it into production.

                I will do it urgently today, though general issues with the parallel file system make this command not very efficient.

                Regards,

                Alex

                On 23 May 2020 21:46, Radu Coldea <radu.coldea@physics.ox.ac.uk> wrote:

                Dear Alex,

                We are trying to package a set of simulated nxspe files into sqw but get an error message when the hpc is initialized, see error reproduced below on a freshly started matlab.

                We get the same on both isiscompute and IDAaaS; this is something recent. Has something changed in the way hpc is set up/managed on isiscompute/IDAaaS? Do we have to have something specific in our startup.m? Can you let us know? Just now we only have

                    horace_on;

                    herbert_on;

                is this sufficient, are those the recommended commands?

                Thank you,

                Radu and Miska

                +++++++++++++++++++++++++

                Startup directory is:

                /home/rc67/

                >> cd /instrument/MERLIN/RBNumber/RB1820500/

                >> hpc on

                No appropriate method, property, or field 'hpc_config' for class 'opt_config_manager'.

                Error in find_hpc_options (line 19)

                hpc_cfg = config.hpc_config;

                Error in hpc (line 41)

                        find_hpc_options(hpc_options_names,'-set_config');

abuts commented 4 years ago

Fixed mainly in Herbert. Only tests are here.