ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/
Other
105 stars 27 forks source link

Memory fault during unsymmetric scaling of singular matrix with Hungarian algorithm #200

Open chrhansk opened 4 months ago

chrhansk commented 4 months ago

Consider the following example problem:

#include <stdlib.h>
#include <stdio.h>
#include "spral.h"

int main(void) {
   /* Derived types */
   struct spral_scaling_hungarian_options options;
   struct spral_scaling_hungarian_inform inform;

   /* Other variables */
   int match[5];
   double rscaling[5], cscaling[5];

   /* Data for unsymmetric matrix:
    * ( 2,  1,  0,  0,  0)
    * ( 1,  4,  1,  0,  1)
    * ( 0,  0,  0,  0,  0)
    */
   int m = 3, n = 5;
   int ptr[]    = { 1, 3, 5, 6, 6, 7};
   int row[]    = { 1, 2, 1, 2, 2, 2};
   double val[] = { 2., 1., 1., 4., 1., 1. };
   printf("Initial matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 1);

   /* Perform unsymmetric scaling */
   spral_scaling_hungarian_default_options(&options);

   options.array_base = 1;
   options.scale_if_singular = true;

   spral_scaling_hungarian_unsym(m, n, ptr, row, val, rscaling, cscaling, match,
         &options, &inform);

   if(inform.flag<0) {
      printf("spral_scaling_hungarian_unsym() returned with error %5d", inform.flag);
      exit(1);
   }

   /* Print scaling and matching */
   printf("Matching:");
   for(int i=0; i<n; i++) printf(" %10d", match[i]);
   printf("\nRow Scaling: ");
   for(int i=0; i<m; i++) printf(" %10.2le", rscaling[i]);
   printf("\nCol Scaling: ");
   for(int i=0; i<n; i++) printf(" %10.2le", cscaling[i]);
   printf("\n");

   return 0;
}

This outputs the following:

Initial matrix:
Real unsymmetric matrix, dimension 3x5 with 6 entries.
1:   2.0000E+00   1.0000E+00                                       
2:   1.0000E+00   4.0000E+00   1.0000E+00                1.0000E+00
3:                                                                 
Matching:          1          5         -3      32522         -4
Row Scaling:    8.41e-01   8.41e-01   8.41e-01
Col Scaling:    5.95e-01   2.97e-01   1.19e+00   1.00e+00   1.19e+00

The matching has the clearly invalid entry 32522 (random, changing at each iteration). There is also a valgrind error:

==213605== Invalid write of size 8
==213605==    at 0x49AF0A5: __spral_scaling_MOD_match_postproc (scaling.f90:1675)
==213605==    by 0x49B59AB: __spral_scaling_MOD_hungarian_wrapper (scaling.f90:684)
==213605==    by 0x49B9679: __spral_scaling_MOD_hungarian_scale_unsym_int64 (scaling.f90:225)
==213605==    by 0x49B9C02: __spral_scaling_MOD_hungarian_scale_unsym_int32 (scaling.f90:202)
==213605==    by 0x48EA1D8: spral_scaling_hungarian_unsym (scaling.f90:729)
==213605==    by 0x10936A: main (hungarian_unsym.c:33)
==213605==  Address 0x7260180 is 32 bytes before a block of size 48 in arena "client"

The offending line attempts to access the rscaling variable at the match(i) index, where i = 3 and match = (1, 5, -3). It is clearly infeasible to access the array at a negative index, which is causing the initial memory fault. On the other hand, I don't know why the index becomes negative in the first place...

mjacobse commented 4 months ago

The matching used in this scaling is documented to be based on MC64 https://github.com/ralna/spral/blob/662c7ac82ad89265b4854fbac8a26b2e2ffb4e38/src/scaling.f90#L936-L938

which itself documents that "If the matrix is structurally singular [...], then the entries of [match] for which there is no entry on the diagonal of the permuted matrix are set negative" (https://www.hsl.rl.ac.uk/specs/mc64.pdf). So in the example above, row 3 is causing the matrix to be structurally singular and supposedly that's why -3 is returned.

I'm not quite sure how the scaling has to handle such cases. Perhaps the negative entries of match should just be set to 0 to signal "unmatched", similar to what the auction algorithm does here https://github.com/ralna/spral/blob/662c7ac82ad89265b4854fbac8a26b2e2ffb4e38/src/scaling.f90#L1487-L1488

such that they are skipped in match_postproc, e.g. the aforementioned offending line.

chrhansk commented 4 months ago

That is probably a good idea. Seeing as the spral_scaling_hungarian_options struct has an entry scale_if_singular, a user would expect that singular matrices can be scaled (albeit with a warning)...

jfowkes commented 3 months ago

It has been implemented in this way in MC64 so that the rows which cause the matrix to be structurally singular can be identified, but I agree that it probably makes sense to set the negative entries of match to 0. Happy to accept a PR.

mjacobse commented 3 months ago

I noticed that this actually already goes wrong before what was mentioned. This was hidden with match being allocated larger than necessary in the example (see #207). When allocating with size m (on the heap instead of the stack to make it easier to detect with valgrind) like so

/* examples/C/scaling/hungarian_unsym.c - Example code for SPRAL_SCALING */
#include <stdlib.h>
#include <stdio.h>
#include "spral.h"

int main(void) {
   /* Derived types */
   struct spral_scaling_hungarian_options options;
   struct spral_scaling_hungarian_inform inform;

   /* Data for unsymmetric matrix:
    * ( 2  1         )
    * ( 1  4  1    1 )
    * (              ) */
   int m = 3, n = 5;
   int ptr[]    = { 0,        2,        4,   5,   5,  6 };
   int row[]    = { 0,   1,   0,   1,   1,   1   };
   double val[] = { 2.0, 1.0, 1.0, 4.0, 1.0, 1.0 };
   printf("Initial matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);

   /* Perform symmetric scaling */
   int* match = malloc(sizeof(int)*m);
   double* rscaling = malloc(sizeof(double)*m);
   double* cscaling = malloc(sizeof(double)*n);
   spral_scaling_hungarian_default_options(&options);
   spral_scaling_hungarian_unsym(m, n, ptr, row, val, rscaling, cscaling, match,
         &options, &inform);
   if(inform.flag<0) {
      printf("spral_scaling_hungarian_unsym() returned with error %5d", inform.flag);
      exit(1);
   }

   /* Print scaling and matching */
   printf("Matching:");
   for(int i=0; i<m; i++) printf(" %10d", match[i]);
   printf("\nRow Scaling: ");
   for(int i=0; i<m; i++) printf(" %10.2le", rscaling[i]);
   printf("\nCol Scaling: ");
   for(int i=0; i<n; i++) printf(" %10.2le", cscaling[i]);
   printf("\n");

   /* Calculate scaled matrix and print it */
   for(int i=0; i<n; i++) {
      for(int j=ptr[i]; j<ptr[i+1]; j++)
         val[j] = rscaling[row[j]] * val[j] * cscaling[i];
   }
   printf("Scaled matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);

   free(rscaling);
   free(cscaling);
   free(match);
   /* Success */
   return 0;
}

we already get an error inside of hungarian_match, without even setting options.scale_if_singular to true:

Initial matrix:
Real unsymmetric matrix, dimension 3x5 with 6 entries.
0:   2.0000E+00   1.0000E+00                                       
1:   1.0000E+00   4.0000E+00   1.0000E+00                1.0000E+00
2:                                                                 
==29266== Invalid write of size 4
==29266==    at 0x124E1B: __spral_scaling_MOD_hungarian_match (scaling.f90:1192)
==29266==    by 0x126BA1: __spral_scaling_MOD_hungarian_wrapper (scaling.f90:660)
==29266==    by 0x12AAFF: __spral_scaling_MOD_hungarian_scale_unsym_int64 (scaling.f90:225)
==29266==    by 0x12B0B6: __spral_scaling_MOD_hungarian_scale_unsym_int32 (scaling.f90:201)
==29266==    by 0x10F7F5: spral_scaling_hungarian_unsym (scaling.f90:723)
==29266==    by 0x1092AB: main (hungarian_unsym.c:27)
==29266==  Address 0x4ee02d0 is 4 bytes after a block of size 12 alloc'd
==29266==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==29266==    by 0x109258: main (hungarian_unsym.c:23)

The offending code https://github.com/ralna/spral/blob/7fc5ec3ad16cc8d03578e1e539ffc37eef8ffe44/src/scaling.f90#L1174-L1193 looks dubious to me; for this example with one unmatched and two matched rows, one index is stored in out and two in jperm. That leaves three 0-entries in jperm, causing three indices to be read from out - despite it only containing one. So we essentially overflow and read what happens to remain in the work array from earlier, which just so happens to be out of bounds for iperm. I am not sure right now how this is supposed to work.

This "inner" issue definitely has to be fixed before fixing the "outer" issue can then also allow using options.scale_if_singular == true.

jfowkes commented 3 months ago

@nimgould any thoughts on what was intended here: https://github.com/ralna/spral/issues/200#issuecomment-2176922979

nimgould commented 3 months ago

It is unclear to me what the purpose of this last 1174-1193 section is. The documentation implies that either iperm(i) > 0 gives a match of row i to column iperm(i) or iperm(i) = 0, and there is no match. But this last loop seems to be setting negative values to unmatched rows; what does this mean? And as @mjacobse says there is an inconsistency between the counts for the variable k when marching through the rows (first loop) and columns (second loop). The iperm array becomes the match array in hungarian_scale_unsym & hungarian_scale_sym, and I see no attempt there to correct negative components of match. And hungarian_wrapper is not used anywhere else in spral.

I would be tempted to comment out this segment and see what happens. Perhaps Jonathan had something else in mind?

mjacobse commented 3 months ago

Reassuring to hear that this section seems strange to you too, many thanks.

My understanding is that this function is from MC64, so while it is true that the SPRAL documentation says that 0 is returned for unmatched rows, MC64 itself returns negative entries according to its documentation (https://github.com/ralna/spral/issues/200#issuecomment-2156065178). And indeed, as it stands the SPRAL documentation is incorrect, the hungarian_scale_[un]sym functions do return those negative entries as noticed in https://github.com/ralna/spral/pull/205#pullrequestreview-2121255155.

Just returning 0 from hungarian_match leads to other problems in other SPRAL-internal uses of hungarian_match that do expect the negative entries, as we noticed in https://github.com/ralna/spral/pull/205#issuecomment-2169174578.

But I tend to agree that it might be easiest to just return 0 and adapt all uses to work with that instead of with negative entries. But to be honest I still do not really understand what further information the negative indices in MC64 provide, so perhaps I am missing how they are crucially important for some usecase.

jfowkes commented 3 months ago

According to @isduff the negative entries in MC64 are intended to signal which of the rows are problematic as the negative entry is in fact minus the index of the problematic row. Now the question is, does SSIDS actually use this information anywhere? If not, then I agree with @mjacobse that we should just standardise on returning zero for unmatched entries.

nimgould commented 3 months ago

All of this makes sense for square matrices, but for rectangular ones a 0 is simply a reflection that one cannot possibly match all rows to columns. But, to me anyway, rank defficiency is more than simply not having enough rows or columns, but that there are zero singular values. I do not see how the scheme before the offending lines (the bulk of the subroutine) distinguishes between the two, and thus the offending lines add nothing extra (except for memory trouble!). Thus, as I said earlier, I would simply comment this out.

jfowkes commented 3 months ago

Thanks @nimgould, it looks like we are converging on a solution.