Open chrhansk opened 4 months ago
The matching used in this scaling is documented to be based on MC64 https://github.com/ralna/spral/blob/662c7ac82ad89265b4854fbac8a26b2e2ffb4e38/src/scaling.f90#L936-L938
which itself documents that "If the matrix is structurally singular [...], then the entries of [match] for which there is
no entry on the diagonal of the permuted matrix are set negative" (https://www.hsl.rl.ac.uk/specs/mc64.pdf). So in the example above, row 3
is causing the matrix to be structurally singular and supposedly that's why -3
is returned.
I'm not quite sure how the scaling has to handle such cases. Perhaps the negative entries of match
should just be set to 0 to signal "unmatched", similar to what the auction algorithm does here https://github.com/ralna/spral/blob/662c7ac82ad89265b4854fbac8a26b2e2ffb4e38/src/scaling.f90#L1487-L1488
such that they are skipped in match_postproc
, e.g. the aforementioned offending line.
That is probably a good idea. Seeing as the spral_scaling_hungarian_options
struct has an entry scale_if_singular
, a user would expect that singular matrices can be scaled (albeit with a warning)...
It has been implemented in this way in MC64 so that the rows which cause the matrix to be structurally singular can be identified, but I agree that it probably makes sense to set the negative entries of match
to 0. Happy to accept a PR.
I noticed that this actually already goes wrong before what was mentioned. This was hidden with match being allocated larger than necessary in the example (see #207). When allocating with size m
(on the heap instead of the stack to make it easier to detect with valgrind) like so
/* examples/C/scaling/hungarian_unsym.c - Example code for SPRAL_SCALING */
#include <stdlib.h>
#include <stdio.h>
#include "spral.h"
int main(void) {
/* Derived types */
struct spral_scaling_hungarian_options options;
struct spral_scaling_hungarian_inform inform;
/* Data for unsymmetric matrix:
* ( 2 1 )
* ( 1 4 1 1 )
* ( ) */
int m = 3, n = 5;
int ptr[] = { 0, 2, 4, 5, 5, 6 };
int row[] = { 0, 1, 0, 1, 1, 1 };
double val[] = { 2.0, 1.0, 1.0, 4.0, 1.0, 1.0 };
printf("Initial matrix:\n");
spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);
/* Perform symmetric scaling */
int* match = malloc(sizeof(int)*m);
double* rscaling = malloc(sizeof(double)*m);
double* cscaling = malloc(sizeof(double)*n);
spral_scaling_hungarian_default_options(&options);
spral_scaling_hungarian_unsym(m, n, ptr, row, val, rscaling, cscaling, match,
&options, &inform);
if(inform.flag<0) {
printf("spral_scaling_hungarian_unsym() returned with error %5d", inform.flag);
exit(1);
}
/* Print scaling and matching */
printf("Matching:");
for(int i=0; i<m; i++) printf(" %10d", match[i]);
printf("\nRow Scaling: ");
for(int i=0; i<m; i++) printf(" %10.2le", rscaling[i]);
printf("\nCol Scaling: ");
for(int i=0; i<n; i++) printf(" %10.2le", cscaling[i]);
printf("\n");
/* Calculate scaled matrix and print it */
for(int i=0; i<n; i++) {
for(int j=ptr[i]; j<ptr[i+1]; j++)
val[j] = rscaling[row[j]] * val[j] * cscaling[i];
}
printf("Scaled matrix:\n");
spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);
free(rscaling);
free(cscaling);
free(match);
/* Success */
return 0;
}
we already get an error inside of hungarian_match
, without even setting options.scale_if_singular
to true
:
Initial matrix:
Real unsymmetric matrix, dimension 3x5 with 6 entries.
0: 2.0000E+00 1.0000E+00
1: 1.0000E+00 4.0000E+00 1.0000E+00 1.0000E+00
2:
==29266== Invalid write of size 4
==29266== at 0x124E1B: __spral_scaling_MOD_hungarian_match (scaling.f90:1192)
==29266== by 0x126BA1: __spral_scaling_MOD_hungarian_wrapper (scaling.f90:660)
==29266== by 0x12AAFF: __spral_scaling_MOD_hungarian_scale_unsym_int64 (scaling.f90:225)
==29266== by 0x12B0B6: __spral_scaling_MOD_hungarian_scale_unsym_int32 (scaling.f90:201)
==29266== by 0x10F7F5: spral_scaling_hungarian_unsym (scaling.f90:723)
==29266== by 0x1092AB: main (hungarian_unsym.c:27)
==29266== Address 0x4ee02d0 is 4 bytes after a block of size 12 alloc'd
==29266== at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==29266== by 0x109258: main (hungarian_unsym.c:23)
The offending code https://github.com/ralna/spral/blob/7fc5ec3ad16cc8d03578e1e539ffc37eef8ffe44/src/scaling.f90#L1174-L1193 looks dubious to me; for this example with one unmatched and two matched rows, one index is stored in out
and two in jperm
. That leaves three 0-entries in jperm
, causing three indices to be read from out
- despite it only containing one. So we essentially overflow and read what happens to remain in the work array from earlier, which just so happens to be out of bounds for iperm
. I am not sure right now how this is supposed to work.
This "inner" issue definitely has to be fixed before fixing the "outer" issue can then also allow using options.scale_if_singular == true
.
@nimgould any thoughts on what was intended here: https://github.com/ralna/spral/issues/200#issuecomment-2176922979
It is unclear to me what the purpose of this last 1174-1193 section is. The documentation implies that either iperm(i) > 0 gives a match of row i to column iperm(i) or iperm(i) = 0, and there is no match. But this last loop seems to be setting negative values to unmatched rows; what does this mean? And as @mjacobse says there is an inconsistency between the counts for the variable k when marching through the rows (first loop) and columns (second loop). The iperm array becomes the match array in hungarian_scale_unsym & hungarian_scale_sym, and I see no attempt there to correct negative components of match. And hungarian_wrapper is not used anywhere else in spral.
I would be tempted to comment out this segment and see what happens. Perhaps Jonathan had something else in mind?
Reassuring to hear that this section seems strange to you too, many thanks.
My understanding is that this function is from MC64, so while it is true that the SPRAL documentation says that 0 is returned for unmatched rows, MC64 itself returns negative entries according to its documentation (https://github.com/ralna/spral/issues/200#issuecomment-2156065178). And indeed, as it stands the SPRAL documentation is incorrect, the hungarian_scale_[un]sym
functions do return those negative entries as noticed in https://github.com/ralna/spral/pull/205#pullrequestreview-2121255155.
Just returning 0 from hungarian_match
leads to other problems in other SPRAL-internal uses of hungarian_match
that do expect the negative entries, as we noticed in https://github.com/ralna/spral/pull/205#issuecomment-2169174578.
But I tend to agree that it might be easiest to just return 0 and adapt all uses to work with that instead of with negative entries. But to be honest I still do not really understand what further information the negative indices in MC64 provide, so perhaps I am missing how they are crucially important for some usecase.
According to @isduff the negative entries in MC64 are intended to signal which of the rows are problematic as the negative entry is in fact minus the index of the problematic row. Now the question is, does SSIDS actually use this information anywhere? If not, then I agree with @mjacobse that we should just standardise on returning zero for unmatched entries.
All of this makes sense for square matrices, but for rectangular ones a 0 is simply a reflection that one cannot possibly match all rows to columns. But, to me anyway, rank defficiency is more than simply not having enough rows or columns, but that there are zero singular values. I do not see how the scheme before the offending lines (the bulk of the subroutine) distinguishes between the two, and thus the offending lines add nothing extra (except for memory trouble!). Thus, as I said earlier, I would simply comment this out.
Thanks @nimgould, it looks like we are converging on a solution.
Consider the following example problem:
This outputs the following:
The matching has the clearly invalid entry 32522 (random, changing at each iteration). There is also a valgrind error:
The offending line attempts to access the
rscaling
variable at thematch(i)
index, wherei = 3
andmatch = (1, 5, -3)
. It is clearly infeasible to access the array at a negative index, which is causing the initial memory fault. On the other hand, I don't know why the index becomes negative in the first place...