gnosygnu / xowa

xowa offline wiki application
Other
375 stars 41 forks source link

wikidata sortorder #743

Closed desb42 closed 4 years ago

desb42 commented 4 years ago

Looking at page fr.wikipedia.org/wiki/Herbert_C._Brown In the Infobox is a section 'Autres informations'

The mediawiki version shows: sortorder1

the xowa version shows: sortorder2

within the red ellipse, the order of the items (in this case universities) are in a different order

This data originates from WIkidata sortorder3 Note the order of these universities - it matches the order of the mediawiki version of the page

Diving into this further, this ordering is determined by logic within Module:Wikidata, within the function wd.chronoSort

 1 function wd.chronoSort( claims, inverted )
     for _, claim in ipairs( claims ) do
       if not claim.dateSortKey then
         local snack = claim.mainsnak or claim
 5       local iso
         if (snack.snaktype == 'value') and (snack.datatype == 'time') then
           iso = snack.datavalue.value.time
         else
           iso = timeFromQualifs(claim, datequalifiers) or '0'
10       end
         -- transformation en nombre (indication de la base car gsub retourne deux valeurs)
         iso = tonumber( iso:gsub( '(%d)%D', '%1' ), 10 )
         claim.dateSortKey = iso
       end
15   end
     table.sort( 
       claims,
       function ( c1, c2 )
         if inverted then
20         return c2.dateSortKey < c1.dateSortKey
         end
         return c1.dateSortKey < c2.dateSortKey
       end
     )
25   return claims
   end

For these universities there are no time values, so each 'claim.dateSortKey' (line 13) has a value of 0

Starting on line 16, there is a 'table.sort'

I think there might be some subtle luaj issue where the sorting of items where all of them have the same sortkey, the order changes.

btw, the 'Membre de' section also exhibits the same issue

gnosygnu commented 4 years ago

Wow! Nice breakdown!

Let me pull down fr.wikipedia.org and try to reproduce. Will comment tomorrow. Thanks!

gnosygnu commented 4 years ago

Yup, definitely looks a sorting issue. I no-op'd the table.sort by commenting out the sort lines in luaj_xowa. This gave the expected order.

Regarding sorting:

That said, some implementations of quick sort can be stable, and lua may be doing it in its auxsort

Am going to work on converting auxsort to Java and seeing if that works

Kudos for the great intuition!

gnosygnu commented 4 years ago

Based on various posts, it looks like Lua's table.sort is not supposed to be stable. See links below

That said, I changed XOWA to use lua at home/wiki/Special:XowaCfg?grp=xowa.addon.scribunto and the behavior matches MediaWiki.

So, it looks like auxsort is the way to go


desb42 commented 4 years ago

I too, took a look at auxsort and have produced an implementation of Quicksort and MergeSort

I can see now why auxsort looks like a 'stable sort' - it is, for array sizes of three or less!

Please see attached sorts.zip

gnosygnu commented 4 years ago

Cool, thanks! Lua is definitely doing something odd with its sort

Unfortunately, I couldn't get the sorts.zip to pass the tests. It fails with org.luaj.vm2.LuaError: name:3 vm error: java.lang.ArrayIndexOutOfBoundsException: -1

Test method below

It does pass with a traditional QuickSort method (but it is not stable). Another excerpt lower below


    @Test public void sort() {
        testSortBasic("'d', 'c', 'b', 'a'", "1=a;2=b;3=c;4=d;");
        testSortBasic("'e', 'd', 'c', 'b', 'a'", "1=a;2=b;3=c;4=d;5=e;");
        testSortBasic("'a', 'd', 'c', 'b', 'e'", "1=a;2=b;3=c;4=d;5=e;");
        testSortBasic("'a', 'b', 'c', 'd', 'e'", "1=a;2=b;3=c;4=d;5=e;");
    }
    private void testSortBasic(String tblStr, String expd) {
        fxt.Clear();
        fxt.Init__script
        ( "local tbl = {" + tblStr + "};"
        , "table.sort(tbl);"
        , "local rv = '';"
        , "table.foreachi"
        , "( tbl"
        , ", function (k, v)"
        , "    rv = rv .. k .. '=' .. v .. ';'"
        , "  end"
        , ");"
        , "return rv;"
        );
        fxt.Test(expd);
    }
    class Quick {
        private final LuaTable a;
        private final LuaValue cmpfunc;
        public Quick(LuaTable a, LuaValue cmpfunc) {
            this.a = a;
            this.cmpfunc = cmpfunc;
        }
        // REF.CODE:https://algs4.cs.princeton.edu/23quicksort/Quick.java.html
        // quicksort the subarray from a[lo] to a[hi]
        public void sort(int lo, int hi) {
            if (hi <= lo) return;
            int j = partition(lo, hi);
            sort(lo, j-1);
            sort(j+1, hi);
        }
        private int partition(int lo, int hi) {
            int i = lo;
            int j = hi + 1;
            LuaValue v = a.array[lo];
            while (true) {

                // find item on lo to swap
//              while (less(a.array[++i], v)) {
                while (less(++i, lo)) {
                    if (i == hi) break;
                }

                // find item on hi to swap
//              while (less(v, a.array[--j])) {
                while (less(lo, --j)) {
                    if (j == lo) break;      // redundant since a[lo] acts as sentinel
                }

                // check if pointers cross
                if (i >= j) break;

                exch(i, j);
            }

            // put partitioning item v at a[j]
            exch(lo, j);

            // now, a[lo .. j-1] <= a[j] <= a[j+1 .. hi]
            return j;
        }

        // is v < w ?
        private boolean less(int i, int j) {
            if (i == j) return false;   // optimization when reference equals
            return a.compare(i, j, cmpfunc);
        }

        // exchange a[i] and a[j]
        private void exch(int i, int j) {
            LuaValue swap = a.array[i];
            a.array[i] = a.array[j];
            a.array[j] = swap;
        }
    }
desb42 commented 4 years ago

My bad - I did not do a thorough test of any table size above three

Attached is a revised version. This was partly due to my translation of Lua code to Java and forgetting that the array offset is zero in Java and one in Lua

Quicksort.zip

My test data is

{'a', 'a', 'a', 'a', 'a'}
{'a'}
{'b', 'a'}
{'c', 'b', 'a'}
{'d', 'c', 'b', 'a'}
{'e', 'd', 'c', 'b', 'a'}
{'a', 'b'}
{'a', 'b', 'c'}
{'a', 'b', 'c', 'd'}
{'a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'd', 'e', 'b', 'c', 'd', 'e', 'd', 'e', 'b', 'c', 'd', 'e', 'd', 'e', 'b', 'c', 'd', 'e', 'd', 'e', 'b', 'c', 'd', 'e'}
gnosygnu commented 4 years ago

Nice work! I was struggling with translating the C code and getting the tests to pass.

I was able to shoehorn your code to the original C form, but am still trying to figure out the difference in bounds (see the // desb42@ vs ltablib.c sections).

Will spend another hour or so on this later tonight / morning, and if no progress, will take your version as is

Again, thanks! That was quite impressive!

    public void sort(LuaValue comparator) {
        if (m_metatable != null && m_metatable.useWeakValues()) {
            dropWeakArrayValues();
        }
        int n = array.length;
        while ( n > 0 && array[n-1] == null )
            --n;
//      if ( n > 1 )
//          heapSort(n, comparator);
        if ( n > 1 )
            auxsort(this, 0, n - 1, comparator);
//      if ( n > 1 ) {
//          Quick quick = new Quick(this, comparator);
//          quick.sort(0, n - 1);
//      }
//      if ( n > 1 ) {
//          Quicksort ob = new Quicksort(m_metatable, array, n, comparator);
//          ob.sort(n);
//      }
    }
    private void set2 (LuaTable tbl, int i, int j) {
//    lua_rawseti(L, 1, i);
//    lua_rawseti(L, 1, j);
        LuaValue iVal = tbl.array[i];
        LuaValue jVal = tbl.array[j];
        tbl.array[i] = jVal;
        tbl.array[j] = iVal;
    }

    private boolean sort_comp(int a, int b, LuaValue cmpfunc) {
//      if (!lua_isnil(L, 2)) {  /* function? */
//          int res;
//          lua_pushvalue(L, 2);
//          lua_pushvalue(L, a-1);  /* -1 to compensate function */
//          lua_pushvalue(L, b-2);  /* -2 to compensate function and `a' */
//          lua_call(L, 2, 1);
//          res = lua_toboolean(L, -1);
//          lua_pop(L, 1);
//          return res;
//      }
//      else  /* a < b? */
//          return lua_lessthan(L, a, b);
        return compare(a, b, cmpfunc);
    }
    private void auxsort(LuaTable tbl, int l, int u, LuaValue cmpfunc) {
        while (l < u) {  /* for tail recursion */
            int i, j;
            /* sort elements a[l], a[(l+u)/2] and a[u] */
//          lua_rawgeti(L, 1, l);
//          lua_rawgeti(L, 1, u);
            //if (sort_comp(L, -1, -2))  /* a[u] < a[l]? */
            if (sort_comp(u, l, cmpfunc))  /* a[u] < a[l]? */
              set2(tbl, l, u);  /* swap a[l] - a[u] */
//          else
//            lua_pop(L, 2);
            if (u-l == 1) break;  /* only 2 elements */
            i = (l+u)/2;
//          lua_rawgeti(L, 1, i);
//          lua_rawgeti(L, 1, l);
//          if (sort_comp(L, -2, -1))  /* a[i]<a[l]? */
            if (sort_comp(i, l, cmpfunc)) {  /* a[i]<a[l]? */
                set2(tbl, i, l);
            }
            else {
//            lua_pop(L, 1);  /* remove a[l] */
//            lua_rawgeti(L, 1, u);
//            if (sort_comp(tbl, -1, -2))  /* a[u]<a[i]? */
              if (sort_comp(u, i, cmpfunc))  /* a[u]<a[i]? */
                set2(tbl, i, u);
//            else
//              lua_pop(L, 2);
            }
            if (u-l == 2) break;  /* only 3 elements */
//          lua_rawgeti(L, 1, i);  /* Pivot */
//          lua_pushvalue(L, -1);
//          lua_rawgeti(L, 1, u-1);
            // ltablib.c
//          int P = i;
//          set2(tbl, i, u-1); // ltablib.c is u - 1
//          /* a[l] <= P == a[u-1] <= a[u], only need to sort from l+1 to u-2 */
//          i = l; j = u-1; // i = l; j = u-1

            // desb42@
            set2(tbl, i, u); // ltablib.c is u - 1
            /* a[l] <= P == a[u-1] <= a[u], only need to sort from l+1 to u-2 */
            i = l - 1; j = u; // i = l; j = u-1
            for (;;) {  /* invariant: a[l..i] <= P <= a[j..u] */
              /* repeat ++i until a[i] >= P */
//            while (lua_rawgeti(L, 1, ++i), sort_comp(-1, -2)) {
/*
              compare(0, upper); // condition 'b' as Pivot
              while (compare_b(++i) && i <= upper) {
              }
*/
              while (++i < u && sort_comp(i, u, cmpfunc)) {
                if (i>u)
                    error("invalid order function for sorting");
//              lua_pop(L, 1);  /* remove a[i] */
              }
              /* repeat --j until a[j] <= P */
//            while (lua_rawgeti(L, 1, --j), sort_comp( -3, -1)) {
//            while (--j > l && sort_comp(P, j, cmpfunc)) {
/*
                compare(upper, 0); // condition 'a' as Pivot
                while (compare_a(--j) && j > lower) {
                }
 */
              while (--j > l && sort_comp(u, j, cmpfunc)) {
                if (j<l)
                    error("invalid order function for sorting");
//              lua_pop(L, 1);  /* remove a[j] */
              }
              if (j<i) {
//              lua_pop(L, 3);  /* pop pivot, a[i], a[j] */
                break;
              }
              set2(tbl, i, j);
            }
//          lua_rawgeti(L, 1, u-1);
//          lua_rawgeti(L, 1, i);

            // ltablib.c
//          set2(tbl, u, i-1);  /* swap pivot (a[u-1]) with a[i] */

            // desb42@
            set2(tbl, u, i);  /* swap pivot (a[u-1]) with a[i] */
            /* a[l..i-1] <= a[i] == P <= a[i+1..u] */
            /* adjust so that smaller half is in [j..i] and larger one in [l..u] */
            if (i-l < u-i) {
              j=l; i=i-1; l=i+2;
            }
            else {
              j=i+1; i=u; u=j-2;
            }
            auxsort(tbl, j, i, cmpfunc);  /* call recursively the smaller one */
        } /* repeat the routine for the larger one */
    }
desb42 commented 4 years ago

looking at the C code, line 266 is, to me, a clue to this off-by-one thingy

This specifies the range as 1 to n, so the array access is arr[1] ... arr[n-1] whereas luaj uses the call 'auxsort(this, 0, n - 1, comparator);' which is low to high so effectively arr[0] ... arr[high]

Another thing that the C code does (and I attempted to replicate is that the pivot value is calculated once and then compared against, rather than recalculating the pivot value for every comparison)

gnosygnu commented 4 years ago

This specifies the range as 1 to n, so the array access is arr[1] ... arr[n-1] whereas luaj uses the call 'auxsort(this, 0, n - 1, comparator);' which is low to high

Good catch! That was it! Added some extra -1s and that worked. Thanks!

Another thing that the C code does (and I attempted to replicate is that the pivot value is calculated once and then compared against, rather than recalculating the pivot value for every comparison)

Yeah, the C code is very stack-based which allows them to pull the value once. Do you think this will be a big difference? The array[index] should be fine, so it's just a matter of m_metatable.arrayget(index).

Am thinking of including both versions, and adding a low-level option to swap between the two....

gnosygnu commented 4 years ago

Cleaned up the code, and added a few more tests. Decided to default to @desb42's QuickSort version as it's much more readable and optimized. However, the 3-member test fails: testSortBasic("'c', 'b', 'a'", "'a', 'b', 'c'");

It's a little late here, so I'll look at more in the morning. In addition, will add more tests for the cmpfunc to make sure arrayIndexes are passed correctly (base-1 vs base-0) as well as have a test for the "stable" sort (something like the original bug reported above)

gnosygnu commented 4 years ago

So, 1-line fix for 3-member test failure. Debugging is much easier when it's not 3 AM in the morning :)

Anyway, added more tests for the comparator functionality. The LuaTableSortDesb42 looks good. :+1:

Just saying it again, even at the risk of being excessive, but thanks again for all the help! The initial investigation was incredibly thorough and detailed, and the actual port of the C code was great workmanship. I had stared at my version for a while, and couldn't figure out the Pivot part until I saw yours. Well done!