sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.43k stars 479 forks source link

AG codes and decoders #27957

Closed kwankyu closed 3 years ago

kwankyu commented 5 years ago

We add AG codes and unique decoders for AG codes.

In Attachments below, there are Jupyter notebooks demonstrating the new features provided by this ticket. These were presented in 2019 SIAM Conference on Algebraic Geometry.


The author was supported by NRF of Korea 2018, 2019, 2020

Component: coding theory

Author: Kwankyu Lee

Branch/Commit: 7097dc7

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/27957

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

6f7dca1Fix a typo
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from c7df4e6 to 6f7dca1

tscrim commented 3 years ago
comment:38

So I have done a bit of refactoring to improve readability and to try and improve the speed. The biggest thing I did is increase the amount of things that know what type they are. I also refactored out some common code and did some improvements where I could. Without knowing more about what could be be taking time, I cannot suggest how to further improve the speed. Hopefully this has given things a noticeable speedup.

However, one thing that does need to be addressed is the length of the doctests. For # long time, everything is fine. So one option is to mark a lot of the doctests that way. Would it be possible to include some smaller examples in order to increase the doctesting speed?

With regards to the documentation, it is not just there for users. It is also useful for developers/maintainers who will be looking at this code. While it is an implementation detail, it is a fairly macro detail. Including it just makes it less opaque for someone who is trying to figure out how things are working. Sage has many similar implementation detail classes included in its documentation (e.g., polydict).


edit - Forgot to squash my commits...

tscrim commented 3 years ago

Reviewer: Travis Scrimshaw

tscrim commented 3 years ago

Changed commit from 6f7dca1 to d415fb7

tscrim commented 3 years ago

Changed branch from u/klee/27957 to public/coding_theory/AG_codes-27957

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

0b40831Including some stronger typing and other refactoring.
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from d415fb7 to 0b40831

kwankyu commented 3 years ago
comment:40

As far as I understand, Py_ssize_t is the type for index variables. Is there a reason that you used that type for non-index variables?

tscrim commented 3 years ago
comment:41

It is like an int but it uses the compiler sized version. So it works for things that would be too big for an int on computers that can handle that at the CPU level. Granted, I doubt anyone will ever need that big, but it is better to support it IMO (and we have to worry less about what things should be an int and what shouldn't).

kwankyu commented 3 years ago
comment:42

Replying to @tscrim:

It is like an int but it uses the compiler sized version. So it works for things that would be too big for an int on computers that can handle that at the CPU level. Granted, I doubt anyone will ever need that big, but it is better to support it IMO (and we have to worry less about what things should be an int and what shouldn't).

Then I would use int type since (much) readibility is lost by using unclearly-named type Py_ssize_t just for common integer variables that never would contain that big integers.

My general feeling for your changes is similar and such that readibility of the code is decreased while gaining speed improvement. A simple test showed that you gained another 10% speed improvement.

I will try to recover readibility while not losing much of the speed improvement that you already gained.

kwankyu commented 3 years ago
comment:43

Replying to @tscrim:

However, one thing that does need to be addressed is the length of the doctests. For # long time, everything is fine. So one option is to mark a lot of the doctests that way. Would it be possible to include some smaller examples in order to increase the doctesting speed?

The examples are already smallest without being trivial. I would use # long time.

With regards to the documentation, it is not just there for users. It is also useful for developers/maintainers who will be looking at this code. While it is an implementation detail, it is a fairly macro detail. Including it just makes it less opaque for someone who is trying to figure out how things are working. Sage has many similar implementation detail classes included in its documentation (e.g., polydict).

PolyDict is not a comparable example. PolyDict is a tool to be used by developers. _Decoder_K is not such one. Anyway, I think we can compromise. I can remove the leading underline so that the class (and _Decoder_K_extension as well) is included in the documentation, but instead would hide some methods that are not worth to be included in the documentation.

kwankyu commented 3 years ago

Changed commit from 0b40831 to none

kwankyu commented 3 years ago

Changed branch from public/coding_theory/AG_codes-27957 to public/coding/ag-codes-27957

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

e03bf5eAdd AG codes and decoders
c5baba9Merge branch 'develop' into ag-codes-trac27957
eeba013Fix a bug in constant field extension of AG code"
241302dMerge branch 'develop' into ag-codes-trac27957
c7df4e6Fixes for reviewer comments
6f7dca1Fix a typo
0b40831Including some stronger typing and other refactoring.
625d9f1Merge branch 'public/coding_theory/AG_codes-27957' of git://trac.sagemath.org/sage into ag-codes-trac27957
1e8b981Improve readibility
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Commit: 1e8b981

kwankyu commented 3 years ago
comment:47

I restored the structure of the code close to the original while keeping the most of your speed improvements. A little bit of speed was lost, which I guess is due to converting inline functions to methods. I removed some castings which I think unnecessary, to improve readibility.

tscrim commented 3 years ago
comment:48

So I mostly agree. You can still make the methods inline IIRC. Yet, I don't think there isn't any reason to remove a casting as it decreases the readability of the code since now we don't know what type of object it should be. This also has an effect on speed because the C code becomes more complicated.

Those decorators on the helper functions are also important for speed as they remove a number of safety checks that performed to prevent crashes that we are guaranteeing are not happening. They have no effect on readability IMO.

Also, these are useful because it simplifies how Cython translates it into C code:

-                    wlt = gamma * dlt + <Py_ssize_t> (dR[i])
+                    wlt = gamma * dlt + dR[i]

(Ideally, I would want things like dR to be C arrays, but then we have to manually deal with pickling.)

One other thing that might be useful would be having the degree return an actual int object.

If you want to look at the compiled C code, you can find it at

$SAGE_ROOT/build/pkgs/sagelib/src/build/cythonized/sage/coding/ag_code_decoders.c

Just search for a line number of line of code in the pyx file (e.g., 1135 but not 1092 or 1124). You don't necessarily have to understand it, but you can get a feel of the complexity, which roughly corresponds to time needed.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 1e8b981 to 837b276

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

837b276Sprinkle long time to doctests
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

9eabc9fFixes for reviewer comments
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 837b276 to 9eabc9f

kwankyu commented 3 years ago
comment:51

Replying to @tscrim:

So I mostly agree. You can still make the methods inline IIRC.

I didn't know that. Thanks.

I don't think there is any reason to remove a casting as it decreases the readability of the code since now we don't know what type of object it should be. This also has an effect on speed because the C code becomes more complicated.

I am thinking of the readability for someone who looks at the code to understand the algorithm. Castings in the middle of code does not help reading the code. I want to balance speed and readability. In particular I want to make the code look more like Python than C.

But I was careful not to remove necessary castings. I tested before I remove a casting to see the effect. I may have missed some, though I restored a few in the last commit.

Those decorators on the helper functions are also important for speed as they remove a number of safety checks that performed to prevent crashes that we are guaranteeing are not happening. They have no effect on readability IMO.

I didn't know that. Restored. Thanks.

Also, these are useful because it simplifies how Cython translates it into C code:

-                    wlt = gamma * dlt + <Py_ssize_t> (dR[i])
+                    wlt = gamma * dlt + dR[i]

Okay. Restored.

One other thing that might be useful would be having the degree return an actual int object.

Done.

If you want to look at the compiled C code, you can find it at

$SAGE_ROOT/build/pkgs/sagelib/src/build/cythonized/sage/coding/ag_code_decoders.c

Just search for a line number of line of code in the pyx file (e.g., 1135 but not 1092 or 1124). You don't necessarily have to understand it, but you can get a feel of the complexity, which roughly corresponds to time needed.

Thanks. I am testing with short examples with %%cython --annotate.


New commits:

9eabc9fFixes for reviewer comments
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

0e68c5dOptimize _substitution method
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 9eabc9f to 0e68c5d

kwankyu commented 3 years ago
comment:53

A profiling showed that the most time consuming part of the decoder is _substitution method. So I tried to optimize the code for speed in the last commit.

According to simple tests, the performance of the decoder has improved after each revision after cythonization.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

2dddef3Trying to further optimize _substitute().
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 0e68c5d to 2dddef3

tscrim commented 3 years ago
comment:55

Thank you for the further changes and testing for the timing. I tried to really optimize the _substitute as much as I could by not creating as many transient objects by again modifying the input. Likewise, I just changes gbmat into mat as the list of (row) vectors is what you really wanted to manipulate.

From looking at the C code, there was no different with the extra <Polynomial> casts in _substitute(). So I just removed them for readability.

Some of the other changes I made in accordance with the change to mat might affect readability for trivial speed improvements; so feel free to revert.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 2dddef3 to 21d3f89

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

21d3f89Trivial fixes
kwankyu commented 3 years ago
comment:57

Replying to @tscrim:

I tried to really optimize the _substitute as much as I could by not creating as many transient objects by again modifying the input. Likewise, I just changes gbmat into mat as the list of (row) vectors is what you really wanted to manipulate.

This is the timings for a simple test:

|| pure python || rev1 || rev2 ||rev3 || rev4 || this revision ||
|| 25.9ms || 25.1ms || 24.8ms || 24.5ms || 24.1ms ||15.9ms ||

So you achieved a lot with the last patch. Thank you!

From looking at the C code, there was no different with the extra <Polynomial> casts in _substitute(). So I just removed them for readability.

Okay. I am getting intoxicated with what cythonization brings and caring readibility less :-)

Still I removed, for instance, <FreeModuleElement> from

temp = <FreeModuleElement> (<list> mul_mat[j])[i]

since temp is already declared as <FreeModuleElement> type before, and hence I think casting is unnecessary. I wonder if you would agree.

tscrim commented 3 years ago
comment:58

When you do <Foo> bar in Cython, it doesn't explicitly do a cast, but it tells Cython that I (the programmer) guarantee that this object is an instance of Foo. So when used improperly, it can cause a crash. However, when you simply do temp = bar, Cython does a check of the type to make sure it is correct and errors out. While this check is fast, it can matter in a tighter loop. So in _substitute(), because it is the key part, I would suggest leaving it in. However, that is up to you.

Something else that might be done to optimize things is to have message be backwards in the else branch of the code to avoid doing message.insert(0, foo). This is really bad as it has to move everything along in memory since it is stored as a C vector (which is basically an array). So just make these changes:

-message.insert(0, winner)
+message.append(winner)
-message.insert(0, value)
+message.append(value)

and then add a message.reverse() at the end of the else: block.

kwankyu commented 3 years ago
comment:59

Replying to @tscrim:

When you do <Foo> bar in Cython, it doesn't explicitly do a cast, but it tells Cython that I (the programmer) guarantee that this object is an instance of Foo. So when used improperly, it can cause a crash. However, when you simply do temp = bar, Cython does a check of the type to make sure it is correct and errors out. While this check is fast, it can matter in a tighter loop. So in _substitute(), because it is the key part, I would suggest leaving it in.

I see. Perhaps I misunderstood it when I experimented with a primitive type like <int>, rather than an extension type.

I would put them back then.

Something else that might be done to optimize things is to have message be backwards in the else branch of the code to avoid doing message.insert(0, foo). This is really bad as it has to move everything along in memory since it is stored as a C vector (which is basically an array). So just make these changes:

-message.insert(0, winner)
+message.append(winner)
-message.insert(0, value)
+message.append(value)

and then add a message.reverse() at the end of the else: block.

Okay. I will do it.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 21d3f89 to e333ffe

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

33fc493Put back some removed castings
e333ffeReverse message construction
kwankyu commented 3 years ago
comment:61

The new timing is

15.5ms
kwankyu commented 3 years ago
comment:62

Is there anything else to do?

tscrim commented 3 years ago
comment:63

I think it is good. Although I am just a little nervous about the pickling solution. I think it is a good solution, but I am not entirely sure. I think we should ask on sage-devel about it.

kwankyu commented 3 years ago
comment:64

Replying to @tscrim:

I think it is good. Although I am just a little nervous about the pickling solution. I think it is a good solution, but I am not entirely sure. I think we should ask on sage-devel about it.

As far as I understand, an object is constructed from the pickled data and uses the current code. I think it is possible in principle that a change in a different part of Sage may affect the constructor of the object and hence a newly constructed object may behave differently from an unpickled object. However, in our case, the decoder's data is all of basic types such as finite field elements, polynomials, matrices , vectors, etc. Hence it is unlikely that the data of a newly constructed object is different from an unpickled data. Moreover even if it happens, it would be detected everywhere in Sage.

Do you still want to discuss this matter in sage-devel?

tscrim commented 3 years ago
comment:65

I think we should because what if the pickles get replaced that has something evil included (say, some extra method that is monkey-patched into __init__ that deletes everything on the system). We also haven't done anything like this in Sage (at least since we removed the pickle jar). So I just want to be a bit cautious here before proceeding.

kwankyu commented 3 years ago
comment:66

Replying to @tscrim:

I think we should

Okay.

what if the pickles get replaced that has something evil included.

The pickles would be part of the Sage lib. Why pickles are more vulnerable than other code? Do you imagine that some evil author and reviewer cooperate to slip in the evil code passing the release manager's eyes?

say, some extra method that is monkey-patched into __init__ that deletes everything on the system

Is this possible with a pickle? As far as I know, a pickle does not contain code.

We also haven't done anything like this in Sage (at least since we removed the pickle jar). So I just want to be a bit cautious here before proceeding.

I agree. I also wondered if there should be a central place to hold the pickles for this doctesting purpose. Note that I added .../coding/tests directory for this purpose.

tscrim commented 3 years ago
comment:67

Replying to @kwankyu:

Replying to @tscrim:

I think we should

Okay.

what if the pickles get replaced that has something evil included.

The pickles would be part of the Sage lib. Why pickles are more vulnerable than other code? Do you imagine that some evil author and reviewer cooperate to slip in the evil code passing the release manager's eyes?

You end up reading a file on the system and (blindly) executing it in the doctest.

say, some extra method that is monkey-patched into __init__ that deletes everything on the system

Is this possible with a pickle? As far as I know, a pickle does not contain code.

A pickle can (and IIRC usually) contain an objects __dict__, which can contain functions that can be bound to the object I believe. I am not really an expert on this, so it is possible I am just being paranoid (or there are other more likely security breaches to occur).

We also haven't done anything like this in Sage (at least since we removed the pickle jar). So I just want to be a bit cautious here before proceeding.

I agree. I also wondered if there should be a central place to hold the pickles for this doctesting purpose. Note that I added .../coding/tests directory for this purpose.

I saw that and thought that was good. The only place that was natural to me is in the $SAGE_SRC/tests folder (possibly as a coding subfolder).

fchapoton commented 3 years ago
comment:68

too many colons here:

+    TESTS::
+
+    We save the decoder for later tests::
fchapoton commented 3 years ago
comment:69

I guess Apery should be "Apéry"

dimpase commented 3 years ago
comment:70

Can these pickles be replaced by text data? I really struggle to imagine a smallish Sage object that cannot be quickly built from text data or json.

dimpase commented 3 years ago
comment:71

in any event, code to build these .sobj or otherwise pickled objects should be easy to run (I don't know how to run tests which are tagged # not tested).

kwankyu commented 3 years ago
comment:72

Replying to @dimpase:

in any event, code to build these .sobj or otherwise pickled objects should be easy to run (I don't know how to run tests which are tagged # not tested).

Obviously you run the code in the example just above the test tagged with # not tested to construct the object to save.

Do I misunderstand you?

kwankyu commented 3 years ago
comment:73

Replying to @dimpase:

Can these pickles be replaced by text data? I really struggle to imagine a smallish Sage object that cannot be quickly built from text data or json.

It takes no time, relatively, to load and save the pickles. To construct an object(in our case, decoder) to save takes time.

Do you mean that dumping an object to a string and saving the string somewhere is better than a pickle in binary format? Would you explain how and in what respect it is better?

kwankyu commented 3 years ago
comment:74

Replying to @fchapoton:

I guess Apery should be "Apéry"

Right. I didn't try to input an accented character on my keyboard. I will.