sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Interface to the KnotInfo and LinkInfo databases #30352

Closed soehms closed 3 years ago

soehms commented 4 years ago

At the moment Sage offers just a small set of 250 named knots (src/sage/knots/knot_table.py) taken from the Rolfsen table. Named proper links aren't available at all.

Nowadays, larger databases for knots and links are available at the Knot Atlas pages in RDF format and at KnotInfo as XLS/XLSX files. Since parsing of CSV files is already supported by Sage, this is a good starting point for producing a Sage package from these files containing about 3000 knots and 4000 proper links together with many of their properties and invariants.

Such a package has a couple of advantages:

  1. Cross-check alternative implementations of certain methods on about 7000 links.
  2. Cross-check against results listed in the KnotInfo database.
  3. Provide properties for these links that are not yet provided by Sage.
  4. Implement a link identification method for our link class (like KnotFinder).
  5. Launch webpages containing additional information for a link or alternate graphical representations.
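
The cross-check idea in items 1 and 2 can be sketched in plain Python. The rows below are hand-written sample data rather than an excerpt of the KnotInfo files, and the column names `name` and `pd_notation` as well as both helper functions are assumptions made for illustration only.

```python
import ast

# Hand-written sample rows; the real tables have many more columns.
SAMPLE_ROWS = [
    {"name": "3_1", "pd_notation": "[[1,5,2,4],[3,1,4,6],[5,3,6,2]]"},
    {"name": "4_1", "pd_notation": "[[4,2,5,1],[8,6,1,5],[6,3,7,4],[2,7,3,8]]"},
]


def crossings_from_name(name):
    """Read the crossing number from a Rolfsen-style name such as '4_1'."""
    return int(name.split("_")[0])


def crossings_from_pd(pd_string):
    """Count crossings in a PD code: one 4-tuple per crossing."""
    return len(ast.literal_eval(pd_string))


def cross_check(rows):
    """Return the names of rows where the two computations disagree."""
    return [
        row["name"]
        for row in rows
        if crossings_from_name(row["name"]) != crossings_from_pd(row["pd_notation"])
    ]


mismatches = cross_check(SAMPLE_ROWS)
print(mismatches)
```

An empty list means both ways of obtaining the value agree on every row; any name that shows up points to either a bug in one implementation or an inconsistency in the data.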

The aim of this ticket is to have the databases accessible in Sage together with conversion methods for the most important properties and invariants.

Many thanks to Allison Moore and Chuck Livingston for their kind permission to have this interface implemented and their offer to support us.

Having checked out the ticket for the first time, you have to run

./configure --enable-download-from-upstream-url
sage -i database_knotinfo

in order to have the databases installed. If you would like to run all relevant doctests during the installation, use:

sage -i -c database_knotinfo

CC: @miguelmarco @mkoeppe @kiwifb

Component: algebraic topology

Keywords: knot, link

Author: Sebastian Oehms

Branch: 9cde996

Reviewer: Matthias Koeppe

Issue created by migration from https://trac.sagemath.org/ticket/30352

soehms commented 3 years ago
comment:38

I make the following changes:

  1. I make the TestSuite functionality clearer: I rename the former methods _test_recover to user-visible boolean methods is_recoverable. I add a new _test_recover to class KnotInfoSeries which just runs the test using is_recoverable and the tester option max_samples.

  2. I add a new method is_unique to class KnotInfoBase which tests if a proper link is unique in the database under isotopy. This is needed in get_knotinfo and is_isotopic in order to give more reliable answers in unclear situations.

  3. I extend is_amphicheiral to proper links by internal tests (needed for get_knotinfo, as well).

  4. I remove option oriented from get_knotinfo and improve the quality of its answer in unclear situations.

  5. I add a warning to the docstring of method from_dowker_code of class Knot according to a hint of Chuck Livingston.
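
The split described in item 1 (a user-visible boolean check plus a test hook that samples entries) could look roughly like the following. This is a plain-Python sketch, not the code of the ticket: the class `SeriesSketch`, the string round trip and the `seed` parameter are hypothetical, and Sage's own TestSuite machinery is not used here.

```python
import random


class SeriesSketch:
    """Hypothetical stand-in for a series of database entries."""

    def __init__(self, braid_words):
        # Each entry is stored as a tuple of nonzero integers (a braid word).
        self._entries = list(braid_words)

    @staticmethod
    def _to_string(entry):
        return " ".join(str(i) for i in entry)

    @staticmethod
    def _from_string(text):
        return tuple(int(tok) for tok in text.split())

    def is_recoverable(self, entry):
        """User-visible check: does the entry survive a round trip?"""
        return self._from_string(self._to_string(entry)) == entry

    def _test_recover(self, max_samples=5, seed=None):
        """Test hook: run is_recoverable on at most max_samples entries."""
        rng = random.Random(seed)
        k = min(max_samples, len(self._entries))
        for entry in rng.sample(self._entries, k):
            assert self.is_recoverable(entry), entry


series = SeriesSketch([(1, 1, 1), (1, -2, 1, -2), (1, 1, 1, 1, 1)])
series._test_recover(max_samples=2, seed=0)
```

Keeping the boolean method public lets a user query a single entry, while the test hook stays cheap because it only looks at a bounded sample.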

soehms commented 3 years ago
comment:39

Replying to @tscrim:

Replying to @soehms:

I would have expected the database itself to have a consistency check like this.

Do you mean on the installation procedure with option -c? I could run tests with larger samples there in addition to the doctests, shall I?

The link with the database could somehow become broken, for example if they change the name of a column, so that the code breaks once you install the database. Granted, I think this is unlikely. Looking over the design a bit more, having a future developer should not naturally avoid the methods that differentiate between the two.

If the package is upgraded to a new database version, the patchbots would detect a changed column name that wasn't updated in the static dictionary. Unfortunately, they don't take package tickets. Indeed, for this package it would make sense if they did.
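
A check of this kind (column names in a static dictionary versus the header of the installed file) can be sketched as follows. The dictionary `STATIC_COLUMNS`, the column names and the `|` delimiter are assumptions for illustration; they are not taken from the package.

```python
import csv
import io

# Hypothetical static dictionary: internal key -> column name in the CSV.
STATIC_COLUMNS = {
    "name": "name",
    "crossing_number": "crossing_number",
    "pd_notation": "pd_notation",
}


def missing_columns(csv_text, expected, delimiter="|"):
    """Return expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    present = {col.strip() for col in header}
    return sorted(set(expected.values()) - present)


# Simulate an upstream rename of one column.
renamed = "name|crossings|pd_notation\n3_1|3|[[1,5,2,4],[3,1,4,6],[5,3,6,2]]\n"
print(missing_columns(renamed, STATIC_COLUMNS))
```

A non-empty result signals that the static dictionary and the database have drifted apart, which is the situation a bot with the package installed would catch.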

I agree that it is an advantage to have it tested. Although I do believe it should not be done locally within Sage's library but with a more robust testing framework. Yet, I believe that the benefits here clearly outweigh the costs.

What robust testing framework do you mean? If the database is not installed, all doctests of the ticket take less than five seconds (on an i5).

tscrim commented 3 years ago
comment:40

Replying to @soehms:

Replying to @tscrim:

Replying to @soehms:

I would have expected the database itself to have a consistency check like this.

Do you mean on the installation procedure with option -c? I could run tests with larger samples there in addition to the doctests, shall I?

No, I mean that there is a _test_database type method on the database class. The -c option is good, but I also think we shouldn't have that take too long.

I agree that it is an advantage to have it tested. Although I do believe it should not be done locally within Sage's library but with a more robust testing framework. Yet, I believe that the benefits here clearly outweigh the costs.

What robust testing framework do you mean? If the database is not installed, all doctests of the ticket take less than five seconds (on an i5).

There are patchbots/buildbots with the database that check that everything still works rather than a group of us with the database installed running tests after each beta release.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from e8ec73e to 440b5f3

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

440b5f3 30352: add _test_database and fix broken installation
soehms commented 3 years ago
comment:42

Replying to @tscrim:

Replying to @soehms:

No, I mean that there is a _test_database type method on the database class. The -c option is good, but I also think we shouldn't have that take too long.

I add such a _test_database method which tests a random sample of 20 links (by default). I marked the TestSuite doctest as long time. It takes less than 2 seconds if the database isn't installed and about 20 seconds otherwise (including loading of the database). In addition, I had to make some changes since the installation was broken (because of the usage of feature and UniqueRepresentation). Furthermore, I make the -c installation option run this long doctest as well.

There are patchbots/buildbots with the database that check that everything still works rather than a group of us with the database installed running tests after each beta release.

Are such patchbots/buildbots already possible?

tscrim commented 3 years ago
comment:43

Replying to @soehms:

Replying to @tscrim:

Replying to @soehms:

No, I mean that there is a _test_database type method on the database class. The -c option is good, but I also think we shouldn't have that take too long.

I add such a _test_database method which tests a random sample of 20 links (by default). I marked the TestSuite doctest as long time. It takes less than 2 seconds if the database isn't installed and about 20 seconds otherwise (including loading of the database). In addition, I had to make some changes since the installation was broken (because of the usage of feature and UniqueRepresentation). Furthermore, I make the -c installation option run this long doctest as well.

Thank you. I think that will help with the testing.

Dima, Miguel, anyone else have any additional comments before I set this to a positive review?

There are patchbots/buildbots with the database that check that everything still works rather than a group of us with the database installed running tests after each beta release.

Are such patchbots/buildbots already possible?

It is possible, but with all the different combinations, it is impossible to maintain as there is some different behavior depending on certain optional (experimental?) packages being installed. Although I would advocate for having at least one buildbot that has all optional (and possibly experimental) packages installed that runs tests.

mkoeppe commented 3 years ago
comment:44

Typo: KontInfo (3 times)

mkoeppe commented 3 years ago
comment:45

In terms of packaging, I think it would be much preferable to create a pip-installable package than to have a Sage specific upstream tarball.

See #30914 (Meta-ticket: Create upstream repositories, pip-installable packages for database packages)

mkoeppe commented 3 years ago
comment:46

Also, in SPKG.rst please follow the new format of the title from #29655

kiwifb commented 3 years ago
comment:47

I see that upstream stores the original files in Excel spreadsheets and they are then exported to csv with some substitutions in LibreOffice. That is not a sustainable approach unless you have some automated LibreOffice scripting.

I would recommend a Python script using pandas [not included in Sage] or an R script to perform such a task.

Aside from those workflow issues, some proper packaging as something pip-installable would indeed be nice. It should be relatively trivial if we only install the data.

dimpase commented 3 years ago
comment:48

I see that upstream stores the original files in Excel spreadsheets and they are then exported to csv with some substitutions in LibreOffice.

I am sure this explains KontInfo. (Pardon my French...)

soehms commented 3 years ago
comment:49

Replying to @mkoeppe:

Typo: KontInfo (3 times)

Thanks!

In terms of packaging, I think it would be much preferable to create a pip-installable package than to have a Sage specific upstream tarball.

I would like to do that, but I would prefer to do it in a follow-up ticket. Having never done this before, I will likely need advice and maybe help (plus time that I won't have until February). Are there any examples that show how to do this?

Also, in SPKG.rst please follow the new format of the title from #29655

I will do that!

soehms commented 3 years ago
comment:50

Replying to @kiwifb:

I see that upstream stores the original files in Excel spreadsheets and they are then exported to csv with some substitutions in LibreOffice. That is not a sustainable approach unless you have some automated LibreOffice scripting.

I know, this was only intended as a temporary solution (after failing to use pandoc). I reported some minor (and non-significant) issues upstream and waited for them to provide new files. If not, the existing tarball is good enough to start.

I would recommend a Python script using pandas [not included in Sage] or an R script to perform such a task.

Thanks for your suggestions. I will see which one is appropriate to implement such a script.

Aside from those workflow issues, some proper packaging as something pip-installable would indeed be nice. It should be relatively trivial if we only install the data.

Do you know examples that I can follow?

kiwifb commented 3 years ago
comment:52

Replying to @soehms:

Replying to @kiwifb:

I would recommend a Python script using pandas [not included in Sage] or an R script to perform such a task.

Thanks for your suggestions. I will see which one is appropriate to implement such a script.

Amusingly, I did the exact reverse for some people in the school of economics in my university. They had large csv files to download and they wanted to transform them into excel files - during the process we had to add some substitutions for missing values. They wanted the files in excel format as an input for STATA - I want to cry sometimes with some researchers.

Aside from those workflow issues, some proper packaging as something pip-installable would indeed be nice. It should be relatively trivial if we only install the data.

Do you know examples that I can follow?

Good one. We have identified that as a need for our data packages but I don't think we have done it with any. I cannot think of a python package that is a pure data load either. Possibly because people do not usually bother which is sad.

mkoeppe commented 3 years ago
comment:53

A few more comments.

  1. If you do git grep SAGE_ROOT src/sage, you will see that we have essentially eliminated use of this variable in the Sage library. This ticket reintroduces it, mixing Sage-the-distribution-specific code with Sage library code. That's not a good direction. In particular, Sage library code should not refer to SAGE_ROOT/build/pkgs/%s/package-version.txt at all - as this may not be available in downstream distribution packaging of Sage.

  2. The purpose of subclasses of sage.features.StaticFile is to provide an interface for discovering files in an installation. KnotInfoFilename.knots.sobj_path should use sage.features.databases.DatabaseKnotInfo to find the path, not the other way around.
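
The direction of dependency asked for in item 2 can be illustrated with a small plain-Python sketch. The classes `StaticFileSketch` and `DatabaseReaderSketch` are hypothetical stand-ins and do not reproduce Sage's actual `sage.features` API; the point is only that the consumer asks the feature object for the location instead of building a path from installation-specific variables.

```python
import os
import tempfile


class StaticFileSketch:
    """Hypothetical feature: locate a file in one of several directories."""

    def __init__(self, filename, search_path):
        self.filename = filename
        self.search_path = list(search_path)

    def absolute_path(self):
        """Return the first existing match, or None if nothing is found."""
        for directory in self.search_path:
            candidate = os.path.join(directory, self.filename)
            if os.path.isfile(candidate):
                return os.path.abspath(candidate)
        return None

    def is_present(self):
        return self.absolute_path() is not None


class DatabaseReaderSketch:
    """Consumer that asks the feature for the path; it builds none itself."""

    def __init__(self, feature):
        self._feature = feature

    def read_header(self):
        path = self._feature.absolute_path()
        if path is None:
            raise FileNotFoundError(self._feature.filename)
        with open(path, encoding="utf-8") as handle:
            return handle.readline().strip()


with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "knots.csv"), "w", encoding="utf-8") as f:
        f.write("name|crossing_number\n")
    feature = StaticFileSketch("knots.csv", ["/nonexistent", data_dir])
    header = DatabaseReaderSketch(feature).read_header()
    print(header)
```

With this layout a downstream distribution only has to adjust the search path of the feature; the library code that reads the data stays untouched.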

mkoeppe commented 3 years ago
comment:54

I also don't fully understand the purpose of the data transformation that is happening at installation time, reading the csv files and creating many sobj files, in functions such as _create_col_dict_sobj etc. Each of the little files is storing a dictionary mapping strings to strings as a pickle (sobj)?

soehms commented 3 years ago
comment:55

Replying to @kiwifb:

Replying to @soehms:

Amusingly, I did the exact reverse for some people in the school of economics in my university. They had large csv files to download and they wanted to transform them into excel files - during the process we had to add some substitutions for missing values. They wanted the files in excel format as an input for STATA - I want to cry sometimes with some researchers.

I'm also amazed that pure math data is stored in Excel spreadsheets, but missing values haven't been a problem here so far (with the exception of the trivial knot, which I had to deal with separately in some cases). But there was a misplaced character as well as trailing and leading whitespace (which of course can be handled using strip).
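
A minimal sketch of such a cleanup step, using only the standard library, might look like this. The `|` delimiter and the sample rows are assumptions for illustration and not taken from the actual files.

```python
import csv
import io


def clean_csv(raw_text, delimiter="|"):
    """Strip leading and trailing whitespace from every cell."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


raw = " name |crossing_number \n3_1 | 3\n"
cleaned = clean_csv(raw)
print(cleaned)
```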

The reason why I converted them to csv is that I found no Excel reader included in Sage. You mentioned that pandas isn't included in Sage, as well. So, how can I use it in spkg-install?

Good one. We have identified that as a need for our data packages but I don't think we have done it with any. I cannot think of a python package that is a pure data load either. Possibly because people do not usually bother which is sad.

I am open to try making a prototype. But that should be on a follow-up ticket.

soehms commented 3 years ago
comment:56

Replying to @mkoeppe:

A few more comments.

  1. If you do git grep SAGE_ROOT src/sage, you will see that we have essentially eliminated use of this variable in the Sage library. This ticket reintroduces it, mixing Sage-the-distribution-specific code with Sage library code. That's not a good direction. In particular, Sage library code should not refer to SAGE_ROOT/build/pkgs/%s/package-version.txt at all - as this may not be available in downstream distribution packaging of Sage.

  2. The purpose of subclasses of sage.features.StaticFile is to provide an interface for discovering files in an installation. KnotInfoFilename.knots.sobj_path should use sage.features.databases.DatabaseKnotInfo to find the path, not the other way around.

Sorry, that I didn't realize that! Of course I will correct it!

soehms commented 3 years ago
comment:57

Replying to @mkoeppe:

I also don't fully understand the purpose of the data transformation that is happening at installation time, reading the csv files and creating many sobj files, in functions such as _create_col_dict_sobj etc. Each of the little files is storing a dictionary mapping strings to strings as a pickle (sobj)?

Perhaps this is ridiculous given the size of these databases, but the purpose is to minimize the memory load. The user only needs a few of the 120 columns in the tables at a time (so why load them all each time).
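
The idea of one small file per column, loaded only on demand, can be sketched with the standard library as follows. This is not the code of the ticket: plain `pickle` files stand in for sobj files, and the class `LazyColumns`, the file naming and the sample table are hypothetical.

```python
import os
import pickle
import tempfile


def split_into_column_files(table, directory):
    """Write one pickle per column; table maps column name -> {row: value}."""
    for column, mapping in table.items():
        with open(os.path.join(directory, column + ".pickle"), "wb") as handle:
            pickle.dump(mapping, handle)


class LazyColumns:
    """Load a column from disk on first access and keep it cached."""

    def __init__(self, directory):
        self._directory = directory
        self._loaded = {}

    def column(self, name):
        if name not in self._loaded:
            path = os.path.join(self._directory, name + ".pickle")
            with open(path, "rb") as handle:
                self._loaded[name] = pickle.load(handle)
        return self._loaded[name]

    def loaded_columns(self):
        return sorted(self._loaded)


# Small hand-written table with string values, as in a CSV export.
table = {
    "crossing_number": {"3_1": "3", "4_1": "4"},
    "three_genus": {"3_1": "1", "4_1": "1"},
}
with tempfile.TemporaryDirectory() as tmp:
    split_into_column_files(table, tmp)
    db = LazyColumns(tmp)
    value = db.column("crossing_number")["4_1"]
    loaded = db.loaded_columns()
print(value, loaded)
```

Only the column that was actually requested ends up in memory; the trade-off is a larger number of small files created at installation time.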

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

ce3bfdd Merge branch 'u/soehms/knotinfo' of trac.sagemath.org:sage into knotinfo_30352
5844cae 30352: new tarball version 20210201
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 440b5f3 to 5844cae

soehms commented 3 years ago
comment:59

I do the following:

  1. Changes which I have announced in comments 49, 50 and 56 according to review.
  2. Upgrade to a new tarball since one of the two Excel files has changed (three additional columns and some whitespace removal).
  3. I add two more conversion methods (three_genus and signature) for link properties which have recently been of interest (#31188).
  4. One fix, since there was still a single link (L10a171_1_1_0) that caused TestSuite (with option max_samples=infinity) to fail for class KnotInfoDatabase.
  5. I put KnotInfo and KnotInfoSeries into the global namespace in case the database is installed.
soehms commented 3 years ago

Description changed:

--- 
+++ 
@@ -29,5 +29,5 @@

-Traball: https://github.com/sagemath/sage/files/ticket30352/knotinfo-20200813.tar.bz2.gz
+Traball: https://github.com/soehms/sagemath_knotinfo/blob/main/knotinfo-20210201.tar.bz2?raw=true

mkoeppe commented 3 years ago
comment:61

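Some suggestions for the upstream repository (in the direction of #30914):

  • Call it database_knotinfo, not sagemath_knotinfo -- it will be useful to a broader community (Python)
  • It's redundant to put versioned tarballs in a git repository - put instead the unpacked tarball there and have git take care of versioning
  • When done, I can send you a pull request that turns this repository into a pip-installable package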

soehms commented 3 years ago
comment:62

Replying to @mkoeppe:

Some suggestions for the upstream repository (in the direction of #30914):

  • Call it database_knotinfo, not sagemath_knotinfo -- it will be useful to a broader community (Python)
  • It's redundant to put versioned tarballs in a git repository - put instead the unpacked tarball there and have git take care of versioning
  • When done, I can send you a pull request that turns this repository into a pip-installable package

Sounds good! I hope the new repository is as expected. Please don't hesitate to make any changes you think are necessary! Many Thanks!

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 5844cae to 6e5e1e5

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

1721a44 Merge branch 'u/soehms/knotinfo' of trac.sagemath.org:sage into knotinfo_30352
6e5e1e5 adaption to new beta version and some typo and style fixes
mkoeppe commented 3 years ago
comment:64

Sage development has entered the release candidate phase for 9.3. Setting a new milestone for this ticket based on a cursory review of ticket status, priority, and last modification date.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

66a555b Merge branch 'u/soehms/knotinfo' of trac.sagemath.org:sage into knotinfo_30352
9cde996 30352: installation via PyPI
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 6e5e1e5 to 9cde996

soehms commented 3 years ago
comment:66

Replying to @mkoeppe:

  • When done, I can send you a pull request that turns this repository into a pip-installable package

I've now tried it on my own, and at least it is working. But since I needed a lot of trial-and-error cycles, I would appreciate it if you could have a look at it.

The current commit here concerns the adaptation to the pip-installable package. Furthermore, it contains adaptations to SnapPy 3.0.1 and adds some verbose messages to is_isotopic (following a suggestion of knot theorists from the University of Regensburg).

soehms commented 3 years ago

Description changed:

--- 
+++ 
@@ -15,19 +15,15 @@
 Many thanks to Allison Moore and Chuck Livingston for their kind permission to have this interface implemented and their offer to support us.

-
 Having checked out the ticket for the first time, you have to run

-make SAGE_SPKG="sage-spkg -o" database_knotinfo-clean build
+./configure --enable-download-from-upstream-url
+sage -i database_knotinfo


 in order to have the databases installed. If you like to run all relevant doctests on the installation use:

-make SAGE_SPKG="sage-spkg -o" SAGE_CHECK="yes" database_knotinfo-clean build
+sage -i -c database_knotinfo

-
-
-Traball: https://github.com/soehms/sagemath_knotinfo/blob/main/knotinfo-20210201.tar.bz2?raw=true
-
tscrim commented 3 years ago
comment:68

I don't think it is necessary to cache homfly_polynomial() as all of the key computational aspects are cached and so you don't cache the "same" object even though someone changed the variable name.

Other than that, I am happy with the current state of things. Does anyone else have any comments or suggestions?

mkoeppe commented 3 years ago
comment:69

Looks great to me!

mkoeppe commented 3 years ago

Reviewer: Matthias Koeppe

soehms commented 3 years ago
comment:70

I don't think it is necessary to cache homfly_polynomial() as all of the key computational aspects are cached and so you don't cache the "same" object even though someone changed the variable name.

I agree that this is not that effective. In general, my reasoning concerning caching was that with the database available you can easily have hundreds or thousands of invocations of any method. Anyway, I think it does no harm, and thus I would keep it for a start.
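
The trade-off discussed here can be made concrete with a toy example. It does not use the actual homfly_polynomial() code: `core_coefficients`, `polynomial_string` and the placeholder data are hypothetical, and `functools.lru_cache` stands in for whatever caching decorator is used in the ticket.

```python
from functools import lru_cache

CALLS = {"core": 0}


@lru_cache(maxsize=None)
def core_coefficients(link_name):
    """Stand-in for the costly computation; cached once per link."""
    CALLS["core"] += 1
    # Placeholder (exponent, coefficient) pairs, not a real invariant.
    return ((0, 1), (2, -1))


def polynomial_string(link_name, var="L"):
    """Cheap, uncached formatting step on top of the cached core."""
    terms = []
    for exponent, coeff in core_coefficients(link_name):
        terms.append(f"{coeff}*{var}^{exponent}")
    return " + ".join(terms)


@lru_cache(maxsize=None)
def cached_polynomial_string(link_name, var="L"):
    """Caching here too stores one entry per (link, variable name) pair."""
    return polynomial_string(link_name, var)


for variable in ("L", "x", "t"):
    cached_polynomial_string("K3_1", variable)

outer_entries = cached_polynomial_string.cache_info().currsize
print(CALLS["core"], outer_entries)
```

The costly part runs once thanks to the inner cache, while the outer cache ends up holding three entries that differ only in the variable name. Whether that extra memory matters depends on how many distinct variable names are actually used in practice.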

Many thanks to everyone who helped to have this interface realized!

vbraun commented 3 years ago

Changed branch from u/soehms/knotinfo to 9cde996

kiwifb commented 3 years ago
comment:72

Follow up at #31921.

kiwifb commented 3 years ago

Changed commit from 9cde996 to none

soehms commented 2 years ago
comment:73

Another follow up at #32760.