sanskrit-lexicon / csl-pywork

A template for creating pywork repository for each dictionary.
3 stars 1 forks source link

Use python to make xxx.sqlite #24

Open funderburkjim opened 3 years ago

funderburkjim commented 3 years ago

This documents a change in the way the xxx.sqlite database files are constructed. The motivation was the fact that, with at least some versions of the sqlite3 binary under Red Hat Linux 7, there appears to be a bug.

funderburkjim commented 3 years ago

Discovery of the sqlite3 bug is discussed in #22 .

The consequences of the bug can be seen in a temporary display of the Wilson dictionary that was constructed with the sqlite3 executable. The example shows that 'seva' is not found:

image

Technical note for Dhaval

The usual way that the Wilson dictionary displays are reconstructed at Cologne is to run, in csl-pywork/v02, the command sh generate_dict.sh wil ../../WILScan/2020/

For testing purposes, the target directory can be something other than ../../WILScan/2020/`

For instance, sh generate_dict.sh wil tempwil was run (while the old sqlite3 method was still current) and it generated the display illustrated in the image, where 'seva' was not found in the Wilson database.

funderburkjim commented 3 years ago

Python construction of wil.sqlite

The Wilson displays have now been constructed using the python program, in the usual way. We see that 'seva' is back as it should be:

image

funderburkjim commented 3 years ago

Notes on sqlite.py

The code used is sqlite.py

It is run (see redo.sh by (for wilson example) python sqlite.py ../wil.xml wil.sqlite.

Python (versions 2 or 3), have a built-in 'sqlite3' module. And our usage is fairly simple.

batch size 10000

The trickiest part is the choice to insert rows in batches. We have two competing factors regarding insertion of rows:

A crude timing test was run with Wilson to try various batch sizes for inserting the 44578 rows of Wilson dictionary:

This suggested that 10000 lines is the best default choice.

Reconstructing mw (with 287443 rows) takes, on Cologne server, about 17 seconds. This compares to about 11 seconds with the previous sqlite3 and php code.

These speeds for the python version are acceptable.

funderburkjim commented 3 years ago

old version kept

The prior php/sqlite3 method for constructing xxx.sqlite was retained in a separate, but currently unused, folder : sqlite_20201216.

funderburkjim commented 3 years ago

Other uses of sqlite3

There are various other sqlite databases used in the Cologne displays. Since all of these use a table definition file whose name ends in .sql, a list of these can be found.

# in csl-pywork/v02:
 find . -name "*.sql"
./distinctfiles/bur/pywork/burab/burab.sql
./distinctfiles/cae/pywork/caeab/caeab.sql
./distinctfiles/lan/pywork/lanab/lanab.sql
./distinctfiles/mw/pywork/mwab/mwab.sql
./distinctfiles/mw/pywork/mwauth/mwauthtooltips.sql
./distinctfiles/mw/pywork/mwkeys/mwkeys.sql
./distinctfiles/mw/pywork/westmwtab/westmwtab.sql
./distinctfiles/mw/pywork/whitmwtab/whitmwtab.sql
./distinctfiles/pw/pywork/pwab/pwab.sql
./distinctfiles/pw/pywork/pwauth/pwbib.sql
./distinctfiles/pwg/pywork/pwgab/pwgab.sql
./distinctfiles/pwg/pywork/pwgauth/pwgbib.sql
./distinctfiles/stc/pywork/stcab/stcab.sql

It would be good, but currently not believed to be essential, to use a similar python program in all these cases. The program would be a close variant of the sqlite.py discussed above; I am not sure whether one variant would work for all the above -- I suspect there may be minor differences among the assumed format of the input files among some of the database creation steps in the above directories; and these differences would need to be taken into account.

AFAIK, All the sqlite database creations are done by code in this csl-pywork repository.

funderburkjim commented 3 years ago

I think that the current REL7 sqlite3 correctly constructs all these extra tables.

Nonetheless, it would be good to convert these constructions to use Python, as this would reduce the dependency by one (namely sqlite3 executable).