sanskrit-lexicon / csl-pywork

A template for creating pywork repository for each dictionary.

Unknown problem with mw.sqlite at Cologne #22

Closed funderburkjim closed 3 years ago

funderburkjim commented 3 years ago

After remaking mw at Cologne just now, the displays could only find words up to about Cid.

So, deva, hari, etc showed as not found.

There is no such problem with mw.sqlite in local installation,

so I just copied it up to Cologne server (using scp) and now the displays are fine.

Not sure what the scope or cause of this problem is.

So will be cautious about redoing any of the dictionaries at Cologne until the problem is solved.

funderburkjim commented 3 years ago

The problem is somehow related to L=75944.

Rename the bad sqlite file to mwbad.sqlite, then run the sqlite3 command interactively: sqlite3 mwbad.sqlite

select lnum,key,data from mw where lnum = 75944;

The result is ALL the rest of the records, all the way through the last record hveya 264901 !
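A rough way to spot this symptom without paging through the output is to look for rows whose data field is implausibly long. The following is only a sketch: the table and column names (mw, key, lnum, data) come from the query above, while the function name oversized_rows and the max_len threshold are made up for illustration.

```python
import sqlite3


def oversized_rows(conn, table="mw", max_len=100000):
    """Return (lnum, key, length) for rows whose data exceeds max_len characters.

    A row that has swallowed the rest of the dictionary shows up here with a
    length far beyond that of any real entry. max_len is an arbitrary,
    hypothetical threshold; adjust it per dictionary.
    """
    sql = (
        f"SELECT lnum, key, LENGTH(data) FROM {table} "
        "WHERE LENGTH(data) > ? ORDER BY lnum"
    )
    return conn.execute(sql, (max_len,)).fetchall()


if __name__ == "__main__":
    # mwbad.sqlite is the renamed file mentioned above.
    with sqlite3.connect("mwbad.sqlite") as conn:
        for lnum, key, length in oversized_rows(conn):
            print(lnum, key, length)
```

Comparing SELECT COUNT(*) against the number of lines in input.txt would be a second, equally cheap check.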

drdhaval2785 commented 3 years ago

Generation log?

funderburkjim commented 3 years ago

The mw.sqlite file is made from a file 'input.txt'. I copied that input.txt file down from Cologne to the local machine and remade a test mw.sqlite from it. It remade correctly.

In fact, that input.txt file downloaded from Cologne is identical to the one created on local installation.

So, the problem seems to be with sqlite3 on Cologne server.

I don't know what the 'Generation log' is?

funderburkjim commented 3 years ago

sqlite3 on cologne looks old:

$ sqlite3 --version
3.7.17 2013-05-20 00:56:22 118a3b35693b134d56ebd780123b7fd6f1497668

On local machine:

$ sqlite3 --version
3.29.0 2019-07-10 17:32:03 fc82b73eaac8b36950e527f12c4b5dc1e147e6f4ad2217ae43ad82882a88bfa6

That must be the problem.
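For anything built through Python rather than the sqlite3 command, the relevant version is that of the SQLite library linked into Python, which may differ from the command-line program. A small sketch for checking it; the helper name is_older_than is hypothetical.

```python
import sqlite3


def is_older_than(minimum):
    """True if the SQLite library linked into this Python is older than minimum.

    minimum is a (major, minor, patch) tuple, e.g. (3, 29, 0).
    This says nothing about the separate sqlite3 command-line program.
    """
    return sqlite3.sqlite_version_info < tuple(minimum)


if __name__ == "__main__":
    print("SQLite library used by Python:", sqlite3.sqlite_version)
    print("older than 3.29.0:", is_older_than((3, 29, 0)))
```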

Will send this to webmaster.

Probably another thing they forgot to update last week.

funderburkjim commented 3 years ago

Of course, they probably didn't know that we needed a modern sqlite3, and I forgot to mention this earlier to Patrick when the python problem was being solved.

gasyoun commented 3 years ago

Of course, they probably didn't know that we needed a modern sqlite

Exactly.

funderburkjim commented 3 years ago

Unfortunately, Webmaster says that we are using the version of sqlite3 supported by Redhat Linux 7.

So, we will have to find some workaround. At this stage, I don't know whether the problem is particular to some detail regarding record 75944 of MW, or whether the problem extends to other dictionaries.

gasyoun commented 3 years ago

sqlite3 support by Redhat Linux 7.

Understood. Can't we temporarily empty that ID?

funderburkjim commented 3 years ago

pwg example

I recreated pwg.sqlite, and it appears to be ok.

Based on date-time stamps, pwg.sqlite was previously created on Nov 20; and there is a difference in file-size:

This suggests to me there is some difference between the sqlite3 that was used on Nov 20 and the sqlite3 that is being used today. Both are based on the same pwg.xml, which was created on Nov 20; so shouldn't the file sizes be identical if the sqlite3 program of Nov 20 is the same as the sqlite3 program today?

pw example

This is similar to pwg example, in that

so pwg and pw seem ok

sqlite at Cologne seems to handle the construction properly.

But pd doesn't work!

This is another large file, with 107630 entries. But the sqlite construction at Cologne generates only 28506 entries! Same kind of problem as for MW.

The pd.sqlite file being used in displays was last created on Jan 24, 2020.

file size comparison for mw

I can only compare against the mw.sqlite created on local computer and uploaded to Cologne, since that is the display version currently at Cologne

total file size seems to matter.

The 2 problem files have total good file sizes of 87113728 (MW) and 89515008 (PD). These are somewhat larger than the next largest, PWG (83094528 ~).

So, could the problem be related to file size somehow?

gasyoun commented 3 years ago

So, could the problem be related to file size somehow?

Because of different versions of sqlite? That is not a big database for sqlite at all.

funderburkjim commented 3 years ago

a quotes problem

Here's what I think is going on.

In one record of mw.txt there is a string '" .

<L>75944<pc>406,3<k1>CidradarSana<k2>Cidra/—darSana<e>3B
<s>Cidra/—darSana</s> ¦ <lex>m.</lex> ‘=’ <s>°rSin</s> '" <ab>N.</ab> of a (<ns>Brāhman</ns> changed into a) <s1 slp1="cakra-vAka">Cakra-vāka</s1>, <ls>Hariv 1216</ls><info lex="m"/>
<LEND>

First, this does need correction. Nonetheless, when even a tiny 3-record input file is converted to sqlite, and the record above is the first record, the current Cologne version of sqlite3 treats all three records as if they were 1 record!

When that character string is removed and then the 3-lines are input into sqlite, then sqlite sees them as 3 records, as expected.

The closest reference I found to this quotes problem is https://forum.lazarus.freepascal.org/index.php?topic=36575.0

By contrast, on the local machine, using exactly the same problematic 3-line input file, sqlite3 loads it as a 3-record sqlite file, as it should.

So there is definitely what I would call a bug in the particular version of sqlite3 now on Cologne server. That '" string has been in that mw record at least as far back as 2018, and whatever version of sqlite3 was used back then on Cologne server did not have this problem. See this old version of the display at Cologne as proof: image

There are other problems besides '"

For example, a naked single quote or a naked double quote will generate the same problem as '". There is no way to know all the loci of problems in advance.

But thankfully these are rare. The only one currently known is one in PD, though I haven't yet identified exactly where the problem is there.

Next step is to change the problematic '" text in mw.txt, and confirm that it solves the problem at Cologne.

funderburkjim commented 3 years ago

It's getting late -- so I'll wait until tomorrow to carry the idea through.

gasyoun commented 3 years ago

Next step is to change the problematic '" text in mw.txt, and confirm that it solves the problem at Cologne.

A real detective story. And if we can't change the sqlite version, we can only alter the text itself.

funderburkjim commented 3 years ago

Have now regenerated mw at Cologne (after removing that '" in csl-orig/v02/mw/mw.txt). And there is no problem with mw.sqlite.

My current conclusion is that the problem is a bug in the particular version of sqlite3 that is currently installed in RHEL7.

I may try to make a small demo of the problem for the webmaster, so he can communicate to RedHat if the webmaster agrees that this is a bug that RedHat should fix.

funderburkjim commented 3 years ago

survey of other problem dictionaries

A program was used at Cologne (scans/a_ejf/sqliteprob/make_input.py, with script temp.sh) which searches for records in xxx.xml (xml form of some dictionary) with unbalanced double quotes.

make_input.py

Here are the dictionaries with one or more such instances:

      2 ap
      2 pd
     63 bop
      2 bur
      1 gra
      1 mci
      4 sch
     11 vcp
      1 wil

These will be examined individually, and changes made in each xxx.txt digitization to assure that the current Cologne sqlite3 program will process cleanly.
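The kind of check described above can be sketched roughly as follows. This is not the actual make_input.py, whose contents are not shown here; it assumes one record per line and treats a line as unbalanced when it contains an odd number of double-quote characters. The function name unbalanced_double_quotes is hypothetical.

```python
def unbalanced_double_quotes(lines):
    """Return (line_number, line) pairs for lines with an odd count of '"'.

    Assumption: each record occupies one line. Well-formed XML attributes
    contribute double quotes in pairs, so an odd total points at a stray
    quote in the text itself. Line numbers start at 1.
    """
    found = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.count('"') % 2 == 1:
            found.append((number, line))
    return found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_quotes.py xxx.xml
    with open(sys.argv[1], encoding="utf-8") as f:
        for number, line in unbalanced_double_quotes(f):
            print(number, line[:80])
```

The MW record above fits this test: its two attributes contribute four double quotes and the stray '" adds a fifth. A naked single quote would need a different test, since apostrophes occur legitimately in running text.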

gasyoun commented 3 years ago

I may try to make a small demo of the problem for the webmaster, so he can communicate to RedHat if the webmaster agrees that this is a bug that RedHat should fix.

Makes sense.

funderburkjim commented 3 years ago

I made a toy example for Webmaster and have sent to Patrick.

Since we now have a Python-based workaround for the xxx.sqlite files, this issue can be closed.
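For readers wondering what a Python-based load might look like, here is a minimal sketch; it is not the workaround actually adopted, which is not shown in this thread. It assumes each line of input.txt holds key, lnum and data separated by tabs, and the table layout is guessed from the query earlier in the thread. Because values are passed as bound parameters, quote characters in the data are stored as-is and cannot affect how records are split.

```python
import sqlite3


def build_sqlite(lines, conn, table="mw"):
    """Load tab-separated (key, lnum, data) lines into a fresh table.

    Assumptions (not confirmed by the thread): three tab-separated fields
    per line, and a table with columns key, lnum, data.
    Returns the number of records inserted.
    """
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(
        f"CREATE TABLE {table} "
        "(key VARCHAR(100) NOT NULL, lnum DECIMAL(10,2), data TEXT NOT NULL)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first two tabs only, so tabs inside data survive.
        key, lnum, data = line.split("\t", 2)
        rows.append((key, lnum, data))
    conn.executemany(
        f"INSERT INTO {table} (key, lnum, data) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    with open("input.txt", encoding="utf-8") as f:
        conn = sqlite3.connect("mw.sqlite")
        print(build_sqlite(f, conn), "records loaded")
        conn.close()
```

Checking that the returned count equals the number of lines in input.txt gives a quick guard against the merged-record symptom.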