sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

generate.py modifications #25

drdhaval2785 closed this issue 8 years ago

drdhaval2785 commented 8 years ago

As is evident from recent submissions for 'ls', some corrections need to be applied not just to a single entry but to multiple entries. Right now we use key1 and l-numbers to identify the lines. @funderburkjim needs to make generate.py versatile enough to read all lines from dict.txt and make the change in every applicable place.

funderburkjim commented 8 years ago

The approach I've taken with #23 is a program, prepchange.py (= Prepare Change), which reads a list of regular expression substitutions and generates the corresponding standard-form change records. The normal generate.py for these pw 'ls' corrections then operates on the constructed change records.

For instance, the current input file (for issues #23, #24) is:

@
¯HEM[.] *¯PAR[.]@¯HEM.PAR.@t@one reference
¯DAMAJANTIK[.]@¯DAMAJANTI7K.@t@ new reference
¯C2ILA7N5KA@¯C2I7LA7N5KA@t@
¯C2I7LA7N5RA@¯C2I7LA7N5KA@t@

The first parameter is treated as a regular expression.

Here are the first few lines of the generated change.txt file:

¯HEM.¯PAR.@DagadDagiti@@54344:¯HEM.PAR.:t:one reference
¯HEM.¯PAR.@Dana@@54356:¯HEM.PAR.:t:one reference
¯HEM.¯PAR.@Danagiri@@54364:¯HEM.PAR.:t:one reference
¯HEM.¯PAR.@Danadeva@@54387:¯HEM.PAR.:t:one reference

Note that I put the empty string for key2, since that field is not used by generate.py.
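As a rough illustration of the record layout, here is a minimal Python sketch. The field order (old string, key1, key2, L-number, new string, type, comment) is inferred from the sample lines above; `format_change_record` is a hypothetical helper, not a function from prepchange.py or generate.py.

```python
def format_change_record(old, key1, lnum, new, kind, comment, key2=""):
    """Build one change record in the layout inferred from the sample:
    old@key1@key2@lnum:new:kind:comment
    key2 defaults to the empty string, as in the generated file.
    """
    return "{}@{}@{}@{}:{}:{}:{}".format(old, key1, key2, lnum, new, kind, comment)


# Rebuild the first sample record from its parts.
record = format_change_record(
    "¯HEM.¯PAR.", "DagadDagiti", 54344, "¯HEM.PAR.", "t", "one reference"
)
print(record)
```

The empty key2 is what produces the doubled `@@` in each record.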

Other ways of generating change records will probably be needed; they will be devised as the need arises.

funderburkjim commented 8 years ago

A change to the form of the input file:

The 1st parameter will be treated as a string, NOT a regular expression.

I started with a regular expression, since this was an easy way to catch the spaces between ¯HEM. and ¯PAR.. However, in a regular expression the period character is a wild-card that matches any character, and the strings we are interested in often contain periods. So, to use regular expressions AND avoid wild-card matching of periods, we would need to enclose each period in the first parameter in brackets, [.]. But this is quite awkward.
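The period problem is easy to reproduce with Python's standard `re` module. This is only a demonstration, not code from prepchange.py; the string `¯HEMXPAR.` is made up to show a false match.

```python
import re

# Hypothetical text: one made-up near-miss and one real abbreviation.
text = "¯HEMXPAR. and ¯HEM.PAR."

# Unescaped: each '.' is a wild-card, so the made-up form matches too.
loose = re.findall("¯HEM.PAR.", text)

# Bracketed periods, as in the original input file: literal match only.
bracketed = re.findall("¯HEM[.]PAR[.]", text)

# re.escape builds an equivalent literal pattern automatically.
escaped = re.findall(re.escape("¯HEM.PAR."), text)

print(loose)
print(bracketed)
print(escaped)
```

Plain string matching avoids the issue entirely, at the cost of needing one input line per spacing variant.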

So the input file shown above now has two lines for the HEM.PAR. example, one without a space and one with (I checked that there are no instances with more than one space):

@
¯HEM.¯PAR.@¯HEM.PAR.@t@one reference
¯HEM. ¯PAR.@¯HEM.PAR.@t@one reference
¯DAMAJANTIK.@¯DAMAJANTI7K.@t@ new reference
¯C2ILA7N5KA@¯C2I7LA7N5KA@t@
¯C2I7LA7N5RA@¯C2I7LA7N5KA@t@

Incidentally, the first line of the file sets the field separator character (@) for the remaining lines. @ will probably never occur in any of the strings we are searching for, but if some cases need another field separator, we could use a different one in the file for those cases.
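A minimal sketch of reading such a file, assuming the whole first line is the separator and every later line has four fields (old, new, type, comment). `parse_substitutions` and `apply_rules` are hypothetical names, and the real prepchange.py may handle this differently.

```python
def parse_substitutions(lines):
    """Return (old, new, kind, comment) tuples from the input-file lines.
    Assumption: the first line holds only the field separator."""
    lines = [ln.rstrip("\n") for ln in lines]
    sep = lines[0]
    rules = []
    for line in lines[1:]:
        if not line:
            continue  # skip blank lines
        old, new, kind, comment = line.split(sep, 3)
        rules.append((old, new, kind, comment))
    return rules


def apply_rules(text, rules):
    """Apply each rule as a plain string replacement (no regex)."""
    for old, new, _kind, _comment in rules:
        text = text.replace(old, new)
    return text


sample = [
    "@",
    "¯HEM.¯PAR.@¯HEM.PAR.@t@one reference",
    "¯HEM. ¯PAR.@¯HEM.PAR.@t@one reference",
    "¯C2ILA7N5KA@¯C2I7LA7N5KA@t@",
]
rules = parse_substitutions(sample)
fixed = apply_rules("see ¯HEM. ¯PAR. here", rules)
print(fixed)
```

Because matching is by plain string, the with-space and without-space variants each need their own rule, which is why the input file carries both lines.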

gasyoun commented 8 years ago

Getting complicated, but so be it. Hope @thomasincambodia is happy to see where it heads.