drdhaval2785 commented 7 years ago

Issue #130 mandates that we create one uniform headword line.

I started with the ACC conversion. Jim gave some practical tips privately by email.

Some items in those tips are worth reproducing verbatim here for public reference.

drdhaval2785 commented 7 years ago

Hi, Dhaval - There are a couple of problems:

1. The page-column reference carries an unneeded '-1'. Example:

```
old: <L>1<pc>1-001,1-1<k1>aMSadaSA<k2>aMSadaSA
new: <L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
```

The ',1' indicates the first column of page 1-001 (first page of first volume). There is no need for the '-1'.

2. Your accwithmeta.txt skips lines at the beginning and end of acc.txt. These are the lines before the first headword and after the last headword. We want to have ALL the lines of acc.txt PLUS the meta-lines. So # of lines of accwithmeta.txt should = # of lines of acc.txt + NHW,
where NHW is # of headwords, which could be computed as # of lines in acchw0.txt or acchw2.txt.
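The line-count rule above can be checked mechanically. The following is only a sketch: the function names are hypothetical and are not part of the acc repository, and the arithmetic reflects the format at this point in the thread (any per-entry marker lines added later would change the expected total).

```python
def count_lines(path):
    """Count lines in a file, reading bytes so the encoding does not matter."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_counts(acc_path, hw_path, withmeta_path):
    """Check: lines(accwithmeta) == lines(acc) + NHW.

    NHW is taken as the number of lines in the headword file
    (acchw0.txt or acchw2.txt). Returns (ok, expected, actual).
    """
    nhw = count_lines(hw_path)
    expected = count_lines(acc_path) + nhw
    actual = count_lines(withmeta_path)
    return expected == actual, expected, actual
```

A call such as `check_counts("acc.txt", "acchw2.txt", "accwithmeta.txt")` would return `False` together with both totals whenever lines before the first headword or after the last one were dropped.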

Couple of choices (These are not exactly problems - but especially note the comment on invertibility).

1. If there is no homonym, the <h> element could be omitted from the meta-line.

Reasons in favor of omitting if no homonym:
- absence of <h> implies no homonym
- removes superfluous markup

Reason in favor of keeping even if no homonym:
- parsing of meta line would be marginally simpler.

2. The lines could be simplified to be more representative of the printed text. For ACC, this could be

old: {#aMSadaSA#}¦ jy. Rice 28.
new: {#aMSadaSA#} jy. Rice 28.

Reasons for removing the leading entry tag:
- The tag is not needed any more to recognize entry headwords, since this information is in the meta line.
- The new make_xml will not need to have logic to represent this.

Reason for removing ¦ (or replacing it with a space):
- Similarly, the broken bar ¦ is not needed to delineate the part of the entry from which the key2+homonym derive.
- I am less certain regarding ¦. If we ever wanted to reparse this first line of an entry to rederive the meta line, then this demarcation would be useful.

Let me expand on this argument regarding keeping ¦. One principle I always have in mind when making a big change to a digitization, such as this meta-line change, is a principle of invertibility. In fact, I often try to write a program which reconstructs the original from the modified version. If such a program is written, then we know for sure that we have lost no information by our modifications - our inverting program proves this.

Applying this to the present case, this would mean having a separate program invert_meta.py which would read accwithmeta.txt and construct acc_invert_meta.txt, with the objective that acc_invert_meta.txt should be absolutely identical to the original acc.txt: e.g. diff acc_invert_meta.txt acc.txt should show no difference.

It may or may not be essential to keep the ¦ for invertibility. If it is needed, we should keep ¦; if it is not needed, we should discard ¦.

BTW the 2nd problem above (dropping lines) would have been noticed by the invertibility discipline.

Jim

P.S. I haven't examined your code yet. Will do that after the final form is established.

drdhaval2785 commented 7 years ago

Principle of invertibility

Generation code - here Reversal code - here

Happy to report that there is no difference after reverse journey.
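For readers who want to repeat this kind of reverse-journey check without the Unix diff utility, here is a minimal sketch. The function name `first_difference` is hypothetical and is not taken from the repository. The comparison is byte-for-byte, so a change in line endings also counts as a difference.

```python
def first_difference(path_a, path_b):
    """Return None if the two files are byte-identical.

    Otherwise return the 1-based number of the first line that differs,
    which is also reported when one file ends before the other.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        lineno = 0
        while True:
            lineno += 1
            line_a, line_b = fa.readline(), fb.readline()
            if line_a != line_b:
                return lineno
            if not line_a:  # both files ended at the same point
                return None
```

With the file names used in Jim's tips, `first_difference("acc_invert_meta.txt", "acc.txt")` returning `None` would correspond to an empty diff.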

drdhaval2785 commented 7 years ago

<L>1<pc>1-001,1-1<k1>aMSadaSA. Remove -1.

Done.

drdhaval2785 commented 7 years ago

Your accwithmeta.txt skips lines at the beginning and end of acc.txt

Done. Corrected code to keep these items. Now there is no diff.

drdhaval2785 commented 7 years ago

If there is no homonym, the <h> element could be omitted from the meta-line.

I agree. Will reduce size and remove superfluous items. Coded accordingly.

Reason in favor of keeping even if no homonym: parsing of meta line would be marginally simpler.

Careful drafting of code may parse the meta line properly. Not much hassle.

drdhaval2785 commented 7 years ago

broken bar ¦

I feel we should keep this broken bar or some other unusual separator for sure.

I propose that we keep this broken bar / anything unique in its place. We should have a uniform separator for all 36 dictionaries.

gasyoun commented 7 years ago

Reason for removing ¦ (or replacing it with a space)

Makes sense.

I propose that we keep this broken bar / anything unique in its place. We should have a uniform separator for all 36 dictionaries.

This ¦ makes me want to scream when I see it in the code. If all will have it I'll die. Even @ will be better.

In fact, I often try to write a program which reconstructs the original from the modified version.

This is the lesson I learned when I converted non-Unicode devanagari in Adobe InDesign. I missed this step and the price was high.

funderburkjim commented 7 years ago

¦ makes me want to scream

I thought I heard an unusual sound recently -- glad I was too far away to get the full force of the scream :)

funderburkjim commented 7 years ago

@drdhaval2785 I noticed that you left the <HI>, so the first form of the entry (after the meta line) is

<HI>complex headword ¦ rest of first line

The conventions used by other dictionaries (like .<P>hw ¦) might be changed to this model.

drdhaval2785 commented 7 years ago

@drdhaval2785 I noticed that you left the <HI>, so the first form of the entry (after the meta line) is

<HI>complex headword ¦ rest of first line

I left it as it is because we had not decided on the headword-derived-part structure yet. I feel there are mainly two items here: key2 and hom. I propose we keep the opening part as {#key2#}hom¦ for all dictionaries. If we want to add some information like 100 in PW or some other dictionary, we can use {#key2#}hom[extraInfoField]¦. This should work as a generic solution.

Please revisit headwordforms.md and change the suggested form. I will make the necessary changes to the code.

funderburkjim commented 7 years ago

Please revisit headwordforms.md

? I thought we were talking about acc?

Your {#key2#}hom¦ is fine. Are you going to implement this for acc?

Let's defer discussion of the [extraInfoField] situation until it arises.

drdhaval2785 commented 7 years ago

? I thought we were talking about acc?

Yes, but the decisions we take should be generic, so that future implementations in other dictionaries don't pose much of a problem.

Your {#key2#}hom¦ is fine. Are you going to implement this for acc?

Yes.

Let's defer discussion of the [extraInfoField] situation until it arises.

I agree. It is just an add on if at all needed later.

funderburkjim commented 7 years ago

One detail re {#key2#}hom¦ form is that sometimes key2 will be presented in IAST (unlike acc, where key2 is presented in Devanagari). The {#...#} is appropriate for acc, since this is the universal coding in dictionaries for Devanagari transcoded into SLP1. For IAST key2, this {#...#} would not be quite right.

gasyoun commented 7 years ago

I agree. It is just an add on if at all needed later.

Dhaval's planning for future disasters is welcome.

{#...#} is appropriate for acc, since this is the universal coding in dictionaries for Devanagari transcoded into SLP1

Oh, never knew before. Is there a full list of what is what in coding?

funderburkjim commented 7 years ago

special end of entry code needed in xxx.txt

Our proposed structure for acc.txt now looks approximately like

[LINES BEFORE FIRST ENTRY]
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<h>    [META LINE FOR FIRST ENTRY]
{#aMSadaSA#}¦ REST OF LINE
[OTHER LINES FOR FIRST ENTRY, IF ANY]
<L>2<pc>....   [META LINE FOR 2ND ENTRY]
{#KEY2#}¦ REST OF LINE
[OTHER LINES FOR 2ND ENTRY, IF ANY]
...
...   REPETITION OF THIS PATTERN FOR ALL THE ENTRIES
...
<L>99999<pc>....   [META LINE FOR **LAST** ENTRY]
{#KEY2#}¦ REST OF LINE
[OTHER LINES FOR LAST ENTRY, IF ANY]
[LINES AFTER LAST  ENTRY]      <<<<<

The point of making this structure summary is to emphasize a minor deficiency:

There is no way to tell where the LAST entry ends.

For any entry except the last, we infer that the digitization lines for that entry include all lines up to and including the line before the next <L> meta line.

But for the last entry, there is no next <L> meta line. And the lines for the last entry typically do not include all lines following the last meta line.

Thus, we need an extra line with a marker indicating the end of the last entry.

I'm not quite sure what to call this marker, maybe <LEND> or maybe <L>END .

Although not applicable to acc, there is another circumstance where this entry-ending marker will be required. In a few dictionaries (VCP is one, there are probably a couple of others) there are some sections of the digitization which are between two entries but not part of an entry.
Identifying the end of the entry preceding such a section also will require some special marker; probably we could use the same marker as that for the last entry.

gasyoun commented 7 years ago

END

If a simple empty line at the end is not enough, let it be so.

funderburkjim commented 7 years ago

Is there a full list of what is what in coding?

This markup information for xxx.txt is in the xxx-meta.txt file.

As big changes occur (such as AS->IAST) for a dictionary, the markup information for new xxx.txt is being put in a separate xxx-meta2.txt file.

gasyoun commented 7 years ago

As big changes occur (such as AS->IAST) for a dictionary, the markup information for new xxx.txt is being put in a separate xxx-meta2.txt file.

I guess you are the only one who knows where to look for and in what cases. It's sooo complicated.

drdhaval2785 commented 7 years ago

Hi Jim, After seeing your comment regarding VCP etc., I feel like keeping <LEND> after each entry. It is currently kept like that. If you feel it is superfluous, I can remove it and keep it only at the end of the last entry.

There are other circumstances which also favour this choice, e.g. ACC has three books. There is an addenda and corrigenda part after each book.

gasyoun commented 7 years ago

ACC has three books. There is an addenda and corrigenda part after each book.

How will it help in such cases? PWG and PWK have 7 volumes, and we know after which L's the corrigenda start. But how does the <LEND> help?

drdhaval2785 commented 7 years ago

Hi @funderburkjim ,

acc.txt now has the headword metadata line in it. Typical lines:

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.<LEND>
<L>37<pc>1-001,2<k1>agastyaniGaRwu<k2>agastyaniGaRwu
{#agastyaniGaRwu#}¦ vocabulary. Oppert 7795.<LEND>

The subsequent programs for generating XML from it have been suitably modified, mainly hw0.py and make_xml.py.

I guess there are not many errors, because I have retained the old code as much as I could and Jim's python didn't spit out errors.

@funderburkjim the git pull may be relatively heavy this time, because the headword start lines and end lines have changed, and acchw0.txt / acchw1.txt / acchw2.txt are part of the git repository.

But subsequent changes may not be that heavy. Let us accept this as necessary evil.

gasyoun commented 7 years ago

Let us accept this as necessary evil.

Indeed.

funderburkjim commented 7 years ago

@drdhaval2785 I have a small comment regarding <LEND>, but want to be sure you agree with the changes I made today (commit 2db107fb7...) before making that comment.

@gasyoun The <LEND> provides an explicit end to an entry. That's why it is useful.

drdhaval2785 commented 7 years ago

@drdhaval2785 I have a small comment regarding <LEND>, but want to be sure you agree with the changes I made today (commit 2db107fb7...) before making that comment.

I agree with that comment except the line break additions. This gives a large diff file. Earlier version kept line breaks as near to each other as possible. So diff files were actually diff files.

funderburkjim commented 7 years ago

git and line endings

For work on Unix systems, such as Cologne server and dev server, text files have lines ending in '\n' (line feed).

Text files created by native Windows OS apps have lines ending in '\r\n'.

Our initial dev server repository was created from files copied from Cologne server to Dev server. All these files have Unix line endings.

When we clone this repository to our local windows machine, the files still have unix line endings.

However, when we work with files in this local repository, changes to the line endings can occur, thanks in part to git. For instance, I noticed that acc2.txt somehow appeared to have the windows line endings; and this caused a plain 'diff ../../acc2.txt acc_invert_meta.txt' to give a huge diff file.

An example shows how confusing the git system can be with regard to line endings. I recreated acc2.txt (as described elsewhere) with Unix line endings. At this point 'git status' showed that orig/acc2.txt had changed. When I then did 'git add acc2.txt', an odd thing happened: a 'git status' showed that there was nothing to commit! Weird and confusing.

I think we are both following the general recommendation of the git documentation by setting this config:

git config --global core.autocrlf true

The documentation describes this as:

Git can handle this by auto-converting CRLF line endings into LF when you add a file to the index, and 
vice versa when it checks out code onto your filesystem.

But in practice, it is confusing to know when Git 'checks out code onto your filesystem'.
All the discussion in stackoverflow and the git documentation shows that, despite Git's best intentions, the situation with line-endings is still confusing when working with Git projects in both unix and windows.

a possible solution for us

First, we should each set our local git config to the git recommendation:

git config --global core.autocrlf true

Second, we can replace the diff utility with a Python program.

python diff.py <file1> <file2>

Why diff.py is better than 'diff -w'

The unix diff utility with the '-w' option compares two files but ignores all white space. So if file1 and file2 were the same except for a possible difference in line endings, then 'diff -w' would show no difference in the two files. Also diff.py would show no difference. In this circumstance, the diff.py and 'diff -w' give the same answer, as desired.

However, 'diff -w' would also show no difference between these two one-line files:

FILE1
This is a line with spaces.

FILE2
Thisisalinewithspaces.

For our purposes, these two files should be counted as different; and diff.py does count them as different.
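The behaviour described for diff.py (ignore line endings, but nothing else) can be sketched as follows. This is not the actual pywork/diff.py, whose source is not shown in this thread; the function names are hypothetical.

```python
from itertools import zip_longest


def read_lines(path):
    """Read a file as bytes and split into lines, treating CRLF and LF alike."""
    with open(path, "rb") as f:
        data = f.read()
    return data.replace(b"\r\n", b"\n").split(b"\n")


def differing_lines(path1, path2):
    """Return the 1-based line numbers at which the two files differ.

    Only the line-ending style is ignored; spaces inside a line still count.
    """
    pairs = zip_longest(read_lines(path1), read_lines(path2))
    return [n for n, (a, b) in enumerate(pairs, start=1) if a != b]
```

Under this sketch, FILE1 and FILE2 above are reported as different at line 1, while a CRLF copy and an LF copy of FILE1 produce an empty list.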

unixify.py

It might be that from time to time we need to assure that a particular file has unix line endings.

python unixify.py <file>

This program reads the file into an array of lines, with the line endings stripped. Then it writes these lines back onto the file, with Unix line endings.
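A minimal sketch of that description, for illustration only (the real pywork/unixify.py is not reproduced here and may differ). It works on bytes so that the text encoding of the dictionary file is left untouched. One side effect to be aware of: a file whose last line had no line ending gains one.

```python
def unixify(path):
    """Rewrite the file in place so that every line ends in LF ('\\n')."""
    with open(path, "rb") as f:
        data = f.read()
    # bytes.splitlines() drops \n, \r\n and \r endings from each line.
    lines = data.splitlines()
    with open(path, "wb") as f:
        for line in lines:
            f.write(line + b"\n")
```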

Both unixify.py and diff.py are in the pywork directory of acc repository.

redo.sh has been changed to use diff.py instead of 'diff -w'.

readme.txt has been revised to be consistent with redo.sh.

These changes have been pushed to dev server (commit 8908d043...).

@drdhaval2785 agree with this solution for line-ending problem?

funderburkjim commented 7 years ago

small comment on `<LEND>`

Currently, the entry-ending tag <LEND> is placed at the end of the last line of the entry:

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.<LEND>

Since that tag is also a 'meta' tag (i.e., not part of the original text of the entry), would it be better for it to be on a separate line?

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.
<LEND>

@drdhaval2785 What do you think?

If you agree, why don't you make the necessary changes to meta_hw.py and invert_meta.py, and rerun redo.sh. Don't bother with changing hw2.py or hw0.py or make_xml.py -- I need to do some revisions of these (to make provision for alternate headwords), and will make the minor changes to handling new location of LEND at that time.
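With <LEND> on its own line, splitting the file into entries becomes a simple scan. The sketch below is an illustration under that assumption and is not the code of meta_hw.py or hw0.py; `split_entries` is a hypothetical name.

```python
def split_entries(lines):
    """Yield (meta_line, body_lines) for each entry.

    An entry starts at a line beginning with '<L>' and ends at a line that
    is exactly '<LEND>'. Lines outside any entry (before the first entry,
    after the last, or between two entries) are skipped.
    """
    meta, body = None, []
    for line in lines:
        if line.startswith("<L>"):
            meta, body = line, []
        elif line == "<LEND>" and meta is not None:
            yield meta, body
            meta = None
        elif meta is not None:
            body.append(line)
```

Because the end of each entry is explicit, material that sits between two entries but belongs to neither (the VCP situation mentioned earlier) is not absorbed into the preceding entry.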

gasyoun commented 7 years ago

would it be better for it to be on a separate line?

I guess so.

drdhaval2785 commented 7 years ago

@drdhaval2785 agree with this solution for line-ending problem?

I agree wholeheartedly.

drdhaval2785 commented 7 years ago

Since that tag is also a 'meta' tag (i.e., not part of the original text of the entry), would it be better for it to be on a separate line?

Yes

@drdhaval2785 What do you think? If you agree, why don't you make the necessary changes to meta_hw.py and invert_meta.py, and rerun redo.sh.

I will do so.

drdhaval2785 commented 7 years ago

@funderburkjim I made the changes in meta_hw.py and invert_meta.py, and reran redo.sh. I made just a single-line change in hw0.py (subtracting 1 from the line number of <LEND> to demarcate the entry ending).

The field is now open for you to make L-numbers immutable and prepare some methodology in algorithm by which alternate headwords / missed headwords etc can be added without altering L-numbers. I will wait for you to finish this before we move ahead.

funderburkjim commented 7 years ago

@drdhaval2785

After git pull origin master, I reran redo.sh and looked at accwithmeta.txt and it seems fine.

The git pull statement showed that also 'orig/acc.txt' was updated, but not orig/acc3.txt. Further examination also showed a modification in .gitignore to not track orig/acc3.txt.

acc3.txt needs to be tracked.

It is true that it can be recreated by copying accwithmeta.txt. However, unlike accwithmeta.txt, acc3.txt plays a role in subsequent updates, as seen in update.sh:

python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc.txt

I modified .gitignore accordingly.

In fact ALL the files in orig directory should be tracked. Theoretically, only the first form of the digitization (orig/acc_orig.txt) is required, and the others can be recreated by pywork/update.sh. However, it is safer to keep all the major intermediate forms present in orig directory.

recreation of acc3.txt and acc.txt from accwithmeta.txt

There are two steps, as can be seen by examination of pywork/update.sh. These steps should be done manually, by copy-pasting from update.sh to terminal session. These assume current directory of acc/pywork.

cp correctionwork/issue-cologne-130/accwithmeta.txt ../orig/acc3.txt
python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc.txt

funderburkjim commented 7 years ago

revise opinion on core.autocrlf setting: use 'input', not 'true'

When I added acc3.txt with git (git add acc3.txt), I got this confusing warning message:

warning: LF will be replaced by CRLF in orig/acc3.txt.
The file will have its original line endings in your working directory.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

Now, I don't want CRLF to be the line ending in acc3.txt, but rather the unix LF.

So, I did "git reset HEAD acc3.txt" to unstage acc3.txt.

This stackexchange article had an interesting comment:

If you’re on a Linux or Mac system that uses LF line endings, then you don’t want Git to automatically convert them when you check out files; however, if a file with CRLF endings accidentally gets introduced, then you may want Git to fix it. You can tell Git to convert CRLF to LF on commit but not the other way around by setting core.autocrlf to input:

$ git config --global core.autocrlf input

This setup should leave you with CRLF endings in Windows checkouts, but LF endings 
on Mac and Linux systems and in the repository.

That sounds like exactly what we want. So, I changed the core.autocrlf to 'input' as mentioned:

git config --global core.autocrlf input

Then, 'git add acc3.txt' made no complaints or warnings, as expected since we know from its construction as a copy of accwithmeta.txt that it has LF ('\n') for line endings.

So let's adopt input as our standard global configuration in Git Bash.

funderburkjim commented 7 years ago

more git growing pains

Pushing the above changes to dev server failed:

On local machine:

$ git push origin master
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 815 bytes | 0 bytes/s, done.
Total 6 (delta 3), reused 0 (delta 0)
error: Untracked working tree file 'orig/acc3.txt' would be overwritten by merge                     .
To [IPADDRESS]:/var/www/html/cologne/acc/
 ! [remote rejected] master -> master (Could not update working tree to new HEAD                     )
error: failed to push some refs to 'dpjf@[IPADDRESS]/var/www/html/cologne/acc

stackexchange to the rescue: clean dev server

Investigation led to a stackexchange discussion. In our case, we need to remove from dev server the untracked file acc3.txt.

First, a dry run ('-n') to show which files would be removed; the '-X' option restricts this to files that git ignores. This is via an ssh connection to dev server:

dpjf> git clean -n -X
Would remove acc3.txt
Would remove temp_acc_xxd.txt

This looks right -- we want to remove these from dev server repository. '-f' does the removal.

dpjf> git clean -f -X
Removing acc3.txt
Removing temp_acc_xxd.txt

No commit is needed here on dev server. This git status confirms:

dpjf> git status
On branch master
nothing to commit, working directory clean

Now, back on local machine -- the push works now:

$ git push origin master
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 815 bytes | 0 bytes/s, done.
Total 6 (delta 3), reused 0 (delta 0)
To IPADDRESS:/var/www/html/cologne/acc/
   d238289..e70fc29  master -> master

Whew!

gasyoun commented 7 years ago

Whew!

A hard day.

drdhaval2785 commented 7 years ago

Learning deeper git on the way. I also had to do some debugging when there was a file modified by both Jim and me. Had to manually resolve conflicts and commit. But with git, nothing is permanently lost. Some hassles, yes.

funderburkjim commented 7 years ago

Here are some notes written in a temporary issue regarding construction of the important acchw2.txt file.

Comments, suggestions solicited.

If these ideas still seem right tomorrow, I'll generate hw2.py code to implement the ideas.

funderburkjim commented 7 years ago

significance of `<e>` in meta line

For 7 entries, the construction of meta line has an extra parameter <e>2. First case:

<L>144<pc>1-004,1<k1>aGoraSivaAcArya<k2>aGoraSiva AcArya<e>2
{#aGoraSiva AcArya#}¦ Quoted in Śaivadarśana of Sa-
<>rvadarśanasaṃgraha. Oxf. 246^a.
<HI1>Kriyākramoddyota. Burnell 207^a.
<HI1>Tattvatrayanirṇayavyākhyā. Mysore 4.
<HI1>Tattvaprakāśikāvṛtti. Burnell 111^a. Śivatattva-
<>prakāśikāvṛtti. Burnell 111^a. Mysore 4.
<HI1>Tattvasaṃgrahalaghuṭikā. Burnell 111^a.
<HI1>Nādakārikāvṛtti. L. 1434. Burnell 111^a. 
[Page1-004-b+ 45]
<HI1>Paddhati. Poona 337.
<HI1>Sarvajñānottaravṛtti. Burnell 111^a.
<LEND>

@drdhaval2785 What is this about?

drdhaval2785 commented 7 years ago

@funderburkjim re <e>2: these are print errors where a period is placed after the headword. No other headword has a period after it. So to enable return journey without information loss, this extraInfo was added.

gasyoun commented 7 years ago

So to enable return journey without information loss, this extraInfo was added.

To make print errors legal?

drdhaval2785 commented 7 years ago

To make print errors legal?

Till we remove them, yes. Once the headwordwithmeta.txt is stable, we can process them as regular print error corrections and close the issue. A new CORRECTIONS issue would be in order.

gasyoun commented 7 years ago

Till we remove them, yes.

That's what I thought, ok.

funderburkjim commented 7 years ago

OK. This kind of irregularity will probably occur in many dictionaries as we convert to xxxwithmeta.txt form, and try for invertibility. For this particular acc dictionary, the irregularity was simple enough to handle as you did. Another solution method would be to hard code quirks in the invert.py program for a particular dictionary.

funderburkjim commented 7 years ago

Revisions to redo_hw and make_xml

During implementation of the ideas described in the temporary issue, several changes were made. These changes make the system both conceptually simpler and slightly more general.

acc.txt + acc_hwextra.txt --> acchw.txt

The hw.py program combines the meta-lines of acc.txt and the lines of acc_hwextra.txt (for the alternate (and sub) headwords) into the acchw.txt file. This file was not part of the prior system. For the sake of this description, let's call either one of these two kinds of lines a general meta line.

To each general meta line is added the two linenum1/2 fields (indicating the line range of acc.txt corresponding to the entry designated by the general meta line), and the result becomes a line of acchw.txt.

Example of acchw records:

From first acc meta line
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<ln1>3<ln2>5
From an alternate headword line from acc_hwextra.txt:
<L>12.1<pc>1001,1<k1>akzacaraRa<k2>akzacaraRa<type>alt<LP>12<k1P>akzapAda<ln1>39<ln2>42

fields of meta lines of acc.txt

required fields: L, pc, k1, k2
optional field: hom

fields of acc_hwextra.txt

required fields: L, pc, k1, k2, type, LP, k1P
optional field (?): hom -- not sure if this is ever required for an alternate or sub headword

fields of an acchw.txt record :

required fields
- L Cologne record identifier
- pc page-col reference to scanned image
- k1 key1. The headword spelling, in slp1 coding for Sanskrit headwords
- k2 key2. The original headword spelling, either in slp1 or IAST
- ln1 linenum1 = line number of first line of acc.txt for the associated entry [a 'metaline']
- ln2 linenum2 = line number of last line of acc.txt for the associated entry []
optional field for homonym
- hom The homonym number (usually a digit). Not present in acc dictionary
required fields for an alternate headword
- type : alt (currently). Anticipate also 'sub' for subheadword. Maybe other types not yet developed.
- LP : L-code of 'parent' headword.
- k1P : key1 of 'parent' headword

`<key>val` sequence for format of acchw.txt records

This is a more convenient format than a colon-delimited value sequence such as used by acchw2.txt and acchw0.txt. Optional fields may just be omitted. The presence of <key> reminds us of what the field represents.
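To illustrate why the `<key>val` form is easy to work with, here is a small parsing sketch. It is not the code of hwparse.py; the names are hypothetical, and it assumes that a field value never contains the '<' character.

```python
import re

# One field is '<key>' followed by everything up to the next '<'.
FIELD_RE = re.compile(r"<([A-Za-z0-9]+)>([^<]*)")


def parse_meta(line):
    """Parse a '<key>val<key>val...' line into a dict, keeping field order."""
    return dict(FIELD_RE.findall(line))


def format_meta(fields):
    """Inverse of parse_meta: rebuild the line from the dict."""
    return "".join("<%s>%s" % (key, val) for key, val in fields.items())
```

Omitted optional fields are simply absent from the dict, so `parse_meta(line).get("hom")` gives `None` for a record without a homonym, and `format_meta(parse_meta(line))` reproduces the line, in keeping with the invertibility principle discussed earlier.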

acchw.txt -> acchw2.txt and acchw0.txt

Since acchw2.txt is used elsewhere (notably in construction of sanhw1 - headwords for all dictionaries), it is convenient to maintain this file in its current format. Each line of acchw2.txt is easily constructed from acchw.txt as a colon-delimited sequence of certain required fields:

pc:k1:ln1,ln2:L

acchw0 is constructed similarly, the difference being that 'k2' (key2) is used instead of 'k1' (key1).
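As a sketch of that construction (hypothetical function name, with the record assumed to be already available as a dict of its fields):

```python
def colon_line(rec, key="k1"):
    """Build a 'pc:key:ln1,ln2:L' line from one acchw.txt record.

    key="k1" gives the acchw2.txt form; key="k2" gives the acchw0.txt form.
    """
    return "%s:%s:%s,%s:%s" % (rec["pc"], rec[key], rec["ln1"], rec["ln2"], rec["L"])
```

For the first acchw record shown above, this gives `1-001,1:aMSadaSA:3,5:1`.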

hwparse.py

This program contains a class HW for records of acchw.txt and a function which reads acchw.txt and parses the lines into a sequence of HW objects. Each HW instance object allows access to the fields of the acchw line as instance attributes (e.g. obj.k1 for the 'k1' (key1) field). Alternatively, a dictionary form of access is possible (e.g. obj.d['k1']). Missing optional fields are given the Python None value.

HW class also contains some class variables, that may be useful in subsequent programs. E.g., some of these are used in the make_xml.py program.

Ldict - a dictionary associating each L-code with the corresponding acchw.txt record
Sanskrit - A boolean flag: Are the headwords Sanskrit?
dictcode - e.g. 'acc' - the Cologne dictionary identifier
hwrec_keys = the possible keys of the acchw.txt records
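A rough sketch of what such a class could look like is given below. It is an illustration of the description, not the actual hwparse.py. The exact key list (including the spelling 'hom' for the homonym field), the `read_hw` helper and the way Ldict is filled are assumptions.

```python
import re


class HW:
    # Class variables, shared by all records (values here are assumptions).
    dictcode = "acc"
    Sanskrit = True
    hwrec_keys = ("L", "pc", "k1", "k2", "hom", "type", "LP", "k1P", "ln1", "ln2")
    Ldict = {}

    def __init__(self, line):
        found = dict(re.findall(r"<([A-Za-z0-9]+)>([^<]*)", line))
        # Dictionary access: obj.d['k1']; missing optional fields are None.
        self.d = {key: found.get(key) for key in self.hwrec_keys}
        # Attribute access: obj.k1
        for key, val in self.d.items():
            setattr(self, key, val)
        HW.Ldict[self.L] = self


def read_hw(lines):
    """Parse an iterable of acchw.txt lines into a list of HW objects."""
    return [HW(line.rstrip("\r\n")) for line in lines if line.strip()]
```

A later program could then look up a parent record with `HW.Ldict[rec.LP]` when `rec.type` is not `None`.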

make_xml.py

This program is conceptually much simpler than before. It uses both acc.txt and acchw.txt as inputs. It constructs an xml structure representing all the entries of the dictionary, one entry for each acchw.txt record. It makes provision for the 'hom' element. The next comment discusses the current choice in the handling of alternate headwords in xxx.xml.

gasyoun commented 7 years ago

optional field ?: hom -- not sure if this ever required for an alternate or sub headword

Guess never.

k2 key2. The original headword spelling, either in slp1 or IAST

And there will be no meta code to know if it's SLP1 or IAST?

Sanskrit - A boolean flag: Are the headwords Sanskrit?

Mark some as Prakrit in future?

funderburkjim commented 7 years ago

alternate headwords in acc.xml

Currently, there is just one alternate headword implemented for acc; to get more we have to generate additional records of hwextra/acc_hwextra.txt. That's a separate consideration.

Here are the xml records for the primary and alternate headwords:

PRIMARY (akzapAda)
<H1><h><key1>akzapAda</key1><key2>akzapAda</key2></h>
 <body>
   <s>akzapAda</s>  or <s>akzacaraRa,</s> a name of Gautama, the philo- <br/>sopher,  Hall p. 20.
  </body>
  <tail><L>12</L><pc>1-001,1</pc></tail></H1>
ALTERNATE (akzacaraRa)
<H1><h><key1>akzacaraRa</key1><key2>akzacaraRa</key2></h>
<body>
   THIS LINE IS ADDITIONAL FOR ALTERNATE
  <alt><s>akzacaraRa</s> is an alternate spelling of <s>akzapAda</s></alt> 
  <s>akzapAda</s>  or <s>akzacaraRa,</s> a name of Gautama, the philo- <br/>sopher, Hall p. 20.</body>
  <tail><L>12.1</L><pc>1-001,1</pc>
    THIS LINE IS ADDITIONAL FOR ALTERNATE
    <hwtype n="alt" ref="12"/>
   </tail></H1>

Two elements are special for alternate (or sub) headwords:

<alt> in body element. This is a simple description of the fact that we have an alternate spelling.
- If this were a sub-headword, the wording would be <alt><s>akzacaraRa</s> is a sub-headword of <s>akzapAda</s></alt>
<hwtype n="TYPE" ref="LP"/> This element in the tail indicates the type of the alternate ('alt' or 'sub' ) and the Cologne record identifier of the parent primary entry.
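The two special elements can be generated from the fields of the alternate headword record. The sketch below only illustrates the shape of the output shown above; it is not taken from make_xml.py, and the function names and the phrase table are assumptions.

```python
from xml.sax.saxutils import escape, quoteattr

# Wording used inside <alt>, keyed by the 'type' field of the record.
ALT_PHRASES = {
    "alt": "is an alternate spelling of",
    "sub": "is a sub-headword of",
}


def alt_element(k1, parent_k1, hwtype="alt"):
    """Build the <alt> element placed at the start of the body."""
    return "<alt><s>%s</s> %s <s>%s</s></alt>" % (
        escape(k1), ALT_PHRASES[hwtype], escape(parent_k1))


def hwtype_element(hwtype, parent_L):
    """Build the <hwtype> element placed in the tail."""
    return "<hwtype n=%s ref=%s/>" % (quoteattr(hwtype), quoteattr(parent_L))
```

Because the parent spelling (k1P) is carried in the record itself, the `<alt>` text can be written without any lookup of the parent entry.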

the xml form used for skd alternates

Here is the current skd.xml record for the alternate headword 'kuveraH':

<H1><h n="alt"><key1>kuveraH</key1><key2>kube(ve)raH</key2></h>
<body ref="8094"></body>
 <tail><L>8094.01</L><pc>2-144</pc></tail></H1>

an attribute of <h> indicates the type of alternate
an attribute of <body> indicates the cologne record id of the parent record.
the <body> element has no text.

Any downstream program using skd.xml must contain logic to interpret these special attributes. In particular, this somewhat complex interpretation logic is present in the disp.php program which generates the basic display of records for skd.

A downstream user of acc.xml will still need logic to deal with the <alt> element, but this logic should be simpler (e.g., no need to make an extra search to know that the ref value of '8094' corresponds to the 'kuberaH' spelling).

Time will tell which approach is better. Currently, I prefer the acc.xml approach, which was originally suggested by Dhaval.

funderburkjim commented 7 years ago

current status

The changes described in the prior two comments have been pushed to dev server in commit a19c3335.

The next step will be to modify disp.php to handle the <alt> and <hwtype> elements, and also the <div n="3"> case that Dhaval introduced.

@drdhaval2785 Do you want to give these changes to web/webtc/disp.php a try?

funderburkjim commented 7 years ago

And there will be no meta code to know if it's SLP1 or IAST?

Good observation - as currently key2 is only in IAST for some dictionaries. Maybe the right place to put this dictionary-meta piece of information is as an HW class variable within hwparse.py.

Prakrit

Have not thought about this. Are there examples? Currently the headwords of all dictionaries are (I think) either Sanskrit words or English words. The use thus far of this flag is to know how to render key1. If the Sanskrit flag is True, then we use the fact that the key1 field is always SLP1, regardless of whether the dictionary shows Devanagari or IAST; thus we can use transcoding to render key1 in Devanagari, IAST or whatever display the user chose. If the Sanskrit flag is False, then we render key1 'as is'. For instance, if we had a Russian-Sanskrit dictionary, then we would set the Sanskrit flag to False, since the headwords would be in Russian.

gasyoun commented 7 years ago

currently key2 is only in IAST for some dictionaries

What is the need for it? Or why some are SLP1? Does not make sense to me - the diversity.

Are there examples?

There were, but lost again.

funderburkjim commented 7 years ago

Does not make sense to me - the diversity.

Good point. I'm not sure how to resolve. When we are next working on such a dictionary (one with IAST headwords in print), we should be alert. Maybe a solution will present itself when we see the exact details in such a dictionary.

funderburkjim commented 7 years ago

@drdhaval2785 I got an email of a comment 'acc.xml doesn't have akzacaraRa entry in it.'

but don't see it in the comments now -- presume you solved this problem (by update_sync.sh)?

sanskrit-lexicon / COLOGNE

Specific issues for converting acc.txt to have identical headword line #133
