sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

eascii: extended ascii in cologne dictionaries #368

Open funderburkjim opened 3 years ago

funderburkjim commented 3 years ago

The digitization files xxx.txt (of csl-orig repository) contain many characters besides the standard ascii characters. These are represented in the utf-8 encoding. We use the term 'extended ascii' to refer to any character other than a standard ascii character.
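A minimal sketch of this definition in Python, assuming 'standard ascii' means code points 0 through 127. The function names `is_extended_ascii` and `extended_chars` are hypothetical and are not taken from the repository.

```python
def is_extended_ascii(ch):
    """Return True when ch lies outside the standard ascii range (code point > 127)."""
    return ord(ch) > 127


def extended_chars(text):
    """Return the distinct extended ascii characters in text, sorted by code point."""
    return sorted({ch for ch in text if is_extended_ascii(ch)})
```

Reading a digitization file with `open(path, encoding="utf-8")` yields text that can be passed to `extended_chars` directly.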

For various reasons, it is useful to survey which extended ascii characters appear in which dictionaries. The 'eascii' directory in this repository aims to provide such survey information.

funderburkjim commented 3 years ago

There are two kinds of summaries: by dictionary and by character.

by dictionary

For each dictionary, the survey lists every extended ascii character that occurs in the dictionary. The eadata directory contains one file per dictionary. For instance, ea_acc.txt lists all extended ascii characters occurring in the acc dictionary. The lines are ordered by Unicode code point, and show

One detail is that only lines of the file occurring within an entry are considered; excluded are lines representing front matter, appendices, etc. that have not been marked as entries. By convention, an entry begins with a 'metaline' (starts with <L>) and ends with the line <LEND>, and includes all lines between these two lines.
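The entry convention can be sketched roughly as follows. This is an illustration only, not the script used in the eascii directory: the function name `count_extended_in_entries` is hypothetical, and counting the metaline and the <LEND> line as part of the entry is an assumption.

```python
from collections import Counter


def count_extended_in_entries(lines):
    """Count extended ascii characters, looking only at lines inside entries.

    An entry starts at a metaline (a line beginning with <L>) and ends at
    the line <LEND>. Lines outside entries (front matter, appendices, etc.)
    are skipped.
    """
    counts = Counter()
    in_entry = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("<L>"):
            in_entry = True
        if in_entry:
            # Assumption: the metaline and <LEND> themselves are scanned too.
            counts.update(ch for ch in line if ord(ch) > 127)
        if line.startswith("<LEND>"):
            in_entry = False
    return counts
```

Iterating over `sorted(counts)` then visits the characters in Unicode code point order, for example when reading lines from `open("acc.txt", encoding="utf-8")`.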

In the csl-orig repository, there is an xxx_meta2.txt file for each dictionary, and one component of xxx_meta2.txt is a listing of the extended ascii characters. For instance, compare ea_acc.txt with acc_meta2.txt. We should strive to have consistency between these two ea lists.

all_ea.txt contains all the individual dictionary files.

funderburkjim commented 3 years ago

by character summary

For some purposes, it is useful to see all the dictionaries which contain a particular character. The 'easummary' files serve this purpose.
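Conceptually, a by-character summary is the inverse of the by-dictionary listing. A rough sketch, assuming the per-dictionary results are already available as a mapping from dictionary code to a set of characters; the function name `summarize_by_character` and the sample data are hypothetical, and the actual easummary file layout is not reproduced here.

```python
from collections import defaultdict


def summarize_by_character(chars_by_dict):
    """Invert {dictionary code: set of characters} into
    {character: list of dictionary codes containing it}."""
    dicts_by_char = defaultdict(list)
    # Visit dictionaries in sorted order so each character's list is sorted.
    for dict_code in sorted(chars_by_dict):
        for ch in chars_by_dict[dict_code]:
            dicts_by_char[ch].append(dict_code)
    # Return the characters ordered by Unicode code point.
    return {ch: dicts_by_char[ch] for ch in sorted(dicts_by_char)}


# Made-up sample data, for illustration only.
sample = {"mw": {"\u0101", "\u015b"}, "acc": {"\u0101"}}
summary = summarize_by_character(sample)
```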

There are summary files: