sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

Headword Stats #143

Closed gasyoun closed 1 year ago

gasyoun commented 1 year ago

287588: is this the number of words in Monier, @funderburkjim, or have I missed something?

https://sanskrit-lexicon.uni-koeln.de/scans/MWScan/2020/web/webtc/download.html


Andhrabharati commented 1 year ago

As usual (for you), you forgot that it was you who had the HW body portion split at the sense changes of nouns etc. [or rather at (almost) every semicolon, irrespective of whether the sense actually changes]; in the process, a new L-number was allotted to each such split.

So, the L-count does not indicate the actual "proper" HW count in MW!!

funderburkjim commented 1 year ago

Another count:

$ grep -E "^<L>.*[0-9]$" temp_mw_07.txt | wc -l
193763

This ignores the 'A', 'B', 'C' subcategories. While a count of <L> alone in mw gives too many (as @Andhrabharati explains), this count (193763) is slightly too low (e.g., it will not count <L>119390.1<pc>606,1<k1>paruzam<k2>paruza/m<e>2C). Still, 193763 is closer to the real count of distinct headwords in MW.

I don't think regexes alone can find the real count. But I think the number of lines in extract_keys_b.txt does give the real count (currently 194048). This file may be found at https://sanskrit-lexicon.uni-koeln.de/scans/MWScan/2020/pywork/mwkeys/extract_keys_b.txt (or at the analogous location in a local installation)
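For anyone who wants to reproduce the two regex-based counts without grep, here is a minimal Python sketch. It assumes only what is visible above: metalines start with <L>, and the grep pattern keeps those that end in a digit. The function name count_metalines and the sample lines (other than the paruzam line quoted above) are made up for illustration.

```python
import re

# Same pattern as the grep above: a metaline that ends in a digit,
# i.e. one without an 'A', 'B' or 'C' suffix at the end.
ENDS_IN_DIGIT = re.compile(r"^<L>.*[0-9]$")


def count_metalines(lines):
    """Return (all <L> lines, <L> lines that end in a digit)."""
    total = 0
    ending_in_digit = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("<L>"):
            continue
        total += 1
        if ENDS_IN_DIGIT.match(line):
            ending_in_digit += 1
    return total, ending_in_digit


# Hypothetical sample; only the paruzam line is taken from this thread.
sample = [
    "<L>1<pc>1,1<k1>a<k2>a<e>1",
    "some body text",
    "<L>119390.1<pc>606,1<k1>paruzam<k2>paruza/m<e>2C",
]
total, ending_in_digit = count_metalines(sample)
print(total, ending_in_digit)
```

The paruzam line is counted in the first number but not in the second, which is the undercount described above.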

funderburkjim commented 1 year ago

hwcount.py

A simple program that counts the number of distinct headwords in a dictionary. For MW:

$ python hwcount.py /c/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt
880542 lines read from C:/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt
287587 entries found
194049 distinct k1 headwords in C:/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt

This program applies to any dictionary, e.g., for the new lrv:

$ python hwcount.py /c/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
190412 lines read from C:/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
47603 entries found
42994 distinct k1 headwords in C:/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
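The source of hwcount.py is not shown in this thread. The following is only a rough sketch of how such a count could be done, not the actual program. It assumes that every entry begins with a metaline starting with <L> and that the headword sits between <k1> and the next tag, as in the paruzam example earlier in this issue. The names count_headwords and report are hypothetical.

```python
import re

# Assumption: the k1 headword is the text between "<k1>" and the next "<".
K1_RE = re.compile(r"<k1>([^<]*)<")


def count_headwords(lines):
    """Return (lines read, entries found, distinct k1 headwords)."""
    nlines = 0
    nentries = 0
    keys = set()
    for line in lines:
        nlines += 1
        if not line.startswith("<L>"):
            continue  # body line or other markup, not a metaline
        nentries += 1
        m = K1_RE.search(line)
        if m:
            keys.add(m.group(1))
    return nlines, nentries, len(keys)


def report(path):
    """Print the three counts for one dictionary file (assumed UTF-8)."""
    with open(path, encoding="utf-8") as f:
        nlines, nentries, nkeys = count_headwords(f)
    print(f"{nlines} lines read from {path}")
    print(f"{nentries} entries found")
    print(f"{nkeys} distinct k1 headwords in {path}")
```

Calling report() on a local copy of mw.txt or lrv.txt should print three lines in the same shape as the output above. Whether the numbers match exactly depends on how closely these assumptions follow the real hwcount.py.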

@gasyoun A good exercise for you would be to use hwcount.py to generate a similar list for all the Cologne dictionaries.

Andhrabharati commented 1 year ago

Good to see LRV in here.

What is pending to make it public at the koeln's CDSL site?

gasyoun commented 1 year ago

> What is pending to make it public at the koeln's CDSL site?

@funderburkjim what remains?