Closed gasyoun closed 1 year ago
As usual (for you), you forgot that it was you who had the HW body portion split at the sense changes of nouns etc. [or rather at (almost) every semicolon, irrespective of any change of meaning ('sense')]; in the process, a new L-number was allotted to every such split.
So, the L-count does not indicate the actual "proper" HW count in MW!!
grep -E "^<L>.*[0-9]$" temp_mw_07.txt | wc -l
193763
This ignores the 'A', 'B', 'C' subcategories. While the count of <L> alone in mw gives too many (as @Andhrabharati explains), this count (193763) is slightly too low (e.g., it will not count <L>119390.1<pc>606,1<k1>paruzam<k2>paruza/m<e>2C, because that line ends in a letter rather than a digit).
But this 193763 is closer to the real count of distinct headwords in MW.
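To illustrate why the grep pattern above undercounts, here is a small Python sketch that applies the same regular expression to a few metalines. Only the `paruzam` line is taken from this thread; the other two lines are made-up samples in the same shape, not real MW data.

```python
import re

# Same pattern as in the grep -E command: a line that starts with <L>
# and whose last character is a digit.
PATTERN = re.compile(r"^<L>.*[0-9]$")

samples = [
    # Hypothetical metaline ending in a digit: counted.
    "<L>100<pc>1,1<k1>sample<k2>sample<e>2",
    # Quoted in this thread: ends in 'C', so the pattern skips it.
    "<L>119390.1<pc>606,1<k1>paruzam<k2>paruza/m<e>2C",
    # Hypothetical metaline ending in 'A': also skipped.
    "<L>100.1<pc>1,1<k1>sample<k2>sample<e>2A",
]

counted = [s for s in samples if PATTERN.search(s)]
skipped = [s for s in samples if not PATTERN.search(s)]

print(len(counted), "counted;", len(skipped), "skipped")
```

The pattern drops every 'A', 'B', 'C' line, including one like `paruzam` whose k1 differs from its parent, which is why the grep count comes out slightly low.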
I don't think regexes alone can find the real count. But I think the number of lines in extract_keys_b.txt does give the real count (currently 194048). This file may be found at https://sanskrit-lexicon.uni-koeln.de/scans/MWScan/2020/pywork/mwkeys/extract_keys_b.txt (or at the analogous location in a local installation).
A simple program counts the number of distinct headwords in a dictionary. For MW:
$ python hwcount.py /c/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt
880542 lines read from C:/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt
287587 entries found
194049 distinct k1 headwords in C:/xampp/htdocs/cologne/csl-orig/v02/mw/mw.txt
This program applies to any dictionary, e.g., for the new lrv:
$ python hwcount.py /c/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
190412 lines read from C:/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
47603 entries found
42994 distinct k1 headwords in C:/xampp/htdocs/cologne/csl-orig/v02/lrv/lrv.txt
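The source of hwcount.py is not shown in this thread. The following is only a rough sketch of how such a count could be done, not the actual script. It assumes that each entry begins with a metaline shaped like the one quoted earlier (`<L>…<pc>…<k1>…<k2>…`) and that an entry is identified by its metaline. The function names, the sample lines and the `<LEND>` closing line in the sample are assumptions made for illustration.

```python
import re

# Assumed metaline shape, modelled on the line quoted in this thread:
#   <L>119390.1<pc>606,1<k1>paruzam<k2>paruza/m<e>2C
METALINE = re.compile(r"^<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)<k2>")


def count_headwords(lines):
    """Return (lines read, entries found, distinct k1 headwords)."""
    n_lines = 0
    n_entries = 0
    k1_seen = set()
    for line in lines:
        n_lines += 1
        m = METALINE.match(line)
        if m:
            # Every metaline starts one entry; k1 is its headword key.
            n_entries += 1
            k1_seen.add(m.group("k1"))
    return n_lines, n_entries, len(k1_seen)


def count_file(path):
    """Hypothetical helper: run the count on a dictionary text file."""
    with open(path, encoding="utf-8") as f:
        return count_headwords(f)


# Made-up sample (not real dictionary data): three entries, two distinct k1.
sample = [
    "<L>1<pc>1,1<k1>aaa<k2>aaa<e>1",
    "body of first entry",
    "<LEND>",
    "<L>1.1<pc>1,1<k1>aaa<k2>aaa<e>1A",
    "body of second entry",
    "<LEND>",
    "<L>2<pc>1,1<k1>bbb<k2>bbb<e>2",
    "body of third entry",
    "<LEND>",
]

n_lines, n_entries, n_k1 = count_headwords(sample)
print(n_lines, "lines read")
print(n_entries, "entries found")
print(n_k1, "distinct k1 headwords")
```

Collecting k1 values in a set is what makes the count "distinct": entries that share a headword (for example sense splits with their own L-number) add to the entry count but not to the headword count.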
@gasyoun A good exercise for you would be to use hwcount.py to generate a similar list for all the Cologne dictionaries.
Good to see LRV in here.
What is pending to make it public at the koeln's CDSL site?
@funderburkjim what remains?
287588 - is it the number of words in Monier, @funderburkjim, or have I missed something?
https://sanskrit-lexicon.uni-koeln.de/scans/MWScan/2020/web/webtc/download.html