Open funderburkjim opened 9 years ago
Interesting observation. Hmm, I've seen them before, and if I could have some effect, I would love to see them sooner rather than later. I also wonder whether these 2975 cases match MW or are beyond his lexicon's reach. In the .xml file (http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/downloads/pwxml.zip) it's <noti>und</noti>, but that gives 9105 cases.
There are good ones, like
<H1><h><key1>hrAduni</key1><key2>hrAdu/ni</key2></h><body><noti>und</noti> <s>hrAdu/nI</s>
There are harder cases, that are not covered with Jim's regex, but still should be counted:
<H1><h><key1>hOtrakalpadruma</key1><key2>hOtrakalpadruma</key2></h><body><gram n="n">n.</gram> <noti>und</noti> <s>hOtrasUtra</s>
And false positives that do not relate to headwords:
<H1><h><key1>hvArya</key1><key2>(hvArya)</key2></h><body><s>hvAria/</s> <gram n="Adj">Adj.</gram> <i>colubrinus</i> <noti>oder</noti> <i>geschmeidig , sich durchwindend</i> , <gram n="m">m.</gram> <noti>angeblich</noti> <i>Ross</i> <noti>und</noti> <i>Schlange.</i>
What I wanted to show with these samples is that 1-2 other tags can occur between <body> and the <noti>und</noti> train.
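A rough sketch of how the three samples above could be told apart in pw.xml. It assumes the intervening tags are always <gram> elements (as in the hOtrakalpadruma sample) and that each record sits on one line; the names ALT_RE and alternate_headword are made up for illustration and are not part of any existing script.

```python
import re

# Assumption: up to two <gram> tags may sit between <body> and <noti>und</noti>.
# Other tag types would need to be added to the optional group.
ALT_RE = re.compile(
    r"<key1>(?P<key1>[^<]*)</key1>.*?<body>\s*"
    r"(?:<gram\b[^>]*>[^<]*</gram>\s*){0,2}"
    r"<noti>und</noti>\s*<s>(?P<alt>[^<]*)</s>"
)


def alternate_headword(record):
    """Return (key1, alternate) if 'und' directly follows the body start, else None."""
    m = ALT_RE.search(record)
    return (m.group("key1"), m.group("alt")) if m else None


# The three records quoted in this comment.
samples = [
    '<H1><h><key1>hrAduni</key1><key2>hrAdu/ni</key2></h><body><noti>und</noti> <s>hrAdu/nI</s>',
    '<H1><h><key1>hOtrakalpadruma</key1><key2>hOtrakalpadruma</key2></h><body><gram n="n">n.</gram> <noti>und</noti> <s>hOtrasUtra</s>',
    '<H1><h><key1>hvArya</key1><key2>(hvArya)</key2></h><body><s>hvAria/</s> <gram n="Adj">Adj.</gram> <i>colubrinus</i> <noti>oder</noti> <i>geschmeidig , sich durchwindend</i> , <gram n="m">m.</gram> <noti>angeblich</noti> <i>Ross</i> <noti>und</noti> <i>Schlange.</i>',
]

results = [alternate_headword(s) for s in samples]
for r in results:
    print(r)
```

The hvArya record is rejected because its <noti>und</noti> comes late in the body, after an <s> and several <i> elements, rather than right after the optional <gram> tags.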
¦ •[mfn][.] ‹und› #
hrAdu/nI in a 'key2' form, from which the key1 hrAdunI is readily derived.
{agnipraveSa}1{agnipraveSa}¦ •m. ‹und› #{°praveSana} •n.
where the full alternate headword is agnipraveSana.
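The two derivations mentioned here (key1 from a key2 form, and the full headword from a ° abbreviation) might be sketched as follows. Both helper names are hypothetical, and the overlap rule in expand_abbreviation is only a heuristic guessed from the agnipraveSa example, not a documented PW convention.

```python
def key1_from_key2(key2):
    """Derive key1 by dropping accent marks and parentheses from a key2 form."""
    # Assumption: '/' and parentheses are the only decorations to strip;
    # other accent markers may need to be added for real data.
    return "".join(ch for ch in key2 if ch not in "/()")


def expand_abbreviation(headword, abbreviated):
    """Expand a form like '°praveSana' against the headword it abbreviates."""
    if not abbreviated.startswith("°"):
        return abbreviated  # already a full form
    tail = abbreviated[1:]
    # Heuristic: find the longest start of the tail that is also the end of
    # the headword, then replace that shared part with the whole tail.
    for k in range(min(len(headword), len(tail)), 0, -1):
        if headword.endswith(tail[:k]):
            return headword[:-k] + tail
    return None  # no overlap found; flag for manual review


print(key1_from_key2("hrAdu/nI"))
print(expand_abbreviation("agnipraveSa", "°praveSana"))
```

Cases where the abbreviation shares no ending with the headword return None, so they can be collected and reviewed by hand instead of being guessed at.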
Would love to see the list. This, plus prefix-root verbs in PW and PWK, is the thing I need most right now for my Reverse Dictionary. And AS replaced by Unicode in etymologies once and for all. That's all I ask for.
It seems that at least a first pass at a partial list of alternate headwords for PW could be derived programmatically based on the observations made above. It would be a matter of applying a regex to either pw.txt or pw.xml.
To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.
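As a minimal sketch of such a first pass over pw.txt: the regex below assumes every entry has the layout {key1}, optional homonym digits, {key2}, then ¦, as in the agnipraveSa line quoted earlier, and it treats the gender marker as optional. TXT_RE and first_pass are illustrative names, and the second sample line is invented purely as a non-matching case.

```python
import re

# Assumed entry layout, modelled on the agnipraveSa sample:
#   {key1}<homonym digits>{key2}¦ [•m. ]‹und› #{alternate}
TXT_RE = re.compile(
    r"^\{(?P<key1>[^}]*)\}\d*\{(?P<key2>[^}]*)\}¦ "
    r"(?:•[mfn][.] )?‹und› #\{(?P<alt>[^}]*)\}"
)


def first_pass(lines):
    """Yield (key1, raw alternate) for each line that matches the pattern."""
    for line in lines:
        m = TXT_RE.match(line)
        if m:
            yield m.group("key1"), m.group("alt")


sample_lines = [
    "{agnipraveSa}1{agnipraveSa}¦ •m. ‹und› #{°praveSana} •n.",
    "an unrelated line that should not match",  # invented, for contrast only
]

candidates = list(first_pass(sample_lines))
print(candidates)
```

On the real file the same generator could be fed an open file handle, for example open("pw.txt", encoding="utf-8"), where the path is whatever the local copy is called. The raw alternates still carry accents and ° abbreviations, so they would need normalizing before they can serve as headwords.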
Please apply regex voodoo. I will review the results files to see where it fails.
these alternate headwords would be accessible as headwords
Is this still a million years ahead, @funderburkjim?
To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.
@funderburkjim
With my v.2 data (which has all the 'grouped' words identified and marked) incorporated into the cdsl file, this issue can be closed.
And this doesn't involve numerous revisions, but just a single one!
In checking a potential correction for PW, I noticed a feature from this example:
The feature is that SrotaHpati is an additional headword, presented in the text as an alternate to srotaISa.
The pattern
¦ ‹und›
occurs in 2975 cases, and, from a brief examination, appears usually to indicate an alternate headword. I'm not sure how to specifically handle these cases. But in a more perfect coding of PW, these alternate headwords would be accessible as headwords; and I wanted to mention this here as the subject of some future enhancement to PW.