sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889
Many alternate headwords #10

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

In checking a potential correction for PW, I noticed a feature from this example:

<H1>100{srotaISa}1{*srotaISa}¦ ‹und› #{srotaHpati} •m. {%das Meer.%} PW131930

The feature is that srotaHpati is an additional headword, presented in the text as an alternate to srotaISa.

The pattern ¦ ‹und› occurs in 2975 cases and, from a brief examination, usually appears to indicate an alternate headword.

I'm not sure how to handle these cases specifically. But in a more perfect coding of PW, these alternate headwords would be accessible as headwords; I wanted to mention this here as the subject of some future enhancement to PW.

gasyoun commented 8 years ago

Interesting observation. I have seen these before, and if I had any say, I would love to see them handled sooner rather than later. I also wonder whether these 2975 cases match MW or are beyond that lexicon's reach. In the .xml file (http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/downloads/pwxml.zip) the marker is <noti>und</noti>, but that gives 9105 cases. There are good ones, like

<H1><h><key1>hrAduni</key1><key2>hrAdu/ni</key2></h><body><noti>und</noti> <s>hrAdu/nI</s>

There are harder cases that are not covered by Jim's regex but should still be counted:

<H1><h><key1>hOtrakalpadruma</key1><key2>hOtrakalpadruma</key2></h><body><gram n="n">n.</gram> <noti>und</noti> <s>hOtrasUtra</s>

And false positives that do not relate to headwords:

<H1><h><key1>hvArya</key1><key2>(hvArya)</key2></h><body><s>hvAria/</s> <gram n="Adj">Adj.</gram> <i>colubrinus</i> <noti>oder</noti> <i>geschmeidig , sich durchwindend</i> , <gram n="m">m.</gram> <noti>angeblich</noti> <i>Ross</i> <noti>und</noti> <i>Schlange.</i>

What I wanted to show with these samples is that 1-2 other tags can occur between <body> and the <noti>und</noti> sequence.
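
A minimal Python sketch of this idea, assuming each entry sits on one line shaped like the samples above: it allows up to two simple non-`noti` elements between `<body>` and `<noti>und</noti>`, and requires an `<s>` element right after. The names `XML_ALT_RE` and `xml_alternate` are made up for illustration, and the pattern has only been checked against the three samples in this comment.

```python
import re

# Hypothetical pattern, derived only from the three samples quoted above.
XML_ALT_RE = re.compile(
    r"<key1>(?P<key1>[^<]+)</key1>.*?<body>"
    # up to two simple elements (e.g. <gram n="n">n.</gram>) before the marker
    r"(?:\s*<(?!noti\b)(?P<tag>\w+)\b[^>]*>[^<]*</(?P=tag)>){0,2}"
    # the marker itself, followed directly by a Sanskrit <s> element
    r"\s*<noti>und</noti>\s*<s>(?P<alt>[^<]+)</s>"
)

def xml_alternate(line):
    """Return (key1, alternate) for an entry line, or None if no match."""
    m = XML_ALT_RE.search(line)
    return (m.group("key1"), m.group("alt")) if m else None
```

Requiring `<s>` after the marker is what rejects the hvArya sample, where `<noti>und</noti>` joins two German glosses rather than two headwords.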

funderburkjim commented 8 years ago

¦ •[mfn][.] ‹und› #

{agnipraveSa}1{agnipraveSa}¦ •m. ‹und› #{°praveSana} •n.

where the full alternate headword is agnipraveSana
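
One possible way to expand such an abbreviated form, sketched in Python. The overlap heuristic is an assumption, not something established in this thread: it looks for the longest ending of the headword that also begins the abbreviated part, and keeps whatever precedes it. The function name `expand_abbreviation` is hypothetical.

```python
def expand_abbreviation(headword, alt):
    """Expand an alternate such as '°praveSana' against its headword.

    Heuristic (an assumption): find the longest proper suffix of the
    headword that the abbreviated part starts with, then prepend the
    part of the headword before that suffix.
    """
    if not alt.startswith("°"):
        return alt                      # already a full form
    tail = alt[1:]
    for i in range(1, len(headword)):   # longest suffix first
        if tail.startswith(headword[i:]):
            return headword[:i] + tail
    return None                         # no overlap; needs manual review
```

For the example above, `expand_abbreviation("agnipraveSa", "°praveSana")` gives `agnipraveSana`. Short accidental overlaps could produce wrong expansions, so the results would still need review.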

gasyoun commented 8 years ago

Would love to see the list. This, plus prefix-root verbs in PW and PWK, is what I need most right now for my Reverse Dictionary. And AS replaced by Unicode in etymologies once and for all. That's all I ask for.

funderburkjim commented 8 years ago

It seems that at least a first pass at a partial list of alternate headwords for PW could be derived programmatically based on the observations made above. It would be a matter of applying a regex to either pw.txt or pw.xml.
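
A rough Python sketch of such a first pass over pw.txt, assuming every entry line has the shape of the two samples quoted earlier in this thread (`{key1}N{key2}¦`, an optional grammar marker such as `•m.`, then `‹und› #{...}`). The names `ALT_RE` and `find_alternates` are hypothetical, and the pattern has not been run against the real file.

```python
import re

# Assumed line shape, taken from the samples in this thread:
#   ...{key1}N{key2}¦ [•gram.] ‹und› #{alternate} ...
ALT_RE = re.compile(
    r"\{(?P<key1>[^{}]+)\}\d+\{[^{}]*\}¦"  # headword block ending in ¦
    r"(?:\s*•[^‹#{]*?)?"                    # optional grammar marker, e.g. •m.
    r"\s*‹und›\s*#\{(?P<alt>[^{}]+)\}"     # ‹und› followed by the alternate
)

def find_alternates(lines):
    """Return (headword, alternate) pairs for lines matching the pattern."""
    pairs = []
    for line in lines:
        m = ALT_RE.search(line)
        if m:
            pairs.append((m.group("key1"), m.group("alt")))
    return pairs
```

The alternate is returned exactly as printed, so abbreviated forms beginning with `°` would still need expanding in a later step.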

To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.

gasyoun commented 8 years ago

Please apply regex voodoo. I will review the result files to see where it fails.

gasyoun commented 7 years ago

these alternate headwords would be accessible as headwords

Is this still a million years away, @funderburkjim?

Andhrabharati commented 7 months ago

To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.

@funderburkjim

With my v.2 data (which has all the 'grouped' words identified and marked) incorporated into the cdsl file, this issue can be closed.

And this doesn't involve numerous revisions, but just a single one!