Open gasyoun opened 10 years ago
The bad thing about making things public is that all the forgotten dirty laundry might be seen!
Yes, the number of dashes separating 'padas' in key2 in MW is ugly, and irregular.
In 2013, I spent several months trying to rationalize key2. The motivating idea was to unravel the etymological information packed into key2.
For instance, take the headword mantrArTAdIpa, whose key2 is
mantrA<srs/>rTA---dIpa
This was analyzed as
157434,H4,mantrA@rTA-dIpa => <CPD><hw hw="mantrArTa" r="f.">157432,H3,mantrArTA</hw> <hw1>92770,H2,dIpa</hw1></CPD>
One use of this analysis could be to provide hyperlinks to mantrArTa and dIpa when a user looks up mantrArTAdIpa.
Similarly, the analysis of mantrArTa is
157432,H3,mantrA@rTa => <SRS><hw>157237,H2,mantra</hw> <hw1>15840,H1,arTa</hw1></SRS>
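As a rough illustration of the hyperlink idea, an analysis record of the form shown above could be split into its components roughly as follows. This is only a sketch: `parse_analysis` and the dictionary keys are hypothetical names, not part of the 2013 programs, and it assumes the right-hand side of `=>` is always well-formed XML, as it is in the two examples.

```python
import xml.etree.ElementTree as ET


def parse_analysis(line):
    """Split one analysis record into its head entry and component entries.

    Assumed layout (taken from the two examples in this thread):
        <id>,<H-level>,<analyzed key> => <KIND><hw ...>id,level,key</hw> ...</KIND>
    """
    left, right = (part.strip() for part in line.split("=>", 1))
    lnum, level, key = left.split(",", 2)
    root = ET.fromstring(right)  # e.g. <CPD>...</CPD> or <SRS>...</SRS>
    parts = []
    for child in root:  # <hw>, <hw1>, ...
        cnum, clevel, ckey = child.text.split(",", 2)
        parts.append({
            "lnum": cnum,
            "level": clevel,
            "key": ckey,
            # When an hw attribute is present, treat it as the headword to link to.
            "target": child.get("hw", ckey),
        })
    return {"lnum": lnum, "level": level, "key": key,
            "kind": root.tag, "parts": parts}


record = parse_analysis(
    '157434,H4,mantrA@rTA-dIpa => <CPD><hw hw="mantrArTa" r="f.">'
    '157432,H3,mantrArTA</hw> <hw1>92770,H2,dIpa</hw1></CPD>'
)
link_targets = [part["target"] for part in record["parts"]]
```

For the mantrArTAdIpa record, `link_targets` holds mantrArTa and dIpa, which are the two headwords a display could link to.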
This analysis was carried rather far, probably to roughly 85-90% of the MW headwords to which such analysis is applicable.
I still think this would be an interesting task to revisit and complete.
Until such analysis is completed, it seems premature to worry about the oddities of the number of dashes in key2.
If you can provide some readme or documentation of earlier effort, it would be beneficial to all @funderburkjim
This is easier said than done. In a preliminary look at the documentation, it is difficult to find what the final status was: several paths seem to have been followed, some of which were probably false. If the author can't readily understand what was done, someone else probably could even less. I think the best path would be to work through the 3000-line readme file and recreate the work. As progress is made on this, I'll post it.
Note to self: the readme file is at lgtab1/mapnorm
@funderburkjim I highly value your effort to make every single detail public, all the more because I know well enough that "all the forgotten dirty laundry" will be seen. After working for a year on Sanskrit grammar tables, I myself forgot what half of my abbreviations meant, so yes, I do understand your point.
Are you sure about number of dashes separating 'padas' in key2 in MW is ugly, and irregular.
It's useless? 2 or 3 dashes - just a random generation?
As per "I still think this would be an interesting task to revisit and complete." - this seems similar to what Huet has done with Pawan? To get at even more dirty laundry: could you reveal the secret sources from which you pull those Python scripts and .txt files? There seems to be an endless ocean of them, somewhere on your HDD. I just wonder if they'll be gone one day. If they were backed up somewhere around here, we would have at least a chance. For now we can only pray that you do get there one day.
@gasyoun re "Are you sure about number of dashes separating 'padas' in key2 in MW is ugly, and irregular. It's useless? 2 or 3 dashes - just a random generation?"
At one time I thought the right principle was to have one dash separating the H1&H2 base from the H3 portion, and two dashes separating the H3 base from the H4 portion. However, this informal rule has been applied so irregularly, and there are so many divergences from even this rule (as your '-----' shows), that the current number of dashes is not useful.
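To see how irregular the dash counts actually are, one could tabulate the lengths of the dash runs across all key2 values. The following is a minimal sketch, not an existing MWlexnorm script: `dash_runs` and `dash_run_histogram` are hypothetical names, and the sample list stands in for the real key2 column.

```python
import re
from collections import Counter


def dash_runs(key2):
    """Return the length of each run of consecutive dashes in a key2 string."""
    return [len(run) for run in re.findall(r"-+", key2)]


def dash_run_histogram(key2_values):
    """Count how often each dash-run length occurs over many key2 values."""
    counts = Counter()
    for key2 in key2_values:
        counts.update(dash_runs(key2))
    return counts


# Sample input: the key2 quoted earlier in this thread.
sample = ["mantrA<srs/>rTA---dIpa"]
histogram = dash_run_histogram(sample)  # one run of three dashes
```

Run lengths above two would be the divergences from the informal rule described above.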
However, as mentioned in https://github.com/funderburkjim/MWlexnorm/blob/master/step1a/readme.txt , I do think the dashes in key2 have use, just not the number of dashes. I hope the (not yet finished) work on rationalizing key2 in MW can be redone under MWlexnorm. In this process, a more useful key2 form may emerge. This more useful form will take into account not only the hyphens, but also the 'sr' and 'srs' markup present in key2. When brought to the state I anticipate, key2 will provide both useful intra-headword relationships, and also part of the information needed to accurately generate declensions and feminine forms for substantives.
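One way to keep the dashes while ignoring their number is to split key2 at every boundary and record only the kind of boundary. This is a sketch under assumptions: `split_key2` is a hypothetical helper, the `<srs/>` form is taken from the example in this thread, and the `<sr/>` form is assumed by analogy.

```python
import re

# A boundary is either a run of dashes or an <sr/> / <srs/> marker
# (the <sr/> spelling is an assumption).
BOUNDARY = re.compile(r"(-+|<srs?/>)")


def split_key2(key2):
    """Split key2 into (segment, following_boundary) pairs.

    Dash runs of any length collapse to a single "-", so the result does not
    depend on how many dashes were typed. The last segment has boundary None.
    """
    pieces = BOUNDARY.split(key2)  # segment, boundary, segment, ..., segment
    pairs = []
    for i in range(0, len(pieces), 2):
        boundary = pieces[i + 1] if i + 1 < len(pieces) else None
        if boundary is not None and boundary.startswith("-"):
            boundary = "-"
        pairs.append((pieces[i], boundary))
    return pairs


pairs = split_key2("mantrA<srs/>rTA---dIpa")
```

For the example, `pairs` is `[("mantrA", "<srs/>"), ("rTA", "-"), ("dIpa", None)]`.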
@drdhaval2785 re 'If you can provide some readme or documentation of earlier effort, it would be beneficial to all'
How is the documentation (via readmes) so far for step0 and step1a of https://github.com/funderburkjim/MWlexnorm ?
@funderburkjim Good enough. But there are no details about the attempts to normalise the seemingly irregular hyphens. I want you to document that, so that any wrong path may be detected and corrected.
@drdhaval2785 My intent with the MWlexnorm repository is to entirely redo that earlier analytical work. Currently, there has been so much good activity with corrections, that I haven't had a chance to continue this revision. When I do revisit this, my intention is to robustly document each step as it proceeds, so that others may be able to make constructive suggestions.
We are not in a hurry. What I really worry about is the purity of the headwords, because it seems that a few months from now my Reverse dictionary might actually get printed, and every word counts. Lexical data would make it a Grammar dictionary instead of just a reverse one, but as there is only a single Jim around, that is hardly possible before early 2015, I guess.
#sveda#
#svéda#
#a-sveda#
#saṁ-sveda#
#antáḥ--sveda#
#píṇḍa--sveda#
#guptá--sveda#
#upa-sveda#
#tāpa--sveda#
#sopasveda#
#púṣpa--sveda#
#gharmá--sveda#
#uṣma--sveda#
#saṁ-kara---sveda#
#pra-stará---sveda#
#pra-sveda#
#pāda--prasveda#
#sa--prasveda#
#dravá--sveda#
Why are #pāda--prasveda# and #sopasveda# unsplit when there is no sandhi? @drdhaval2785 can you think of any automation for such to-be-split cases? If sveda is already a known legal ending for composita, can it be marked in such cases where there is no sandhi? Or even when there is sandhi, can it be marked as in MW?
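A first pass at such automation could simply flag headwords that end in a known final member but have no dash directly before it. This is a minimal sketch, not an existing tool: `flag_unsplit` is a hypothetical name, the input is assumed to use the `#...#` form listed above, and the check is a plain string comparison that knows nothing about sandhi or accents.

```python
def flag_unsplit(headwords, ending):
    """Return headwords that end in `ending` with no dash right before it."""
    flagged = []
    for raw in headwords:
        headword = raw.strip("#")
        if headword == ending or not headword.endswith(ending):
            continue  # the simplex itself, or a different final member
        prefix = headword[: -len(ending)]
        if not prefix.endswith("-"):
            flagged.append(headword)
    return flagged


words = ["#sveda#", "#a-sveda#", "#upa-sveda#", "#sopasveda#",
         "#pra-sveda#", "#pāda--prasveda#", "#sa--prasveda#"]
candidates = flag_unsplit(words, "sveda")
# candidates: sopasveda, pāda--prasveda, sa--prasveda
```

Each flagged item would still need a human decision, since the sketch cannot tell a genuine unsplit compound from a word that merely happens to end in the same letters.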
@funderburkjim has there been any work on lexnorm recently, as you mentioned around a year ago? Just curious.
tempus fugit.
I still think that the continuation of the lexnorm work has merit, but other things seem to take precedence, so I have not furthered that work. Alas.
I have (falsely) counted that there are so many cases. Are you sure ------ in pāri--jāta------haraṇa-campū should remain, or just ----?