sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/
hyphenated words in full-text search #75

Open funderburkjim opened 8 years ago

funderburkjim commented 8 years ago

In a comment regarding a hyphenated Greek word (case 270 here), @jmigliori brings up a general issue.

I'm thinking the issue is relevant for full-text searches in all the dictionaries where the basic form of the digitization is faithful to the line-breaks of the dictionary. The issue involves words which are hyphenated in such digitizations.

For instance, in MW72, the word 'diamond' appears in only a hyphenated form under the headword akza (SLP1):

argument of the latitude. — Aksha-ja, as, m. a dia- 
mond; a thunderbolt; a N. of Vishṇu. — Aksha-dhur, 

This issue has not been addressed at all in the full-text search function currently evident in the 'Advanced Search' of the displays.

For instance, using the advanced search for MW72, and searching an exact match for 'diamond', one does NOT find this instance.

Solution of this problem is beyond our current techniques, but the problem deserves attention when the time is right.

gasyoun commented 8 years ago

Indeed I'm aware of the issue, but have never seen a solution. A regex filtering out any - would be enough, no?

funderburkjim commented 8 years ago

@gasyoun Doing something straightforward with hyphens would be a reasonable way to start.
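
As a starting point, the hyphen handling could be sketched as below. This is only an illustration, not the Advanced Search code: the names `dehyphenate` and `contains_word` are hypothetical, and it assumes that a hyphen at the end of a line always marks a word broken across lines (so a genuine compound that happens to break at its hyphen would also be joined).

```python
import re

# Assumption: a hyphen directly before a line break splits a single word.
# Hyphens inside a line (e.g. "Aksha-ja") are left untouched.
LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)


def contains_word(text: str, word: str) -> bool:
    """Exact whole-word match against the de-hyphenated text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, dehyphenate(text)) is not None


# The MW72 lines quoted earlier in this issue, with their line break.
entry = (
    "argument of the latitude. — Aksha-ja, as, m. a dia- \n"
    "mond; a thunderbolt; a N. of Vishṇu. — Aksha-dhur, "
)

print(contains_word(entry, "diamond"))
```

The search would run on the joined text while the stored digitization keeps its line breaks, so the faithful form of the dictionary is not altered.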

I continue to think that pouring the dictionaries into a mature search engine framework like Lucene (or its extension, Solr) or Elasticsearch would be an important enhancement to the Advanced Search functionality. However, I am still intimidated by the details of this.

gasyoun commented 8 years ago

I do not think that it's time. Let's continue killing batch errors. Where can I help?

funderburkjim commented 8 years ago

@gasyoun You could help by identifying what needs to be done.

Is @Shalu411 still working on something?

I haven't been thinking much about corrections in the last couple of months - just staying abreast of the user 'Correction Form' items. I've been more interested in the apidev material. I've also become intrigued with understanding Scharf's sandhi program -- the documentation of this is not yet published, though the programs (Java and Python versions) are functioning in the repository.

So if you identify where work needs to be done on improving the headword lists, that might reignite my interest there.

gasyoun commented 8 years ago

@Shalu411 will get back in two weeks and I'll ask her to get back to the corrections. @drdhaval2785 would be a better choice for diving into sandhi issues, if I'm to decide, as he has the background needed and skills available. APIDEV is for the web version only, for interaction. Let me see what tasks I can recognize - I'm interested in the material, not that much in the representation, as it has become much better anyway lately. Enough for now, for me at least.

funderburkjim commented 8 years ago

Re sandhi, I want to make the code comprehensible to myself, as part of my education. A byproduct of this is that the code will be comprehensible to others, such as @drdhaval2785 , who may wish to evaluate the code based on their prior understanding of Paninian grammar. A second byproduct will be the ability to compare this Paninian approach to sandhi to other approaches, such as that of Bucknell. A third byproduct will be the ability to enhance the sandhi to take into account some relatively obscure options that Peter programmed into his original Pascal program, but which were not included in my Java conversion. A fourth byproduct will be the inclusion of a 'history' option, whereby the rules used in a given sandhi derivation may be displayed; I have a beginning to this with the use of a Python decorator applied to the various sandhi methods.
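
A 'history' option built on a decorator might look roughly like the sketch below. It is not the code in the ScharfSandhi repository: the decorator `record_rule`, the `history` list, and the two toy vowel rules (written in SLP1, with `+` marking the junction) are all hypothetical and stand in for the real sandhi methods.

```python
import functools

# Hypothetical log of (rule name, input, output) for the current derivation.
history = []


def record_rule(func):
    """Record a rule in `history` whenever it changes its input."""
    @functools.wraps(func)
    def wrapper(text):
        result = func(text)
        if result != text:
            history.append((func.__name__, text, result))
        return result
    return wrapper


@record_rule
def toy_a_plus_a(text):
    # Toy rule (SLP1): a + a -> A
    return text.replace("a+a", "A")


@record_rule
def toy_a_plus_i(text):
    # Toy rule (SLP1): a + i -> e
    return text.replace("a+i", "e")


def derive(text):
    """Apply the toy rules in order, starting from an empty history."""
    history.clear()
    for rule in (toy_a_plus_a, toy_a_plus_i):
        text = rule(text)
    return text


print(derive("ca+iti"))
print(history)
```

Only rules that actually change the string are logged, so the history reads as the list of rules used in that derivation.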

So, despite this having no direct bearing on the dictionaries, I intend to continue with this.

gasyoun commented 8 years ago

Can I recommend "Sanskrit Sandhi and Exercises (1968)" to you? A colleague scanned and sent it to me. Or are there enough training examples already? The number of byproducts is astonishing. But I do hope I can get you back to the cleanup soon :dart:

funderburkjim commented 8 years ago

  1. The Scharfsandhi repository now is available. It includes a comparison of computed sandhi v. Bucknell (short conclusion - almost identical).
  2. I'd like to see the Sanskrit Sandhi and Exercises.

gasyoun commented 8 years ago

  1. Amazing. Brain starts to boil. 697 cases are derived from Bucknell's consonant sandhi table, plus 21 cases from Bucknell's vowel sandhi table. https://github.com/funderburkjim/ScharfSandhi/blob/master/pythonv4/scharfsandhi_bucknell.md
  2. Sent.
  3. Have I told you I've written a review on Bucknell? http://ores.su/en/journals/izvestiya-rossijskogo-gosudarstvennogo-pedagogicheskogo-universiteta-im-ai-gertsena/2008-nomer-82-1/a174118
  4. In a month I'm reprinting Buhler's Sanskrit Leitfaden, which contains a sandhi table: https://github.com/sanskrit-lexicon/Cologne/blob/master/Sandhi-Table-27.04.15-Likhushina-ed.pdf It might be of interest for learning the differences.

funderburkjim commented 8 years ago

Buhler sandhi table is interesting. Do you have it in a digitized form that a program could parse?

One detail is that the table shows TWO results in several cases. It might be interesting to determine whether various option choices in Scharf's program can generate the same set of options.

I looked at your review (as translated by Google). From the translation, I got the impression that your review is a 'review of a review' - i.e., that someone compared Bucknell's approach to that of several other books, and that you are providing a synopsis of this comparison. But maybe this impression is wrong, and is an artifact of the translation from Russian to English.

gasyoun commented 8 years ago

I've got it as a Word file of the same table you saw in .pdf. So it would not help you much, I guess, but it is the most well-known sandhi table. Bucknell is great, but far less known. As I remember, the review was just a review and not of another review.

funderburkjim commented 8 years ago

@gasyoun Word file of Buhler sandhi table might be helpful.

gasyoun commented 8 years ago

@funderburkjim easy one https://github.com/sanskrit-lexicon/Cologne/blob/master/Sandhi-Table-27.04.15-Likhushina-ed.doc The original German table https://dl.dropboxusercontent.com/u/34967951/Buhler-Leitfaden1927.pdf

funderburkjim commented 8 years ago

@gasyoun The doc file opens well in Google Docs.

Question: Was this generated from scan via OCR software? If so, which software?

The reason I ask is that I've been working on a digitization of Ballantyne's LaghuSiddhanta. When I did OCR using tesseract, the result was very poor with the diacritics of IAST. However, in your sandhi table 'doc', the IAST looks well identified.

gasyoun commented 8 years ago

No OCR, it was recreated. Tesseract is no good; this is why I have used ABBYY for the last 13 years https://www.youtube.com/watch?v=Y82i2iUiKF4 - you can train a template, and the result becomes far less poor and even good.