sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

List of anti-sandhi letter combinations #437

Closed gasyoun closed 3 years ago

gasyoun commented 4 years ago

What n-gram combinations are possible in Sanskrit, Jim? Based on that, can we make a list of (near-)impossible ones? It is needed for proofreading Sanskrit books before sending them to the printer, thanks. This is somewhat similar to what was done with o_vs_O. Please advise, @funderburkjim and @drdhaval2785.

drdhaval2785 commented 3 years ago

@funderburkjim I recently saw a comment mentioning that bigrams and trigrams for all dictionaries are now available, but I am not able to locate them. Can you help?

funderburkjim commented 3 years ago

In simple search, I previously used bigrams/trigrams to limit the search tree.

However, I recently changed to using the sqlite 'LIKE' operator (select * from mw where key LIKE 'ab%';).
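A minimal sketch of the 'LIKE' prefix lookup described above, using Python's standard sqlite3 module. The table name `mw` and column `key` come from the quoted query; the in-memory database, the sample rows, and the helper name `prefix_search` are assumptions made for illustration and are not taken from simple-search itself.

```python
import sqlite3

# In-memory stand-in for a dictionary database (assumption: the real
# tables live in sqlite files; the rows below are made-up sample keys).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mw (key TEXT)")
conn.executemany(
    "INSERT INTO mw (key) VALUES (?)",
    [("aba",), ("abja",), ("agni",)],
)


def prefix_search(conn, prefix):
    """Return all keys that start with `prefix`, sorted.

    Hypothetical helper. It assumes `prefix` contains no LIKE wildcard
    characters ('%' or '_'); those would need escaping.
    """
    rows = conn.execute(
        "SELECT key FROM mw WHERE key LIKE ? ORDER BY key",
        (prefix + "%",),
    )
    return [key for (key,) in rows]


print(prefix_search(conn, "ab"))
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default, so a case-sensitive transliteration may need GLOB or an extra filter.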

The bigram/trigram construction and results are still in csl-apidev, at https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1

drdhaval2785 commented 3 years ago

@gasyoun

Jim has provided the valid bigrams and trigrams. Whatever is not listed there is invalid. It is quite useless to create a large negative list. Can we close this issue?
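The "whatever is not here is invalid" approach could be sketched roughly as follows. The word list, the function names, and the way the valid set is built are all assumptions for illustration; in practice the valid set would be read from the ngram1 files, whose exact format is not described in this thread.

```python
def ngrams(word, n):
    """Return all contiguous n-grams of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def unattested_ngrams(word, valid, n=2):
    """Return the n-grams of `word` that are absent from `valid`."""
    return [gram for gram in ngrams(word, n) if gram not in valid]


# Toy stand-in for the attested bigram list (made-up sample words).
sample_words = ["rAma", "deva", "agni"]
valid_bigrams = {gram for w in sample_words for gram in ngrams(w, 2)}

# A word whose bigrams are all attested yields an empty list.
print(unattested_ngrams("rAma", valid_bigrams))
# A suspicious spelling yields the bigrams a proofreader should check.
print(unattested_ngrams("rAqma", valid_bigrams))
```

A non-empty result only flags a word for review: an unattested n-gram may still be legitimate if the reference list is incomplete.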

gasyoun commented 3 years ago

The bigram/trigram construction and results are still in csl-apidev, at https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1

Thanks. @funderburkjim can we limit bigram/trigram construction to those that can occur: 1) in the beginning of a word (is 2gram_beg.txt exactly about it?) 2) in the middle 3) at the end?

funderburkjim commented 3 years ago

in the beginning of a word (is 2gram_beg.txt exactly about it?)

Yes.

2gram.txt are those that occur anywhere -- probably same as 'middle'

There is no 2gram_end.txt.

It's easy to construct such lists. As I mentioned, ngrams are superseded in recent versions of simple-search by a 'LIKE' selection technique.

So there is no current need for ngram lists. We can revisit the ngram1 directory later if there is further need of ngrams.
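For reference, position-specific lists of the kind asked about (beginning, middle, end) could be built along these lines. This is only a sketch under stated assumptions: the function name, the dictionary keys, and the sample words are hypothetical, and it is not the code used in the ngram1 directory.

```python
def positional_ngrams(words, n=2):
    """Collect n-grams by position within each word.

    Returns a dict of sets with keys 'beg' (word-initial), 'mid'
    (strictly interior), 'end' (word-final) and 'any' (all positions).
    A word of exactly n letters counts as both 'beg' and 'end'.
    """
    lists = {"beg": set(), "mid": set(), "end": set(), "any": set()}
    for word in words:
        last = len(word) - n  # start index of the final n-gram
        for i in range(last + 1):
            gram = word[i:i + n]
            lists["any"].add(gram)
            if i == 0:
                lists["beg"].add(gram)
            if i == last:
                lists["end"].add(gram)
            if 0 < i < last:
                lists["mid"].add(gram)
    return lists


# Made-up sample headwords standing in for a dictionary key list.
result = positional_ngrams(["rAma", "deva"], n=2)
for position in ("beg", "mid", "end"):
    print(position, sorted(result[position]))
```

Here 'mid' is kept strictly interior, which is narrower than a list of n-grams that occur anywhere; the 'any' set corresponds to the latter.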

gasyoun commented 3 years ago

2gram.txt are those that occur anywhere -- probably same as 'middle'

Right, understood.

It's easy to construct such lists.

Can you, please? It would ease verification of Sanskrit text outside Cologne as well.