gasyoun closed this issue 3 years ago
@funderburkjim I recently saw a comment mentioning that bigrams and trigrams of all dicts are available now. I am not able to locate the bigram / trigram files. Can you help?
In simple search, I previously used bigrams/trigrams to limit the search tree.
However, I recently changed to using the SQLite 'LIKE' operator (select * from mw where key LIKE 'ab%';).
The bigram/trigram construction and results are still in csl-apidev, at https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1
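For readers who want to try the 'LIKE' approach mentioned above, here is a minimal sketch in Python using the standard sqlite3 module. The table name `mw` and column `key` come from the query quoted above; the function name `prefix_search`, the toy rows, and the in-memory database are assumptions for illustration only, not the actual simple-search code.

```python
import sqlite3

def prefix_search(conn, prefix, limit=10):
    """Return keys in table `mw` that start with `prefix` (hypothetical helper)."""
    # Escape LIKE wildcards so '%' and '_' in the input match literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cur = conn.execute(
        "SELECT key FROM mw WHERE key LIKE ? ESCAPE '\\' ORDER BY key LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in cur]

# Toy in-memory table; the real database layout is an assumption here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mw (key TEXT)")
conn.executemany(
    "INSERT INTO mw VALUES (?)", [("aba",), ("abja",), ("agni",), ("deva",)]
)

print(prefix_search(conn, "ab"))
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default. If the keys use a case-sensitive transliteration, a case-sensitive alternative such as GLOB ('ab*') may be preferable.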
@gasyoun
Jim has provided the valid bigrams and trigrams. Whatever is not there is invalid. It is quite useless to create a large negative list. Can we close this issue?
The bigram/trigram construction and results are still in csl-apidev, at https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1
Thanks. @funderburkjim can we limit bigram/trigram construction
to those that can occur:
1) in the beginning of a word (is 2gram_beg.txt exactly about it?)
2) in the middle
3) at the end?
in the beginning of a word (is 2gram_beg.txt exactly about it?)
Yes.
2gram.txt are those that occur anywhere -- probably same as 'middle'
There is no 2gram_end.txt.
It's easy to construct such lists. As I mentioned, ngrams are superseded in recent versions of simple-search by a 'LIKE' selection technique.
So there is no current need for ngram lists. We can revisit the ngram1 directory later if there is further need of ngrams.
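As a rough illustration of how beginning, anywhere, and end lists could be built, here is a short Python sketch. It is not the code in the ngram1 directory. The function name `build_ngrams` and the sample words are hypothetical, and it assumes each character of a headword counts as one unit of the n-gram.

```python
def build_ngrams(words, n):
    """Collect n-grams seen at the start, anywhere, and at the end of the words."""
    beg, anywhere, end = set(), set(), set()
    for w in words:
        if len(w) < n:
            continue  # word too short to contain an n-gram of this size
        beg.add(w[:n])
        end.add(w[-n:])
        for i in range(len(w) - n + 1):
            anywhere.add(w[i:i + n])
    return beg, anywhere, end

# Hypothetical sample headwords, for illustration only.
words = ["agni", "aMSa", "deva"]
beg, anywhere, end = build_ngrams(words, 2)

print(sorted(beg))
print(sorted(end))
```

The `end` set is what a 2gram_end.txt would hold. A strictly "middle" list could be derived the same way by skipping the first and last positions in the inner loop.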
2gram.txt are those that occur anywhere -- probably same as 'middle'
Right, understood.
It's easy to construct such lists.
Can you, please? It would ease verification of Sanskrit text outside Cologne as well.
What combinations of n-grams are possible in Sanskrit, Jim? Based on that, can we make a list of (near) impossible ones? It is needed for proofreading Sanskrit books before sending them to the printer, thanks. Somewhat similar to what was done with o_vs_O. Please advise @funderburkjim and @drdhaval2785
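One possible way to use the existing positive lists for proofreading, without building a negative list, is to flag any n-gram in a word that is missing from the attested sets. The Python sketch below shows the idea. The function name `suspicious_ngrams` and the toy reference sets are hypothetical, and an unattested n-gram is only a hint that the word deserves a look, not proof of an error.

```python
def suspicious_ngrams(word, valid_beg, valid_any, n=2):
    """Return (position, ngram) pairs of `word` not found in the reference sets."""
    hits = []
    # Check the word-initial n-gram against the beginning-of-word list.
    if len(word) >= n and word[:n] not in valid_beg:
        hits.append(("beg", word[:n]))
    # Check every n-gram against the occurs-anywhere list.
    for i in range(len(word) - n + 1):
        gram = word[i:i + n]
        if gram not in valid_any:
            hits.append(("any", gram))
    return hits

# Toy reference sets; real ones would be loaded from files such as 2gram_beg.txt.
valid_beg = {"ag", "de"}
valid_any = {"ag", "gn", "ni", "de", "ev", "va"}

print(suspicious_ngrams("agni", valid_beg, valid_any))
print(suspicious_ngrams("angi", valid_beg, valid_any))
```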