Corrections to Burnouf IAST

funderburkjim commented 5 years ago

In the review of Sanskrit coding conventions, it was noted (see):

However, the non-italic Sanskrit proper names have not been converted to modern IAST; with @sanskritisampada 's help to identify the non-italic Sanskrit words, these will also soon be converted to modern IAST.

This work has now been done. This issue aims to provide some documentation.

funderburkjim commented 5 years ago

plainwords

The files mentioned are in this Burnouf/iastwork directory.

Some Sanskrit words in Burnouf appear in plain (non-italic) text. These have not yet been converted to standard Sanskrit IAST; they remain in Burnouf's own variant of IAST. We want to identify these words and then change their spelling to standard IAST for Sanskrit.

The first step is to generate a list of all 'words' that appear in plain text in the Burnouf digitization. There are about 20000 distinct such words identified.

Of course, many of these words are French. Using the pyenchant Python library and a related dictionary of French words, we can identify many of the 20000 words as French.

Note on pyenchant: Although still available and working with Python 2.7, apparently this library is no longer maintained. Here is the repository for it. It would be good to know if there is some replacement for this library, which will work with Python 3, since Python 2 will become obsolete in 2020.

This filter resulted in plainwords_french.txt (15202 words) and plainwords_other.txt (4843 words).

Each of these files shows a word on each line, along with how often it occurs (as a plain word) in the Burnouf digitization.

The program using pyenchant is fr_pyenchant.py.
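The counting and filtering steps above can be sketched as follows. This is a minimal illustration, not the actual fr_pyenchant.py: the function name `split_plainwords`, the toy vocabulary, and the data layout are assumptions made for the example. With pyenchant installed and a French dictionary available to Enchant, the predicate could be `enchant.Dict("fr_FR").check`.

```python
from collections import Counter


def split_plainwords(words, is_french):
    """Count the words and split them into (french, other) frequency lists.

    `is_french` is any callable returning True for a French word. With
    pyenchant this could be enchant.Dict("fr_FR").check (an assumption,
    not necessarily what fr_pyenchant.py does).
    """
    counts = Counter(words)
    french, other = [], []
    for word, freq in sorted(counts.items()):
        (french if is_french(word) else other).append((word, freq))
    return french, other


# Toy stand-in for a French dictionary, for illustration only.
toy_french = {"le", "roi", "montagne"}
sample = ["le", "roi", "Crishna", "le", "montagne", "Crishna"]
french, other = split_plainwords(sample, lambda w: w.lower() in toy_french)
```

Passing the dictionary check in as a callable keeps the split logic independent of whichever spell-checking library is used, which may help if pyenchant has to be replaced.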

funderburkjim commented 5 years ago

Initial work

An HTML file (plainwords_other.html) was prepared to provide context for the roughly 4800 'other' words.

At this point, the task was turned over to @sanskritisampada. Her goal was to mark the French words (with an 'F') and the Sanskrit words (with an 'S') in the list of plainwords_other words.

Even with context, this is a difficult task, partly because of the nature of the Burnouf dictionary, which contains:

cognate words in many languages
scientific (Latinate) names of plants and animals
modern versions of place names
probably other word categories not yet noticed
probable French words that were borrowed into French from Sanskrit

funderburkjim commented 5 years ago

Google word detection tool

Sampada reported that this word identification was slow going. This prompted a search for ways to speed up the process, and somewhere along the line I became aware of the language detection functionality of Google Translate. In particular, there is a Python API, as described [here].

This was adapted for the current purposes in the sample_detect.py program.

After merging with what had been done thus far, the result was burnouf_sampada_detect.txt. Note that each line now has, in addition to the word and its frequency:

a placeholder for the word identification
the language according to the Google language detection tool
a confidence number, also provided by the language detection tool.
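The record layout described above can be sketched as follows. This is not sample_detect.py itself: the tab separator, the "?" placeholder, and the function names are assumptions, and the detector is injected so the example runs without network access. A real detector could wrap a language detection service that returns a language code and a confidence value.

```python
def detect_lines(word_freqs, detect):
    """Build one tab-separated record per word.

    Fields: word, frequency, placeholder for the manual mark, detected
    language, confidence. The tab separator and the "?" placeholder are
    assumptions; the real file layout may differ.
    `detect` is any callable returning (language_code, confidence).
    """
    lines = []
    for word, freq in word_freqs:
        lang, conf = detect(word)
        lines.append(f"{word}\t{freq}\t?\t{lang}\t{conf:.2f}")
    return lines


# Fake detector standing in for a real language detection service.
def fake_detect(word):
    return ("fr", 0.5) if word.islower() else ("hi", 0.25)


records = detect_lines([("roi", 12), ("Crishna", 3)], fake_detect)
```

The placeholder field is where a reviewer would later enter 'F', 'S', or 'P'.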

Interestingly, even though the language detection is often quite odd, Sampada found it sped up the process of identification.

The end result of the identifications thus far is in burnouf_sampada_detect_all.txt, with

1525 words marked as French
968 marked as Sanskrit
86 marked as place names ('P')

All in all, about 2579 were marked, and 2264 remain unmarked.
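These totals agree with the size of plainwords_other.txt reported earlier. A quick arithmetic check, using only the figures given in this thread (the variable names are illustrative):

```python
from collections import Counter

# Counts reported above, keyed by the mark used in the file.
marked = Counter({"F": 1525, "S": 968, "P": 86})
total_other = 4843  # number of words in plainwords_other.txt

total_marked = sum(marked.values())
unmarked = total_other - total_marked
```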

funderburkjim commented 5 years ago

French corrections

During the process of marking, Sampada identified many spelling corrections for French words. With some editing, these were converted into about 270 digitization correction transactions.

funderburkjim commented 5 years ago

Sanskrit corrections and markup

The plainwords identified as Sanskrit were examined with regard to their spelling correctness in light of modern IAST spelling conventions. As mentioned in the discussion of Burnouf's use of diacritics in representing Sanskrit words, many of these conventions differ from the modern IAST conventions. Spelling changes were made so that the resulting digitization uses modern IAST spellings for these non-italic Sanskrit words.

After such modernization changes, the resulting Sanskrit words were converted to SLP1 and compared to the spellings of headwords in the Monier-Williams dictionary. This resulted in several corrections to spellings (for instance, 'Crishna' was changed to 'Kṛṣṇa' in 3 places).

The identified Sanskrit words, whether needing correction or not, were entered in a form which maintains their identification as Sanskrit words:

<s1 slp1="tretAyuga">Tretāyuga</s1>

This markup form had previously been used for a similar purpose in the revision to the MW digitization.

All the Sanskrit plain word digitization changes are present in the manualByLine_sancorr.txt file.
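The conversion to SLP1 and the markup form shown above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program actually used: the function names `iast_to_slp1` and `mark_sanskrit` are hypothetical, the input is assumed to be modern IAST already, and the word is lowercased before conversion because the example above pairs 'Tretāyuga' with slp1="tretAyuga". The greedy digraph matching treats every 'ai', 'au', 'kh', and so on as a single sound, which can be wrong in rare cases such as vowel hiatus.

```python
import unicodedata

# IAST -> SLP1 table. Digraphs are tried before single letters.
_IAST_SLP1 = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y", "ṭ": "w", "ḍ": "q", "ṇ": "R",
    "ś": "S", "ṣ": "z",
}
# Normalize keys so they match NFC-normalized input.
_TABLE = {unicodedata.normalize("NFC", k): v for k, v in _IAST_SLP1.items()}


def iast_to_slp1(text):
    """Convert modern IAST to SLP1 (lowercased; letters not in the table pass through)."""
    text = unicodedata.normalize("NFC", text.lower())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in _TABLE:
            out.append(_TABLE[pair])
            i += 2
        else:
            out.append(_TABLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


def mark_sanskrit(word):
    """Wrap an IAST word in the <s1> markup, keeping the original spelling as text."""
    return f'<s1 slp1="{iast_to_slp1(word)}">{word}</s1>'


# Toy headword set standing in for the Monier-Williams headword list.
toy_mw_headwords = {"kfzRa", "tretAyuga"}
candidates = ["Kṛṣṇa", "Tretāyuga", "Crishna"]
not_found = [w for w in candidates if iast_to_slp1(w) not in toy_mw_headwords]
```

Words that land in `not_found` would be candidates for manual review, in the way 'Crishna' was caught by the comparison with the Monier-Williams headwords.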

gasyoun commented 5 years ago

'Crishna' was changed to 'Kṛṣṇa'

This is amazing.

drdhaval2785 commented 3 years ago

@funderburkjim Can the IAST conversion in BUR be treated as finished?

All in all, about 2579 were marked, and 2264 remain unmarked.

This line stopped me from pressing the close button.

funderburkjim commented 3 years ago

There appears to be more that could be done to improve Burnouf, starting with further examination based on burnouf_sampada_detect_all.txt.

gasyoun commented 3 years ago

There appears to be more that could be done to improve Burnouf

Let a French Sanskrit scholar be born and finalize it.

sanskritisampada commented 3 years ago

Perhaps I could contribute further after the AP 90 task is complete.

On Sat, 19 Dec 2020, 22:29 Mārcis Gasūns wrote (https://github.com/sanskrit-lexicon/CORRECTIONS/issues/420#issuecomment-748528035):

Let a French Sanskrit scholar be born and finalize it.

gasyoun commented 3 years ago

I could contribute further after the AP 90 task is complete.

You're a true miracle, Sampada.

funderburkjim commented 3 years ago

@sanskritisampada Good idea
