Sole occurrence of § - Githubissues

drdhaval2785 commented 7 years ago

Lnum 20045 <HI1>Uttara. Only the first § agrees with the Jābāla. Non-UTF8 character.

funderburkjim commented 7 years ago

It IS a UTF-8 character. The file pywork/check_ea1.txt (or also pywork/acc-meta2.txt) gives a list of all extended ASCII characters in the digitization. This particular one is listed there, and it does appear in the printed text, under headword rAmatApanIyopanizad:

§  (\u00a7)     1 := SECTION SIGN

There are a few other characters that appear only once -

«  (\u00ab)     1 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»  (\u00bb)     1 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
The above two occur in the digitization of the preface title material -- probably no need to modify them.

¯  (\u00af)     1 := MACRON
150359  <L>34131<pc>2-064,2<k1>nibanDasarvasva<k2>nibanDasa¯rvasva      PROBABLE ERROR
150360 {#nibanDasa¯rvasva#}¦ <ab type="subj">dh</ab>. by Mahādeva, <ab type="pers">son</ab> of Śrīpati. Sūcīpu-

„  (\u201e)     1 := DOUBLE LOW-9 QUOTATION MARK    PROBABLE ERROR
<L>42183<pc>3-013,1<k1>ASvalAyana<k2>ASvalAyana

184231 <HI1>C. by Viṣṇugūḍha („Uttaraṣaṭkaprayoga-
184232 <>paddhati”). <ls>AS</ls> p. 27.

and a few that appear a small number of times.
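
For anyone who wants to reproduce a listing like the one above, here is a minimal Python 3 sketch. It is not the script that produced pywork/check_ea1.txt; the function names `non_ascii_inventory` and `format_entry` are hypothetical, and the sample string is just the line quoted in this issue.

```python
import unicodedata
from collections import Counter

def non_ascii_inventory(text):
    """Count every character above U+007F in the given text."""
    return Counter(ch for ch in text if ord(ch) > 0x7F)

def format_entry(ch, count):
    """Format one inventory line: character, code point, count, Unicode name."""
    name = unicodedata.name(ch, "UNKNOWN")
    return "%s  (\\u%04x) %5d := %s" % (ch, ord(ch), count, name)

# Sample text taken from the line quoted in this issue (assumption: the real
# input would be the full digitization file, read as UTF-8).
sample = "<HI1>Uttara. Only the first § agrees with the Jābāla."

for ch, count in sorted(non_ascii_inventory(sample).items()):
    print(format_entry(ch, count))
```

Characters with a count of 1 are the ones worth checking against the printed text, as was done for the cases listed above.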

drdhaval2785 commented 7 years ago

I understand that it is a UTF-8 character. But I feel that the current version in the file is an HTML-encoded hex entity, and not UTF-8.

HTML Entity (hex) &#xa7; instead of U+00A7.

I noticed this because decode('utf-8') threw an error in Python.
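
A small Python 3 sketch of the distinction being raised here, in case it helps: a numeric character reference is plain ASCII text, while the literal character is two bytes in UTF-8. The helper `find_numeric_refs` and the two sample strings are hypothetical, not taken from the ACC code.

```python
import html
import re

# Matches numeric character references such as &#xa7; (hex) or &#167; (decimal).
NUMERIC_REF = re.compile(r"&#(?:[xX][0-9a-fA-F]+|[0-9]+);")

def find_numeric_refs(line):
    """Return all numeric character references found in a line."""
    return NUMERIC_REF.findall(line)

# Two hypothetical versions of the same line: one with the entity, one literal.
entity_form = "Only the first &#xa7; agrees"
literal_form = "Only the first \u00a7 agrees"

print(find_numeric_refs(entity_form))    # references present in the entity form
print(find_numeric_refs(literal_form))   # none expected in the literal form

# html.unescape turns the reference into the actual character U+00A7.
print(html.unescape(entity_form) == literal_form)

# The literal character is stored as the two bytes C2 A7 in UTF-8.
print(literal_form.encode("utf-8"))
```

Running `find_numeric_refs` over each line of the file would show whether the line in question holds an entity or the literal character.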

funderburkjim commented 7 years ago

I'd have to see the full context of the decode error you mention in order to reproduce the error.

I think it is properly encoded as UTF-8 in the acc digitization. Otherwise, when we read the file opened with codecs.open('acc5.txt','r','utf-8'), there would be an error when reading the line in question.

Please send code with a full example illustrating the decode error -- we need to fully understand the concern you are raising and not leave it dangling.
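
As a starting point for such an example, here is a minimal Python 3 sketch of the check described above: strict UTF-8 decoding fails on the first line that is not valid UTF-8. The function name `first_decode_error` is hypothetical, and the two byte strings are made-up samples (the second is deliberately encoded as Latin-1 to force a failure).

```python
def first_decode_error(data, encoding="utf-8"):
    """Return (line_number, error) for the first undecodable line, or None."""
    for lineno, raw in enumerate(data.splitlines(keepends=True), start=1):
        try:
            raw.decode(encoding)  # strict decoding, raises on invalid bytes
        except UnicodeDecodeError as err:
            return lineno, err
    return None

text = "first line\nOnly the first \u00a7 agrees\n"
good = text.encode("utf-8")    # section sign stored as bytes C2 A7
bad = text.encode("latin-1")   # section sign stored as the single byte A7

print(first_decode_error(good))  # no error expected for valid UTF-8
print(first_decode_error(bad))   # reports the line holding the lone A7 byte
```

To test the real file, one could pass the result of `open('acc5.txt', 'rb').read()` as `data`; if that returns None, the file decodes cleanly as UTF-8.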

drdhaval2785 commented 7 years ago

OK. I have stripped the hash and ampersand, so maybe that converted it to the present form. It is not at all important. Now I am storing it to the output with > . So the question doesn't survive.

sanskrit-lexicon / ACC

Sole occurrence of § #8