Closed drdhaval2785 closed 7 years ago
It IS a UTF-8 character. The file pywork/check_ea1.txt (or pywork/acc-meta2.txt) lists all extended ASCII characters in the digitization. This particular one is the section sign, and it does appear in the printed text, under headword rAmatApanIyopanizad:
§ (\u00a7) 1 := SECTION SIGN
There are a few other characters that appear only once -
« (\u00ab) 1 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» (\u00bb) 1 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
The above two occur in digitization of preface material titling -- probably no need to modify
¯ (\u00af) 1 := MACRON
150359 <L>34131<pc>2-064,2<k1>nibanDasarvasva<k2>nibanDasa¯rvasva PROBABLE ERROR
150360 {#nibanDasa¯rvasva#}¦ <ab type="subj">dh</ab>. by Mahādeva, <ab type="pers">son</ab> of Śrīpati. Sūcīpu-
„ (\u201e) 1 := DOUBLE LOW-9 QUOTATION MARK PROBABLE ERROR
<L>42183<pc>3-013,1<k1>ASvalAyana<k2>ASvalAyana
184231 <HI1>C. by Viṣṇugūḍha („Uttaraṣaṭkaprayoga-
184232 <>paddhati”). <ls>AS</ls> p. 27.
and a few that appear a small number of times.
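For anyone who wants to reproduce a report in the same shape as the entries above, here is a minimal sketch. It is not the script that produced pywork/check_ea1.txt (that script is not shown in this thread); the function name non_ascii_report is hypothetical, and "extended ASCII" is assumed to mean any code point above 127.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Return one report line per distinct non-ASCII character in text.

    Hypothetical helper: the line format imitates the entries quoted
    above, e.g. '<char> (\\u00a7) 1 := SECTION SIGN'.
    """
    # Count every character whose code point is above the ASCII range.
    counts = Counter(ch for ch in text if ord(ch) > 127)
    lines = []
    for ch, n in sorted(counts.items()):
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append("%s (\\u%04x) %d := %s" % (ch, ord(ch), n, name))
    return lines


# Sample text taken from the line discussed in this issue
# (\u00a7 is the section sign, \u0101 is a with macron).
sample = "Only the first \u00a7 agrees with the J\u0101b\u0101la."
for line in non_ascii_report(sample):
    print(line)
```

Characters that show up only once or twice in such a report are the ones worth checking against the printed text.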
I understand that it is a UTF-8 character. But I feel that the current version in the file is an HTML-encoded hex entity, and not UTF-8.
HTML Entity (hex) &#xa7;
instead of U+00A7.
I came to notice because decode('utf-8') threw an error in Python.
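To make the three candidate forms concrete, here is a small Python 3 sketch comparing the HTML hex entity, the UTF-8 bytes, and a lone Latin-1 byte for the section sign. It does not reproduce the reported error, whose context is not shown here; decodes_as_utf8 is a hypothetical helper.

```python
import html

entity = "&#xa7;"                      # HTML hex entity, plain ASCII text
char = html.unescape(entity)           # the actual character U+00A7
utf8_bytes = char.encode("utf-8")      # b'\xc2\xa7', two bytes
latin1_bytes = char.encode("latin-1")  # b'\xa7', one byte


def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))             # proper UTF-8 encoding
print(decodes_as_utf8(latin1_bytes))           # lone 0xA7 byte is not valid UTF-8
print(decodes_as_utf8(entity.encode("ascii")))  # the entity itself is only ASCII
```

Of the three, only the lone 0xA7 byte fails to decode. An HTML entity in the file would decode without any error, since it consists of ASCII characters only.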
I'd have to see the full context of the decode error you mention in order to reproduce the error.
I think it is properly encoded as UTF-8 in the acc digitizations. Otherwise, when we read the file opened with codecs.open('acc5.txt','r','utf-8'), there would be an error when reading the line in question.
Please send code with full example illustrating the decode error -- we need to fully understand the concern you are raising and not leave it dangling.
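As a starting point for such an example, the sketch below reads a file in binary mode and reports every line that fails to decode as UTF-8, together with its line number. It is written for Python 3, and the function name find_bad_lines is hypothetical. The demo uses a temporary file rather than acc5.txt, so the bad line in it is made up for illustration.

```python
import os
import tempfile


def find_bad_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8."""
    bad = []
    # Binary mode: no decoding happens until we ask for it, line by line.
    with open(path, "rb") as f:
        for lnum, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lnum, raw))
    return bad


# Demo on a throwaway file: line 1 is proper UTF-8,
# line 2 carries a lone 0xA7 byte (Latin-1 section sign).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("first \u00a7 agrees\n".encode("utf-8"))
    tmp.write(b"first \xa7 agrees\n")
demo_path = tmp.name

demo_bad = find_bad_lines(demo_path)
print(demo_bad)
os.remove(demo_path)
```

Running find_bad_lines('acc5.txt') against the real digitization should return an empty list if the file is valid UTF-8 throughout.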
OK. I have stripped the hash and ampersand, so that may have converted it to the present form. It is not at all important. Now I am storing it to the output with > . So the question doesn't survive.
Lnum 20045
<HI1>Uttara. Only the first § agrees with the Jābāla.
Non-UTF8 character.