sanskrit-lexicon / LRV

Convert the data of L R Vaidya Sanskrit-English dictionary to CDSL format

Duplicate page-sequence #8

Closed drdhaval2785 closed 2 years ago

drdhaval2785 commented 2 years ago

Duplicate page-sequence

@Andhrabharati has used each page-sequence exactly once per new headword. The following page-sequences are used more than once and need investigation.

035-05
133-09
227-24
646-01
717-23
120-12
149-14
153-24
153-31
169-15
182-05
182-05
190-24
190-24
195-16
199-05
205-21
217-19
219-12
226-13
230-11
237-18
241-28
256-14
270-09
270-13
276-30
278-17
287-05
293-25
295-17
306-04
307-06
307-06
309-07
309-09
312-13
312-13
312-13
317-17
321-10
326-20
335-08
343-09
345-02
345-02
347-07
361-05
367-09
369-10
372-01
392-33
397-18
398-21
399-06
399-38
430-01
438-18
438-18
441-01
443-04
443-04
450-02
453-31
460-13
460-13
487-24
489-21
489-21
490-21
503-13
519-01
521-16
521-16
538-15
567-06
567-15
567-15
568-16
572-21
572-22
596-26
596-26
601-21
605-03
605-03
605-03
650-09
653-28
671-16
671-32
672-33
693-03
694-10
707-02
711-05
722-21
748-14
752-08
752-08
761-30
765-04
769-24
776-14
776-17
786-30
792-10
797-08
797-30
798-12
836-08
838-33
836-08
839-22

Code

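import csv

if __name__ == "__main__":
    # Page-sequences (second tab-separated column) seen so far.
    seen = set()
    # newline='' is what the csv module recommends for files it reads.
    with open('../interim/lrv_0.txt', 'r', encoding='utf-8', newline='') as fin:
        reader = csv.reader(fin, delimiter='\t')
        for row in reader:
            # Skip rows that have no page-sequence column or an empty one.
            if len(row) < 2 or row[1] == '':
                continue
            pc = row[1]
            # Print every repeat occurrence of an already seen page-sequence.
            if pc in seen:
                print(pc)
            seen.add(pc)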
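
The check above prints one line per repeat. A variant that reports each repeated page-sequence once, with its number of occurrences, can make the investigation easier. This is only a sketch: the function name `duplicate_page_sequences` and the sample rows are hypothetical, and it assumes the same layout as the script above (tab-separated rows with the page-sequence in the second column).

```python
import csv
import io
from collections import Counter


def duplicate_page_sequences(lines):
    """Map each page-sequence that occurs more than once to its count.

    `lines` is any iterable of tab-separated text lines (an open file works).
    Assumption: the page-sequence is the second column, as in the script above.
    """
    # QUOTE_NONE keeps quote characters inside dictionary text as plain data.
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    counts = Counter(
        row[1] for row in reader if len(row) > 1 and row[1] != ''
    )
    return {pc: n for pc, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the real file.
    sample = io.StringIO(
        "00001\t035-05\tx\n"
        "00002\t035-05\ty\n"
        "00003\t035-06\tz\n"
    )
    print(duplicate_page_sequences(sample))  # prints {'035-05': 2}
```
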
drdhaval2785 commented 2 years ago

Only the following five are errors in the original data as received. The rest are artefacts that I introduced.

035-05
01956   035-05  <p> अन्वेषणा    <p> अन्वेषणा    <p> अन्वेषणा    #-f.    $--See अन्वेष.  14
133-09
06536   133-09  <p> उद्भासुर    <p> उद्भासुर    <p> उद्भासुर    #-a. (f. रा)    $--Radiant, shining, splendid, /Am.S./76.   41
227-24
11337   227-24  <p> खाता    <p> खाता    <p> खाता    #-f.    $--An artificial pond.  22
646-01
33917   646-01  <p> वल्लर   <p> वल्लर   <p> वल्लर   #-n.    $--1. Aloe-wood; 2. a bower; 3. a branching footstalk.  54
717-23
37651   717-23  <p> शिखंडिन्    <p> शिखंडिन्    <p> शिखंडिन्    #-m.    $--1. A peacock, द्विधा भिन्नाः शिखंडिभिः /R./i.39, /K.S./i.15; 2. a cock; 3. an arrow; 4. a peacock’s tail; 5. an epithet of Vishṇu; 6. a kind of jasmine; 7. name of a son of Drupada. (See App. II. under अंबा.) 211
drdhaval2785 commented 2 years ago

I am correcting them to be sequential. This is corroborated by the following entries, where consecutive headwords carry consecutive page-sequences.

02095   037-22  <p> अपनुत्ति    <p> अपनुत्ति    <p> अपनुत्ति    #-f.    $--Removing, taking, away, e.g. पापानामपनुत्तये.    48
02096   037-23  <p> अपनोद   <p> अपनोद   <p> अपनोद   #-m.    $--See अपनुत्ति, e.g. ब्रह्महत्यापनोदाय.    40
02097   037-24  <p> अपनोदन  <p> अपनोदन  <p> अपनोदन  #-n.    $--See अपनुत्ति.    16

Each duplicate has been given the next sequence number.
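
One way such a renumbering could be done is sketched below. This is not necessarily the procedure used for the actual fix: the function name `make_sequential` is hypothetical, and the sketch assumes that page-sequences have the form `PPP-SS`, that entries arrive in file order, and that sequence numbers should strictly increase within a page. Note that under these assumptions a bumped duplicate can push later entries on the same page forward as well.

```python
def make_sequential(page_seqs):
    """Return a copy of `page_seqs` with no repeated value within a page.

    Assumptions (not confirmed by the thread): values look like '035-05',
    are given in file order, and must strictly increase within one page.
    """
    fixed = []
    last_page, last_seq = None, 0
    for ps in page_seqs:
        page, seq = ps.split('-')
        n = int(seq)
        # A repeated (or lower) number on the same page gets the next number.
        if page == last_page and n <= last_seq:
            n = last_seq + 1
        fixed.append(f"{page}-{n:02d}")
        last_page, last_seq = page, n
    return fixed


if __name__ == "__main__":
    # Hypothetical input with one duplicate on page 035.
    print(make_sequential(['035-04', '035-05', '035-05', '036-01']))
    # prints ['035-04', '035-05', '035-06', '036-01']
```
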

Andhrabharati commented 2 years ago

Glad that there are not many errors in my text.

I have also seen some mAtrA corrections in the other issue.

Andhrabharati commented 2 years ago

Though not related to this issue, I saw that the meaning portions starting with Roman numerals have been made into new entries.

I would suggest combining them into the body of a single entry, as in AP90.

Andhrabharati commented 2 years ago

@drdhaval2785

I just looked into my LRV file again. The Roman-numbered lines mostly indicate a lex. change for the HW, though in a few instances they denote a change of sense. So you may take appropriate action in handling these. [In my opinion, they should not be made into separate entries.]

I also see that you have included the last column of the Excel file in your text file. It does not belong in the text file and should be removed; it was only a length count of the body column that I kept in my Excel file.
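
Dropping that trailing count could look like the sketch below. It assumes the text file is tab-separated with the length count as its last field, which matches the sample rows quoted earlier in this thread; the function name `drop_last_column` and the sample line are hypothetical.

```python
def drop_last_column(line):
    """Remove the last tab-separated field from one line of text.

    Assumption: the last field is the body-length count from the Excel file.
    Lines without any tab are returned unchanged (minus the newline).
    """
    body = line.rstrip('\n')
    if '\t' not in body:
        return body
    # Split once from the right, so only the final field is discarded.
    return body.rsplit('\t', 1)[0]


if __name__ == "__main__":
    # Hypothetical line in the same shape as the rows quoted in this thread.
    print(drop_last_column("01956\t035-05\t$--See X.\t14\n"))
```
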

drdhaval2785 commented 2 years ago

Sure, I will keep that in mind. I have not removed the last column, but the file generated from your data will not contain it.

drdhaval2785 commented 2 years ago

No duplicate page-sequences now.