pechajobs / Admin


RFJ008 - Generic etext of all Pedurma volumes with non-Derge texts #39

Open ngawangtrinley opened 1 year ago

ngawangtrinley commented 1 year ago

Corresponding RFC: RFC008

Client: Orna and Sebastian @ Buddha Nexus

Job manager: Tashi Tsering

Preliminary questions

Check Yes or No

  1. Has this job already been done: No
  2. Is it more important than other work we could be doing: Yes
  3. Would this work contribute to the mission of PechaJobs: Yes
  4. Does it offer more business value than alternative solutions: Yes
  5. Does it take less effort than alternative solutions: Yes

If you answered Yes to all questions, continue to the request for job (RFJ).

Request for job

1. Summary

Buddha Nexus needs all Pedurma volumes containing non-Derge texts to be OCRed with several OCR model configurations, in order to generate generic etexts of the best possible quality.

2. Keyword definitions

None.

3. Problem and context

The བཀའ་བསྟན་དཔེ་བསྡུར་མ། is a modern comparative edition of the Tibetan canon. It is based on the Derge edition and also contains texts found only in other editions, such as Peking and Narthang. These texts were added in several volumes throughout the canon, usually in the last volume of a section.

Buddha Nexus needs to complete its Tibetan Buddhist Canon with the texts that are not found in the Derge edition. We want the best possible OCR quality from the བཀའ་བསྟན་དཔེ་བསྡུར་མ།.

4. Job description and scope

This job should result in two outcomes:

  1. 3 or more OCRed etexts of the relevant volumes
  2. a generic etext version created from the above sets of etexts and the Pedurma etexts currently available on BDRC/OP (Namsel and an old Google OCR)

The OCR needs to be run with the following 3 model configurations, provided they produce different outputs:

model="builtin/weekly"
model="builtin/weekly" & language_hints=["bo-t-i0-handwrit"]
model="builtin/weekly" & language_hints=["und-t-i0-handwrit"]

Note: see this article for more information
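
The condition that the configurations must produce different outputs could be checked automatically. The following is only a minimal sketch, not an agreed method: the function names and the choice to compare outputs after NFC normalization and whitespace trimming are assumptions made for illustration.

```python
import hashlib
import unicodedata
from typing import Dict


def fingerprint(text: str) -> str:
    """Hash an OCR output after NFC normalization and trimming (assumed comparison rule)."""
    normalized = unicodedata.normalize("NFC", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def distinct_outputs(outputs: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first configuration seen for each distinct OCR output."""
    seen = set()
    kept = {}
    for config_name, text in outputs.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            kept[config_name] = text
    return kept
```

If two configurations return the same text for a page, only the first one is kept, so redundant etexts are not carried into the generic version.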

The OCR needs to be run on the following 30 volumes:

I1PD95846
I1PD95852
I1PD95853
I1PD95859
I1PD95863
I1PD95869
I1PD95870
I1PD95871
I1PD95872
I1PD95877
I1PD95878
I1PD95879
I1PD95882
I1PD95883
I1PD95884
I1PD95885
I1PD95886
I1PD95887
I1PD95888
I1PD95889
I1PD95891
I1PD95892
I1PD95893
I1PD95902
I1PD95942
I1PD95954
I1PD95959
I1PD95960
I1PD95961
I1PD95965
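
As a rough illustration of the resulting workload, the volumes and the three configurations above can be combined into a list of OCR runs. This is a sketch under assumptions: the dictionary layout, the `name` labels and the `build_jobs` helper are hypothetical and do not describe an existing PechaJobs tool or any particular OCR client.

```python
from itertools import product

# The three configurations requested in this RFJ; the "name" labels are hypothetical.
OCR_CONFIGS = [
    {"name": "weekly", "model": "builtin/weekly", "language_hints": []},
    {"name": "weekly-bo-handwrit", "model": "builtin/weekly",
     "language_hints": ["bo-t-i0-handwrit"]},
    {"name": "weekly-und-handwrit", "model": "builtin/weekly",
     "language_hints": ["und-t-i0-handwrit"]},
]


def build_jobs(volume_ids):
    """Return one job description per (volume, configuration) pair."""
    return [
        {"volume": volume_id, **config}
        for volume_id, config in product(volume_ids, OCR_CONFIGS)
    ]


# Example with the first two volumes of the list above.
jobs = build_jobs(["I1PD95846", "I1PD95852"])
print(len(jobs))
```

With all 30 volumes and 3 configurations, this gives 90 OCR runs, fewer if some configurations turn out to produce identical output.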

5. Constraints

6. Approach

7. Other options

8. Risks and unknowns

Nothing to report.

9. Goals

An etext file for each text listed in anotind.csv, in a format to be determined with Sebastian.

10. Timeline
