Preliminary questions

Check Yes or No

Has this job already been done: No
Is it more important than other work we could be doing: Yes
Would this work contribute to the mission of PechaJobs: Yes
Does it offer more business value than alternative solutions: Yes
Does it take less effort than alternative solutions: Yes

If you answered Yes to all questions, continue to the request for job (RFJ).

Request for job

1. Summary

Buddha Nexus needs all Pedurma volumes that contain non-Derge texts to be OCRed with several OCR models in order to generate generic etexts of the best possible quality.

2. Keyword definitions

None.

3. Problem and context

The བཀའ་བསྟན་དཔེ་བསྡུར་མ། is a modern comparative edition of the Tibetan canon. It is based on the Derge Edition and also contains texts only found in other editions such as Peking and Narthang. These texts were added in several volumes throughout the canon, usually in the last volume of a section.

Buddha Nexus needs to complete its Tibetan Buddhist Canon with these texts not found in the Derge edition. We want to have the best possible quality OCR from the བཀའ་བསྟན་དཔེ་བསྡུར་མ།.

4. Job description and scope

This job should result in two outcomes:

3 or more OCRed etexts of the relevant volumes
A generic etext version created from the above sets of etexts and the Pedurma etexts currently available on BDRC/OP (Namsel and an old Google OCR)

The OCR needs to be run with each of the following 3 model options, provided they produce different output:

model="builtin/weekly"
model="builtin/weekly" & language_hints=["bo-t-i0-handwrit"]
model="builtin/weekly" & language_hints=["und-t-i0-handwrit"]

Note: see this article for more information
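
As a rough illustration, the three option sets listed above could be turned into request payloads as follows. This is only a sketch: it assumes the options are meant for the Google Cloud Vision images:annotate REST endpoint, which this RFJ does not state explicitly, and the option names (weekly, weekly_bo_handwrit, weekly_und_handwrit) and helper functions are hypothetical. No network call is made.

```python
import base64

# The three option sets from this RFJ. Mapping them onto Google Cloud Vision
# request fields is an assumption; the RFJ does not name the OCR service.
MODEL_OPTIONS = {
    "weekly": {"model": "builtin/weekly", "language_hints": []},
    "weekly_bo_handwrit": {
        "model": "builtin/weekly",
        "language_hints": ["bo-t-i0-handwrit"],
    },
    "weekly_und_handwrit": {
        "model": "builtin/weekly",
        "language_hints": ["und-t-i0-handwrit"],
    },
}


def build_request(image_bytes, model, language_hints):
    """Build one entry for the `requests` array of an images:annotate body."""
    request = {
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION", "model": model}],
    }
    # Only attach an image context when language hints are given.
    if language_hints:
        request["imageContext"] = {"languageHints": list(language_hints)}
    return request


def build_requests_for_page(image_bytes):
    """Return one request per option set, keyed by the (hypothetical) option name."""
    return {
        name: build_request(image_bytes, **options)
        for name, options in MODEL_OPTIONS.items()
    }
```

Keeping the option sets in one dictionary means each page image is submitted with identical settings for every volume, and the option name can be reused to label the resulting etext set.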

The OCR needs to be run on the following 30 volumes:

I1PD95846
I1PD95852
I1PD95853
I1PD95859
I1PD95863
I1PD95869
I1PD95870
I1PD95871
I1PD95872
I1PD95877
I1PD95878
I1PD95879
I1PD95882
I1PD95883
I1PD95884
I1PD95885
I1PD95886
I1PD95887
I1PD95888
I1PD95889
I1PD95891
I1PD95892
I1PD95893
I1PD95902
I1PD95942
I1PD95954
I1PD95959
I1PD95960
I1PD95961
I1PD95965

5. Constraints

The Pedurma notes and note markers will remain as noise in the etexts.
We might not be able to automatically isolate every single text out of the volumes. Humans will have to step in to complete that task if a) it is required, and b) there is a budget for it.

6. Approach

Test the 3 different model options as stated above to check if they produce different output
OCR the 30 volumes
Run the generic etext creation script
Export the texts listed in the anotind.csv from the output
Create an OP collection and add a view in the releases
Make the view of the OP collection available through an API endpoint
Mail the endpoint to Orna and Sebastian
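
The first step of the approach (checking whether the model options produce different output) could be sketched as below. This is a minimal illustration, not the project's actual tooling: it assumes the OCR output of each option for a sample page is already available as a plain string, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def compare_outputs(outputs):
    """Return {(name_a, name_b): similarity ratio} for every pair of OCR outputs.

    `outputs` maps an option name to the OCR text it produced for the same page.
    A ratio of 1.0 means the two texts are identical.
    """
    return {
        (a, b): SequenceMatcher(None, outputs[a], outputs[b], autojunk=False).ratio()
        for a, b in combinations(sorted(outputs), 2)
    }


def distinct_models(outputs):
    """Keep one option name per distinct output text (first name in sorted order)."""
    first_name_for_text = {}
    for name in sorted(outputs):
        first_name_for_text.setdefault(outputs[name], name)
    return sorted(first_name_for_text.values())
```

Running distinct_models on a few sample pages gives the set of options worth running on all 30 volumes; options whose output is identical to another option's on every sample can be skipped.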

7. Other options

Manual input and proofreading are too expensive.
The existing OCR etexts are too noisy.

8. Risks and unknowns

Nothing to report.

9. Goals

One etext file for each text listed in anotind.csv, in a format to be determined with Sebastian.

[ ] Contact Sebastian to confirm the final desired format.

RFJ008 - Generic etext of all Pedurma volumes with non-Derge texts #39