Is it more important than other work we could be doing: Yes
Would this work contribute to the mission of PechaJobs: Yes
Does it offer more business value than alternative solutions: Yes
Does it take less effort than alternative solutions: Yes
If you answered yes to all questions, continue to the request for job (RFJ).
Request for job
1. Summary
Buddha Nexus needs all Pedurma volumes that contain non-Derge texts to be OCRed with several OCR models in order to generate generic etexts of the best possible quality.
2. Keyword definitions
None.
3. Problem and context
The བཀའ་བསྟན་དཔེ་བསྡུར་མ། is a modern comparative edition of the Tibetan canon. It is based on the Derge Edition and also contains texts only found in other editions such as Peking and Narthang. These texts were added in several volumes throughout the canon, usually in the last volume of a section.
Buddha Nexus needs to complete its Tibetan Buddhist Canon with these texts not found in the Derge edition. We want to have the best possible quality OCR from the བཀའ་བསྟན་དཔེ་བསྡུར་མ།.
4. Job description and scope
This job should result in two outcomes:
3 or more OCRed etexts of the relevant volumes
A generic etext version created from the above sets of etexts and the Pedurma etexts currently available on BDRC/OP (Namsel and an old Google OCR)
We need the OCR to be done with 3 different models, provided they produce different output:
The Pedurma notes and note markers will remain as noise in the etexts.
We might not be able to automatically isolate every single text out of the volumes. Humans will have to step in to complete that task if a) it is required, and b) there is a budget for it.
6. Approach
Test the 3 different model options as stated above to check if they produce different output
OCR the 30 volumes
Run the generic etext creation script
Export the texts listed in the anotind.csv from the output
Create an OP collection and add a view in the releases
Make the view of the OP collection available through an API endpoint
Mail the endpoint to Orna and Sebastian
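The first step of the approach, checking whether the 3 model options actually produce different output, could be sketched roughly as below. This is only an illustration using Python's standard difflib module; the function names, the model labels and the threshold are assumptions made for the example and are not part of any existing OP tooling.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs):
    """Return {(name_a, name_b): ratio} for every pair of OCR outputs.

    `outputs` maps a (hypothetical) model label to the text it produced
    for the same page or volume.
    """
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(outputs.items()), 2):
        # autojunk=False keeps very frequent characters (e.g. the tsheg)
        # from being ignored by difflib's heuristic on long texts.
        matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
        scores[(name_a, name_b)] = matcher.ratio()
    return scores


def models_differ(outputs, threshold=1.0):
    """True if at least one pair of outputs is less similar than `threshold`."""
    return any(score < threshold for score in pairwise_similarity(outputs).values())
```

Comparing a handful of sample pages rather than whole volumes is probably enough for this check, since difflib's matching becomes slow on very long strings.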
7. Other options
Manual input and proofreading are too expensive
Current OCRs are too noisy
8. Risks and unknowns
Nothing to report.
9. Goals
Etext file for each text in anotind.csv in a format to be determined with Sebastian.
[ ] Contact Sebastian to confirm the final desired format.
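Until the final format is confirmed, the export of one file per text could look something like the sketch below. The layout of anotind.csv is not described in this request, so the column names (text_id, volume, start_line, end_line), the 1-based inclusive line ranges and the plain-text output are all assumptions for illustration only and would need to be adapted to the real file.

```python
import csv
from pathlib import Path


def export_texts(index_csv, volumes, out_dir,
                 id_col="text_id", vol_col="volume",
                 start_col="start_line", end_col="end_line"):
    """Write one plain-text file per row of the index and return the paths.

    `volumes` maps a volume identifier to the list of lines of its
    generic etext. All column names are hypothetical defaults.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(index_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            lines = volumes[row[vol_col]]
            # Assumption: line numbers are 1-based and the range is inclusive.
            start, end = int(row[start_col]), int(row[end_col])
            target = out_dir / f"{row[id_col]}.txt"
            target.write_text("\n".join(lines[start - 1:end]) + "\n", encoding="utf-8")
            written.append(target)
    return written
```

Keeping the column names as parameters means the same function can be reused once the actual structure of the index and the format agreed with Sebastian are known.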
Corresponding RFC: RFC008
Client: Orna and Sebastian @ Buddha Nexus
Job manager: Tashi Tsering
Preliminary questions
Check Yes or No
We need OCR to be done with 3 different models if they produce a different output:
Note: see this article for more information
The OCR needs to be run on the following 30 volumes:
5. Constraints
10. Timeline