We are building a RAG system from several history books about Bangladesh. The goal is to let people look up any specific historical event within seconds.
MIT License
Add feature to extract English texts along with Bangla #23
After the text extraction code is pushed to the dev branch (#22), we should add English text extraction capability. The code currently extracts Bangla text only, but some books contain both Bangla and English text, as noted by @svefn-g-englar.
Update the code to extract text in two modes: lang='ben' and lang='eng+ben', generating two text files.
You can either always generate both text files or let the user choose (via interactive input) to extract text with ben, eng+ben, or both. Which approach to take is at your discretion.
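A minimal sketch of the two-mode flow, in case it helps whoever picks this up. It assumes the existing extractor can be called as extract(pdf_path, lang=...) and returns the extracted text as a string; the helper names (output_path_for, parse_mode_choice, extract_all_modes) and the output file naming are hypothetical, not something already in the repo.

```python
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# Language codes named in this issue.
LANG_MODES = ("ben", "eng+ben")


def output_path_for(pdf_path: str, lang: str, out_dir: str = ".") -> Path:
    """Build a per-language output file name (hypothetical naming scheme)."""
    # "eng+ben" becomes "eng_ben" so the file name has no "+" in it.
    suffix = lang.replace("+", "_")
    return Path(out_dir) / f"{Path(pdf_path).stem}_{suffix}.txt"


def parse_mode_choice(choice: str) -> List[str]:
    """Map the user's answer ('ben', 'eng+ben', or 'both') to language codes."""
    normalized = choice.strip().lower()
    if normalized == "both":
        return list(LANG_MODES)
    if normalized in LANG_MODES:
        return [normalized]
    raise ValueError(f"Expected 'ben', 'eng+ben', or 'both', got {choice!r}")


def extract_all_modes(
    pdf_path: str,
    extract: Callable[..., str],
    out_dir: str = ".",
    langs: Sequence[str] = LANG_MODES,
) -> Dict[str, Path]:
    """Run the extractor once per language code and write one text file each.

    `extract` is assumed to accept (pdf_path, lang=...) and return a string.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = {}
    for lang in langs:
        text = extract(pdf_path, lang=lang)
        target = output_path_for(pdf_path, lang, out_dir)
        target.write_text(text, encoding="utf-8")
        written[lang] = target
    return written
```

For the interactive option, something like extract_all_modes(pdf_path, extract_text_from_scanned_pdf, langs=parse_mode_choice(input("ben, eng+ben, or both? "))) would cover all three choices, assuming the existing function matches the signature above.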
Why
It might make QA easier by giving us different baselines for comparing text extraction issues.
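One possible way to use the two outputs as baselines is a plain line diff. This is only a sketch using Python's standard difflib; the function name diff_extractions is hypothetical, and a line-level diff may be too coarse for OCR output, so treat it as a starting point.

```python
import difflib
from typing import List


def diff_extractions(ben_text: str, eng_ben_text: str) -> List[str]:
    """Return unified-diff lines between the 'ben' and 'eng+ben' outputs.

    An empty list means both modes produced identical text.
    """
    return list(
        difflib.unified_diff(
            ben_text.splitlines(),
            eng_ben_text.splitlines(),
            fromfile="ben",
            tofile="eng+ben",
            lineterm="",
            n=0,  # no context lines, only the lines that differ
        )
    )
```

Lines starting with "-" appear only in the ben output and lines starting with "+" only in the eng+ben output, which should point to the passages where the language setting changes the result.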
The lang parameter may need to be updated: extract_text_from_scanned_pdf(pdf_path, lang='ben'...)