mir-abir-hossain / real-history-of-Bangladesh

We are taking several history books about Bangladesh and building a RAG system on top of them. The goal is to let people look up any specific historical event within seconds.
MIT License

Add feature to extract English texts along with Bangla #23

Closed (dg1223 closed this 3 days ago)

dg1223 commented 2 weeks ago

After the text extraction code is pushed to the dev branch (#22), we should add English text extraction. The code currently extracts only Bangla text, but some books contain both Bangla and English, as noted by @svefn-g-englar.

The lang parameter may need to be updated:

extract_text_from_scanned_pdf(pdf_path, lang='ben'...)

To Do

  1. Update the code to extract text in two modes, lang='ben' and lang='eng+ben', generating one text file per mode.
  2. Either always generate both text files, or prompt the user (interactive input) to choose ben, eng+ben, or both. Which approach to take is at your discretion.
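
A minimal sketch of the two-mode extraction described above, not the project's actual implementation. It assumes `extract_text_from_scanned_pdf` accepts `pdf_path` and a `lang` keyword and returns the extracted text as a string; the real signature is truncated in this issue, so the extractor is passed in as a parameter. The helper name `extract_in_modes`, the `out_dir` argument, and the output file naming are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# The two language modes named in this issue.
DEFAULT_MODES = ("ben", "eng+ben")


def extract_in_modes(
    pdf_path: str,
    extract: Callable[..., str],
    modes: Iterable[str] = DEFAULT_MODES,
    out_dir: str = ".",
) -> Dict[str, Path]:
    """Run `extract` once per language mode and write one text file per mode.

    `extract` is assumed to behave like extract_text_from_scanned_pdf:
    called as extract(pdf_path, lang=...) and returning a string.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem

    written: Dict[str, Path] = {}
    for lang in modes:
        text = extract(pdf_path, lang=lang)
        # Replace '+' so 'eng+ben' becomes 'eng_ben' in the file name.
        target = out / f"{stem}_{lang.replace('+', '_')}.txt"
        target.write_text(text, encoding="utf-8")
        written[lang] = target
    return written
```

Passing `modes=("ben",)` or `modes=("eng+ben",)` covers the single-mode choices from item 2, so an interactive prompt would only need to map the user's answer to a tuple of modes.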

Why

It might make QA easier by giving us different baselines to compare when investigating text extraction issues.

dg1223 commented 2 weeks ago

@Irtizaya updated the issue description

dg1223 commented 3 days ago

Closing this because it's redundant. Work was already done in #25.