Add feature to extract English texts along with Bangla

dg1223 commented 2 weeks ago

After the text extraction code is pushed to the dev branch (#22), we should add English text extraction. The code currently extracts only Bangla text, but some books contain both Bangla and English, as noted by @svefn-g-englar.

The lang parameter may need to be updated:

extract_text_from_scanned_pdf(pdf_path, lang='ben'...)

To Do

Update the code to support two extraction modes, lang='ben' and lang='eng+ben', each producing its own text file.
Either always generate both files, or let the user choose (via interactive input) between ben, eng+ben, or both. Which approach to take is at your discretion.
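
A minimal sketch of the two-mode idea, in case it helps whoever picks this up. It assumes extract_text_from_scanned_pdf accepts a lang keyword and returns the extracted text as a string, which may not match the real signature. The helper name extract_in_modes, the LANG_MODES table, and the output file naming are all hypothetical. The extractor is passed in as an argument so the wrapper does not depend on how OCR is actually done.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Language strings to run for each mode (hypothetical table).
# 'eng+ben' uses Tesseract's '+' syntax for combining languages.
LANG_MODES: Dict[str, List[str]] = {
    "ben": ["ben"],
    "eng+ben": ["eng+ben"],
    "both": ["ben", "eng+ben"],
}


def extract_in_modes(
    pdf_path: str,
    extract: Callable[..., str],
    mode: str = "both",
    out_dir: Optional[str] = None,
) -> List[Path]:
    """Run `extract` once per language string and write one text file each.

    `extract` is assumed to behave like extract_text_from_scanned_pdf:
    called as extract(pdf_path, lang=...) and returning a str.
    """
    if mode not in LANG_MODES:
        raise ValueError(f"mode must be one of {sorted(LANG_MODES)}, got {mode!r}")

    pdf = Path(pdf_path)
    # Default to writing next to the PDF (an assumption, not project policy).
    target = Path(out_dir) if out_dir else pdf.parent
    target.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for lang in LANG_MODES[mode]:
        text = extract(str(pdf), lang=lang)
        # e.g. book_ben.txt and book_eng_ben.txt
        out_file = target / f"{pdf.stem}_{lang.replace('+', '_')}.txt"
        out_file.write_text(text, encoding="utf-8")
        written.append(out_file)
    return written
```

For the interactive option, the mode could come from something like input("Mode (ben / eng+ben / both): ").strip() and be passed straight through as mode; anything outside the three choices raises ValueError.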

Why

It might make QA easier by giving us different baselines to compare when investigating text extraction issues.

dg1223 commented 2 weeks ago

@Irtizaya updated the issue description

dg1223 commented 3 days ago

Closing this because it's redundant. Work was already done in #25.

mir-abir-hossain / real-history-of-Bangladesh

Add feature to extract English texts along with Bangla #23
