mir-abir-hossain / real-history-of-Bangladesh

We are taking several history books about Bangladesh and building a RAG system. The goal is to let people look up any specific historical event within seconds.
MIT License
18 stars, 12 forks

Data cleaning task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 7 #18

Open dg1223 opened 2 months ago

dg1223 commented 2 months ago

This is the data cleanup task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 7.

What to do

  1. Remove all non-Bangla characters and words that do not make any sense or cannot be traced back to the original text.
  2. Fix spelling mistakes in the extracted text. Some words are incomplete and some contain garbage characters. Replace each one with the word as it is written in the original book. As long as the text matches the original book, it's good to go.
  3. Keep the metadata (page numbers, publisher's name, table of contents, etc.) as it is.
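
For anyone who wants help spotting garbage characters before checking against the book, here is a minimal Python sketch. It is not part of the project's tooling. The function names `is_expected` and `find_suspect_lines` and the set of allowed punctuation are assumptions made for illustration. It only flags lines for manual review and does not delete anything, since metadata and legitimate text still have to be kept as they are.

```python
# Sketch only: flags characters outside the Bengali Unicode block so a human
# can review them against the original book. Names and the allowed set are
# assumptions, not project conventions.

# Punctuation and digits treated as normal here (adjust as needed).
EXTRA = set("0123456789.,;:!?()[]-/'\"\u2018\u2019\u201C\u201D")


def is_expected(ch):
    """Return True if the character is Bangla text or ordinary punctuation."""
    return (
        "\u0980" <= ch <= "\u09FF"          # Bengali Unicode block
        or ch in "\u0964\u0965"             # danda and double danda
        or ch in "\u200C\u200D"             # zero-width non-joiner / joiner
        or ch.isspace()
        or ch in EXTRA
    )


def find_suspect_lines(text):
    """Return (line_number, line, suspect_chars) for lines needing review."""
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        suspects = "".join(ch for ch in line if not is_expected(ch))
        if suspects:
            results.append((number, line, suspects))
    return results


if __name__ == "__main__":
    sample = "পৃষ্ঠা 12\nস্বাধীনতা xq# যুদ্ধ"
    for number, line, suspects in find_suspect_lines(sample):
        print(number, repr(suspects), line)
```

A flagged line is only a candidate. Whether to remove or correct something still depends on what the original book says on that page.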

Links

You can download the text file from here.

After clicking the link above, click 'Download raw file' (the download icon) in the top-right corner to download the file to your local machine.

Click here for the original book.

Important

Please follow the instructions in CONTRIBUTING.md to start contributing. Do not assign the issue to yourself right away and start working on it. If you have any questions about the contribution process or you hit a roadblock, please feel free to reach out to me (@dg1223) or any other team/tech leads by commenting here or on our WhatsApp channel.

Thank you for your willingness to contribute to this project.

SirazSium84 commented 2 months ago

@dg1223 Bhai can you assign me to this task please

dg1223 commented 2 months ago

> @dg1223 Bhai can you assign me to this task please

done

dg1223 commented 1 month ago

PR #37