We are taking several history books about Bangladesh and building a RAG system. The goal is to let people look up any specific historical event within seconds.
MIT License
Data cleaning task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 7 #18
This is the data cleanup task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 7.
What to do
Remove all non-Bangla characters and words that do not make sense or cannot be traced back to the original text.
Fix spelling mistakes in the extracted text. Some words are incomplete and some contain garbage characters; replace each one with the word as it is written in the original book. As long as the text matches the original book, it is good to go.
Keep the metadata (page numbers, publisher's name, table of contents, etc.) as it is.
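As an optional aid for the steps above, here is a minimal Python sketch that lists lines containing characters outside the Bangla Unicode block, so you know where to look first. It is not part of the project's tooling. The function names and the set of extra allowed characters (ASCII digits and basic punctuation, kept so that metadata such as page numbers is not flagged) are assumptions made for illustration. The script only reports candidates and changes nothing; every flagged character still has to be checked against the original book.

```python
# Hypothetical helper, not part of the repository: flags characters that are
# probably extraction garbage so a contributor can review them by hand.

# Assumption: besides the Bangla block we allow the danda signs, zero-width
# joiners, ASCII digits and basic punctuation, since metadata may use them.
ALLOWED_EXTRA = set("\u0964\u0965\u200c\u200d0123456789.,;:!?'\"()[]-")


def is_allowed(ch):
    """Return True if `ch` is Bangla, whitespace, or in the extra allowed set."""
    return 0x0980 <= ord(ch) <= 0x09FF or ch.isspace() or ch in ALLOWED_EXTRA


def flag_lines(text):
    """Return (line_number, suspect_characters) for each line needing review."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        suspects = [ch for ch in line if not is_allowed(ch)]
        if suspects:
            flagged.append((number, suspects))
    return flagged


if __name__ == "__main__":
    sample = "আমি বাংলায় গান গাই।\nঢাকা xq# শহর\nপৃষ্ঠা ১২"
    for number, suspects in flag_lines(sample):
        print(f"line {number}: {''.join(suspects)}")
```

A check like this cannot catch misspelled Bangla words or garbage that happens to be made of valid Bangla characters; those still require reading the text against the book.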
Links
You can download the text file from here.
After clicking the link above, click 'Download raw file' (the download icon) on the top right corner to download the file to your local machine.
Click here for the original book.
Important
Please follow the instructions in CONTRIBUTING.md to start contributing. Do not assign the issue to yourself right away and start working on it. If you have any questions about the contribution process or you hit a roadblock, please feel free to reach out to me (@dg1223) or any other team/tech leads by commenting here or on our WhatsApp channel.
Thank you for your willingness to contribute to this project.