mir-abir-hossain / real-history-of-Bangladesh

We are taking several history books about Bangladesh and building a RAG system. The goal is to let people look up any specific historical event within seconds.
MIT License
18 stars 12 forks

Data cleaning task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 3 #14

Closed: dg1223 closed this 2 months ago

dg1223 commented 2 months ago

This is the data cleanup task for যা দেখেছি যা বুঝেছি যা করেছি - chunk 3.

What to do

  1. Remove all non-Bangla characters and words that do not make any sense or cannot be traced back to the original text.
  2. Fix spelling mistakes in the extracted text. You'll see that some words are incomplete and some contain garbage characters. To fix them, replace each one with the way it is written in the original book. As long as the text matches what's in the original book, it's good to go.
  3. Keep the metadata (page numbers, publisher's name, table of contents, etc.) as is.
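
To make steps 1 and 2 easier to start on, here is a minimal Python sketch that lists tokens worth checking against the book. It is not part of the project's tooling: the function name `flag_suspicious_tokens`, the allowed punctuation set, and the choice to flag rather than delete are all assumptions made for illustration. It treats the Unicode Bengali block (U+0980 to U+09FF), the danda marks, ASCII digits, and basic punctuation as acceptable, so page numbers are left alone.

```python
import re

# Assumed "acceptable" characters: the Bengali block (U+0980-U+09FF),
# danda and double danda (U+0964, U+0965), zero-width (non-)joiners,
# ASCII digits (page numbers), and basic punctuation.
ALLOWED = re.compile(r"[\u0980-\u09FF\u0964\u0965\u200c\u200d0-9.,;:!?'\"()\[\]-]+")

def flag_suspicious_tokens(text):
    """Return (line_number, token) pairs for tokens with characters outside ALLOWED.

    Tokens are only flagged, never removed, so each one can be checked
    against the original book by hand.
    """
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if not ALLOWED.fullmatch(token):
                flagged.append((line_no, token))
    return flagged

sample = "আমি ভাত খাই।\nপৃষ্ঠা 12 xqz"
print(flag_suspicious_tokens(sample))  # [(2, 'xqz')]
```

A flagged token is only a hint. Some Latin-script words may genuinely appear in the book, and a misspelled Bangla word passes this check, so the comparison with the original text is still needed.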

Links

You can download the text file from here.

After clicking the link above, click 'Download raw file' (the download icon) on the top right corner to download the file to your local machine.

Click here for the original book.

Important

Please follow the instructions in CONTRIBUTING.md to start contributing. Do not assign the issue to yourself right away and start working on it. If you have any questions about the contribution process or you hit a roadblock, please feel free to reach out to me (@dg1223) or any other team/tech leads by commenting here or on our WhatsApp channel.

Thank you for your willingness to contribute to this project.

amitsingha commented 2 months ago

Thanks vai. I am continuing with it.

dg1223 commented 2 months ago

> Thanks vai. I am continuing with it.

assigned you, thanks!

dg1223 commented 2 months ago

@amitsingha updated the instructions. Added information on what the deliverables are and a link to the original book.

dg1223 commented 2 months ago

@amitsingha instruction 1 updated; instruction 3 removed

amitsingha commented 2 months ago

@dg1223 No. 03 is applicable for all chunks?

dg1223 commented 2 months ago

> @dg1223 No. 03 is applicable for all chunks?

yes, instruction 3 applies to all chunks, although every chunk except chunk 1 only has page numbers.

one more thing: If there's a table that's not parsed properly, you should try to manually add its missing content or organize it.

amitsingha commented 2 months ago

OK.

dg1223 commented 2 months ago

Merged by #26