mesolitica / malaysian-dataset

We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
Apache License 2.0
297 stars 106 forks

update notebooks maljnutr #391

Closed mindscapexyz closed 7 months ago

mindscapexyz commented 7 months ago

There was an error in the script when retrieving the PDFs: some of the PDFs share common file names (e.g. a.pdf, b.pdf), so I added the volume number as a suffix to make them unique.
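To illustrate the fix described above, here is a minimal sketch of appending a volume number to a downloaded file name. The helper name `unique_pdf_name`, the `-vol<N>` suffix format, and the example URLs are assumptions for illustration only; the notebook in this PR may use a different naming scheme.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def unique_pdf_name(pdf_url: str, volume: int) -> str:
    """Build a local file name that includes the volume number.

    Hypothetical helper: the "-vol<N>" suffix format is an assumption,
    not necessarily the format used in the notebook.
    """
    # Take only the path part of the URL, then its last component (e.g. "a.pdf").
    path = PurePosixPath(urlparse(pdf_url).path)
    # Insert the volume number between the stem and the extension.
    return f"{path.stem}-vol{volume}{path.suffix}"


# Two PDFs that share the name "a.pdf" in different volumes no longer collide.
# The URLs below are placeholders, not real journal links.
name_12 = unique_pdf_name("https://example.org/vol12/a.pdf", 12)
name_13 = unique_pdf_name("https://example.org/vol13/a.pdf", 13)
print(name_12, name_13)
```

Without the suffix, both downloads would be written to `a.pdf` and the later one would overwrite the earlier one, which would be consistent with fewer PDFs and less extracted text than expected.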

The datasets and PDFs were also updated at https://huggingface.co/datasets/hlmshkr/maljnutr-pdfs/tree/main

The extracted text in JSONL is now around 29 MB (previously 11 MB), and there are a total of 965 PDFs.