Closed mindscapexyz closed 7 months ago
There was an error in the script when retrieving the PDFs: some of the PDFs share common names (e.g. a.pdf, b.pdf), so I added the volume number as a suffix to make them unique.
The datasets and PDFs were also updated at https://huggingface.co/datasets/hlmshkr/maljnutr-pdfs/tree/main
The extracted text in JSONL is now around 29 MB (previously 11 MB), and there are 965 PDFs in total.
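For illustration, the renaming fix could look roughly like the sketch below. It is not the actual script from this PR: the function name `unique_pdf_name`, the `_vol<N>` suffix format, and the sample (volume, filename) pairs are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def unique_pdf_name(filename: str, volume: int) -> str:
    """Append the volume number to the file stem (hypothetical naming scheme)."""
    path = PurePosixPath(filename)
    # "a.pdf" with volume 12 becomes "a_vol12.pdf"
    return f"{path.stem}_vol{volume}{path.suffix}"


# Hypothetical (volume, filename) pairs: the same name appears in two volumes.
downloads = [(1, "a.pdf"), (2, "a.pdf"), (2, "b.pdf")]
names = [unique_pdf_name(name, volume) for volume, name in downloads]

# Every saved name is now distinct, so no file replaces another.
assert len(names) == len(set(names))
```

This only guarantees uniqueness if a filename never repeats within a single volume, which is assumed here.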