vis-nlp / UniChart

MIT License

About processing pretraining dataset #10

Closed: dydxdt closed this issue 1 week ago

dydxdt commented 6 months ago

Great job! I want to use your dataset for my task, so I downloaded the parquet files from here: https://huggingface.co/datasets/ahmed-masry/unichart-pretrain-data/tree/main/data. How should I process the query-label pairs? After I extract the content of the 3 parquet files and save each entry as "{imgname}.txt", I end up with repeated file names in the three corresponding directories. Could you tell me the right way to save the QA pairs? Thanks!

AhmedMasryKU commented 3 months ago

Hey @dydxdt, I added a model card explaining how to load the data on Hugging Face: https://huggingface.co/datasets/ahmed-masry/unichart-pretrain-data/blob/main/README.md
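
For readers hitting the same repeated-file problem: saving one "{imgname}.txt" per row overwrites earlier rows whenever the same image name appears with more than one query. A minimal sketch of one possible workaround is to group all pairs by image name first and write one file per image. This is not taken from the linked README; the field names `imgname`, `query`, and `label`, the helper names, and the toy records are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_pairs_by_image(records):
    """Collect every query/label pair under its image name.

    `records` is any iterable of dicts with the (assumed) keys
    "imgname", "query", and "label".
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["imgname"]].append(
            {"query": rec["query"], "label": rec["label"]}
        )
    return dict(grouped)


def save_grouped_pairs(grouped, out_dir):
    """Write one JSON file per image, holding all of its pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for imgname, pairs in grouped.items():
        # Replace path separators so the image name is a flat file name.
        safe_name = imgname.replace("/", "_")
        out_path = out_dir / f"{safe_name}.json"
        out_path.write_text(
            json.dumps(pairs, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )


# Toy records (hypothetical, not real dataset rows): the same image
# appears twice with different queries.
toy_records = [
    {"imgname": "chart_a.png", "query": "<summarize_chart>", "label": "Summary A"},
    {"imgname": "chart_a.png", "query": "<extract_data_table>", "label": "Table A"},
    {"imgname": "chart_b.png", "query": "<summarize_chart>", "label": "Summary B"},
]

grouped_toy = group_pairs_by_image(toy_records)
```

If the rows are read with pandas, something like `pd.read_parquet(path, columns=["imgname", "query", "label"]).to_dict("records")` would produce input in this shape, again assuming those column names match the parquet files. Calling `save_grouped_pairs(grouped_toy, "qa_pairs")` then writes one file per image instead of one per row.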