rahular/varta

https://arxiv.org/abs/2305.05858
Apache License 2.0

Vārta : A Large-Scale Headline-Generation Dataset for Indic Languages

This repository contains the code and other resources for the Vārta paper, published in the Findings of ACL 2023.

Dataset | Pretrained Models | Finetuning | Evaluation | Citation

Dataset

The Vārta dataset is available on the Hugging Face Hub. We release train, validation, and test files in JSONL format, with one article object per line.

To recreate the dataset, follow the instructions in this README file.

The train, val, and test folders each contain language-specific JSON files and one aggregated file. The train folder additionally contains multiple aggregated training files for the different experiments, which you will have to recreate.
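Since the directory tree is not reproduced here, one way to inspect the exact layout is to list the repository's files programmatically. A minimal sketch using the huggingface_hub client, taking the val split as an example:

# List the dataset repository's files to see the train/val/test layout.
from huggingface_hub import list_repo_files

files = list_repo_files("rahular/varta", repo_type="dataset")
for f in files:
    if f.startswith("varta/val/"):  # e.g. varta/val/langwise/val_<lang>.json
        print(f)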

Note: if you don't want to download the whole dataset and only need a single file, you can do something like:

wget https://huggingface.co/datasets/rahular/varta/raw/main/varta/<split>/langwise/<split>_<lang>.json
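The same can be done in Python with huggingface_hub, which also caches the file locally. A minimal sketch, assuming hi (Hindi) is one of the released language codes and using the val split as an example:

# Download one language-specific file and read it as JSON lines.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="rahular/varta",
    repo_type="dataset",
    filename="varta/val/langwise/val_hi.json",  # varta/<split>/langwise/<split>_<lang>.json
)
with open(path, encoding="utf-8") as f:
    articles = [json.loads(line) for line in f]
print(len(articles), "articles")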

Pretrained Models

The code for pretraining our models is in this repository; the pretrained checkpoints, Varta-BERT and Varta-T5, are available on the Hugging Face Hub.
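For reference, loading one of the checkpoints from the Hub looks like the sketch below. The model ID is an assumption based on this repository's namespace, so verify the exact name on the Hub.

# Load a pretrained checkpoint (model ID assumed; verify on the Hub).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rahular/varta-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")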

Finetuning Experiments

The code for all finetuning experiments reported in the paper can be found in the baselines folder.
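The exact scripts and hyperparameters live in that folder; purely as an illustration of what a headline-generation finetuning run involves, here is a generic Seq2SeqTrainer sketch. The field names text and headline, the file names, and the hyperparameters are assumptions for illustration, not the repository's actual setup:

# A generic seq2seq finetuning sketch (NOT the repo's baselines script).
# Assumes each example has "text" and "headline" fields; adjust to the real schema.
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

model_name = "rahular/varta-t5"  # assumption: verify the model ID on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical local file names; use the langwise files described above.
raw = load_dataset("json", data_files={"train": "train_hi.json", "val": "val_hi.json"})

def preprocess(batch):
    # Tokenize the article body as input and the headline as target.
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["headline"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        predict_with_generate=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["val"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()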

Evaluation

We use the multilingual variant of ROUGE implemented for the XL-Sum paper to evaluate the headline-generation and abstractive-summarization tasks in our experiments.
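That multilingual package is a fork of Google's rouge-score library that adds tokenizers and stemmers for additional languages and, to the best of my knowledge, keeps the upstream interface. The sketch below uses that shared interface with hypothetical strings:

# ROUGE sketch via the rouge-score interface; the multilingual fork released
# with XL-Sum adds language-specific tokenization on top of this API.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "india launches new earth-observation satellite"  # hypothetical gold headline
prediction = "india launches an earth observation satellite"  # hypothetical model output
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))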

Citation

@misc{aralikatte2023varta,
      title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages}, 
      author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
      year={2023},
      eprint={2305.05858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}