google-research-datasets / wiki-atomic-edits

A dataset of atomic Wikipedia edits, each an insertion or deletion of a contiguous chunk of text in a sentence. The dataset contains ~43 million edits across 8 languages.

English portion of the dataset seems to be corrupted #1

Closed by pcyin 6 years ago

pcyin commented 6 years ago

Hi,

I tried to unzip the English portion of the dataset:

wget https://storage.googleapis.com/wiki-atomic-edits/english/insertions.tsv.gz
zcat insertions.tsv.gz > insertions.tsv

And it gave an error:

gzip: insertions.tsv.gz: invalid compressed data--format violated

The data files for other languages (e.g., German) seem to be fine.

mfaruqui commented 6 years ago

Could you please re-download the files and confirm that they look fine? I have fixed this for English and am proceeding to the other languages. Thanks for reporting the problem!
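
One way to confirm that a re-downloaded archive is intact, without writing the decompressed file to disk, is gzip's built-in integrity test (`gzip -t`). The sketch below is only an illustration: `check_gz` is a hypothetical helper name and is not part of the dataset's tooling.

```shell
#!/bin/sh
# check_gz is a hypothetical helper, not part of the dataset's tooling.
# It runs `gzip -t`, which reads the whole compressed stream and verifies
# its integrity without writing a decompressed file.
check_gz() {
    if gzip -t "$1"; then
        echo "OK: $1"
    else
        echo "CORRUPT: $1"
        return 1
    fi
}

# Example usage after re-downloading (not executed here):
#   wget -O insertions.tsv.gz https://storage.googleapis.com/wiki-atomic-edits/english/insertions.tsv.gz
#   check_gz insertions.tsv.gz && zcat insertions.tsv.gz > insertions.tsv
```

A corrupted or truncated archive makes `gzip -t` exit with a non-zero status and print a diagnostic such as the "invalid compressed data" message reported above, so the decompression step in the usage example is skipped.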

pcyin commented 6 years ago

It works! Thanks so much!