UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1

BrownTan commented 3 weeks ago

Has anyone else run into this problem? I know it's triggered while the dataset is being loaded, and it might be a version issue or a gzip archive issue, but I don't know how to fix it.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Traceback (most recent call last):
  File "/xx/LLM-Pruner/hf_prune.py", line 314, in <module>
    main(args)
  File "/xx/LLM-Pruner/hf_prune.py", line 128, in main
    example_prompts = get_examples('bookcorpus', tokenizer, args.num_examples, seq_len = 64).to(args.device)
  File "/xx/LLM-Pruner/LLMPruner/datasets/example_samples.py", line 51, in get_examples
    return get_bookcorpus(tokenizer, n_samples, seq_len)
  File "/xx/LLM-Pruner/LLMPruner/datasets/example_samples.py", line 38, in get_bookcorpus
    i = random.randint(0, len(traindata) - 1)
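For anyone wondering why the error points at a gzip issue: gzip data always starts with the bytes 0x1f 0x8b, so byte 0x8b at position 1 is what you get when gzip-compressed data is read as UTF-8 text. Here is a minimal sketch that reproduces the same message on a small in-memory payload. The helper name `looks_like_gzip` and the sample payload are made up for illustration; this is not code from LLM-Pruner or datasets.

```python
import gzip

# gzip streams always begin with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


# Illustrative payload, standing in for a compressed dataset file.
payload = gzip.compress(b"some text from a dataset shard")
print(looks_like_gzip(payload))

try:
    # Reading compressed bytes as text fails on the second magic byte.
    payload.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)
```

If a downloaded or cached file passes a check like this, it is still compressed (or is not the file the loader expected), which fits both the network explanation and the version explanation.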

BrownTan commented 2 weeks ago

Oh my god. It was because I hadn't turned on Clash.

VincentZ-2020 commented 2 weeks ago

I ran into this too. Could you tell me how to fix it?

BrownTan commented 2 weeks ago

You need a VPN.

VincentZ-2020 commented 2 weeks ago

Downgrading to datasets==2.14.6 helps. See https://github.com/huggingface/datasets/issues/6760#issuecomment-2041390144
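Before downgrading, it may help to check which datasets version is actually installed in the environment that runs hf_prune.py. A minimal sketch using only the standard library follows. The function name `installed_datasets_version` is hypothetical, and the target version is just the one suggested in this thread, not an official recommendation.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Version suggested in this thread (an assumption that it suits your setup).
TARGET = "2.14.6"


def installed_datasets_version() -> Optional[str]:
    """Hypothetical helper: return the installed datasets version, or None."""
    try:
        return version("datasets")
    except PackageNotFoundError:
        return None


current = installed_datasets_version()
if current is None:
    print("datasets is not installed in this environment")
elif current == TARGET:
    print(f"datasets {current} already matches the suggested version")
else:
    print(f"datasets {current} is installed; the suggestion here is to pin {TARGET}")
```

If the versions differ, `pip install datasets==2.14.6` pins the suggested version.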

horseee / LLM-Pruner, issue #77