kevinyaobytedance / llm_unlearn

LLM Unlearning
MIT License

Issues with running the script #6

Open projectavi opened 3 months ago

projectavi commented 3 months ago

When I run unlearn_harm.py, I get the following error:

pyarrow.lib.ArrowInvalid: Column 5 named input_ids expected length 1000 but got length 1096

This happens after I changed the split passed to load_dataset to "train", because the original split does not exist.

qzc438 commented 2 months ago

I have the same problem. Do you have any updates?

Xiang-Pan commented 2 months ago

I ran into the same problem. The issue is with the map call: when a batch of 1000 input rows produces 1096 output rows, the error appears. One workaround is to write the map manually:

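from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

# `dataset` and `preproccess` are the objects previously used in the dataset.map call.
dataloader = DataLoader(dataset, batch_size=1000)
# Collect only the preprocessed columns, so the output row count
# does not have to match the input row count.
d = {}
d["input_ids"] = []
d["attention_mask"] = []
d["start_locs"] = []
for batch in tqdm(dataloader):
    p_batch = preproccess(batch)
    d["input_ids"].extend(p_batch["input_ids"])
    d["attention_mask"].extend(p_batch["attention_mask"])
    d["start_locs"].extend(p_batch["start_locs"])
# Build a new dataset from the accumulated columns.
dataset = Dataset.from_dict(d)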
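
To make the idea behind this workaround easier to see, here is a minimal plain-Python sketch with no external libraries. It assumes the preprocessing function can emit more rows than it receives, which the 1000-in, 1096-out counts in the error suggest. The names `expand_rows` and `map_manually` and the toy data are hypothetical and do not come from the repository.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def expand_rows(batch: Batch) -> Batch:
    """Hypothetical stand-in for the preprocessing step: one output row per
    response, so N input rows can produce more than N output rows."""
    out: Batch = {"input_ids": []}
    for prompt, responses in zip(batch["prompt"], batch["responses"]):
        for response in responses:
            out["input_ids"].append(f"{prompt} {response}")
    return out


def map_manually(columns: Batch, fn: Callable[[Batch], Batch], batch_size: int) -> Batch:
    """Apply fn batch by batch and keep only the columns fn returns.

    Because the original columns are dropped, no column is left behind
    with the old row count, so there is no length mismatch.
    """
    n_rows = len(next(iter(columns.values())))
    result: Batch = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to form one batch.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


# Toy data: 3 input rows, one of which has two responses.
data = {
    "prompt": ["p0", "p1", "p2"],
    "responses": [["a"], ["b", "c"], ["d"]],
}
out = map_manually(data, expand_rows, batch_size=2)
print(len(data["prompt"]), len(out["input_ids"]))  # 3 input rows, 4 output rows
```

In the `datasets` library, `Dataset.map` also accepts a `remove_columns` argument. Passing `remove_columns=dataset.column_names` together with `batched=True` may have the same effect as the manual loop, though I have not checked that against this repository.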