Open fancellu opened 6 months ago
When I split it into 1k-line files and run load_dataset on each, it all works fine!
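The split-and-load workaround can be sketched like this (the chunk size default, output directory, and function name are my own choices for illustration, not from the report):

```python
import json
from pathlib import Path

def split_jsonl(path, lines_per_chunk=1000, out_dir="chunks"):
    """Split a JSONL file into files of at most `lines_per_chunk` lines each."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    chunk, index = [], 0

    def flush():
        nonlocal chunk, index
        out = Path(out_dir) / f"chunk_{index:04d}.jsonl"
        out.write_text("".join(chunk))
        chunk_paths.append(str(out))
        chunk, index = [], index + 1

    with open(path, "r") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                flush()
    if chunk:  # flush the final partial chunk
        flush()
    return chunk_paths
```

Each resulting chunk can then be fed to `load_dataset("json", data_files=chunk_path, split="train")` individually.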
To make this easier to solve, here is my poison payload, zipped up.
Also, if I remove the lines that have a pull_request value from the JSON, the filtered JSONL loads just fine too, e.g.
```python
import json

filtered_lines = []
with open("datasets-issues.jsonl", "r") as f:
    for line in f:
        data = json.loads(line.strip())  # Parse each line as JSON
        if not data.get("pull_request"):  # Keep lines with no "pull_request" value
            filtered_lines.append(line)

# Write the filtered lines to a new file
with open("filtered_jsonl.jsonl", "w") as f:
    f.writelines(filtered_lines)
```
https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt
https://discuss.huggingface.co/t/chapter-5-questions/11744/83?u=fancellu
```python
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
```

barfs with
Someone else saw the same error in Sept 2023 too.
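A stdlib-only sketch for narrowing down where the schema goes inconsistent. This is my own diagnostic idea, not from the report: since filtering out pull_request rows fixes the load, my guess is that mixed value types for one key are breaking Arrow's type inference in load_dataset.

```python
import json
from collections import defaultdict

def key_type_report(path):
    """Map each top-level JSON key to the set of Python type names seen for its values."""
    types_by_key = defaultdict(set)
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            for key, value in json.loads(line).items():
                types_by_key[key].add(type(value).__name__)
    return dict(types_by_key)

def mixed_type_keys(path):
    """Keys whose values take more than one type across lines: candidates for the failure."""
    return {k: v for k, v in key_type_report(path).items() if len(v) > 1}
```

Running `mixed_type_keys("datasets-issues.jsonl")` should show whether pull_request (or any other key) takes several value types across lines.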