huggingface / course

The Hugging Face course on Transformers
https://huggingface.co/course
Apache License 2.0

Creating your own dataset load_dataset issue #692

Open fancellu opened 6 months ago

fancellu commented 6 months ago

https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt

https://discuss.huggingface.co/t/chapter-5-questions/11744/83?u=fancellu

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")

barfs with

TypeError: Couldn't cast array of type timestamp[s] to null

Someone else saw the same error in Sept 2023
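
The message reads as though a column's type was inferred as null from an early part of the file, and a timestamp value for the same column turned up later. A minimal stdlib sketch for locating such fields follows; `walk` and `find_mixed_null_fields` are hypothetical helper names, not part of `datasets`, and the sketch assumes only nested objects (not arrays) need to be descended into.

```python
import json


def walk(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs, descending into nested dicts."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, value


def find_mixed_null_fields(path):
    """Map each field that is null on some line and non-null on a later line
    to the 1-based line number where the first non-null value appears."""
    first_null = {}   # field -> first line where the value is null
    first_value = {}  # field -> first line where the value is not null
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            for field, value in walk(json.loads(line)):
                seen = first_null if value is None else first_value
                seen.setdefault(field, lineno)
    return {
        field: first_value[field]
        for field in first_null
        if field in first_value and first_null[field] < first_value[field]
    }
```

Calling `find_mixed_null_fields("datasets-issues.jsonl")` returns a dict of candidate fields. A field whose first non-null value appears far into the file is a plausible suspect, though this alone does not prove it is the one behind the cast error.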

fancellu commented 6 months ago

When I split into 1k line files, and run load_dataset on each, it all works fine!
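
For anyone wanting to repeat the split, here is a minimal sketch using only the standard library. `split_jsonl` and the numbered output file names are assumptions made for illustration, not what was actually used above.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(src, lines_per_chunk=1000, out_dir="."):
    """Write consecutive slices of `src` to numbered files; return their paths."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(src, "r", encoding="utf-8") as f:
        while True:
            # Take up to `lines_per_chunk` lines from the open file
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                break
            chunk_path = out_dir / f"{src.stem}-{len(chunk_paths):04d}.jsonl"
            with open(chunk_path, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each returned path can then be passed as `data_files` to the same `load_dataset("json", ..., split="train")` call shown earlier.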

To make this easier to solve, here is my poison payload, zipped up

datasets-issues.zip

fancellu commented 6 months ago

Also, if I filter out the records that are pull requests (those with a non-empty pull_request field), the filtered jsonl loads just fine too, e.g.

import json

filtered_lines = []
with open("datasets-issues.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)  # Parse each line as JSON
        # Keep the record only if "pull_request" is missing, null, or empty
        if not data.get("pull_request"):
            filtered_lines.append(line)

# Write the filtered lines to a new file
with open("filtered_jsonl.jsonl", "w", encoding="utf-8") as f:
    f.writelines(filtered_lines)
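
The filter above drops whole records. A variation that keeps every record and removes only the `pull_request` key is sketched below. `drop_field` is a hypothetical helper, and whether its output loads without the cast error is an assumption that has not been checked against the attached file.

```python
import json


def drop_field(src, dst, field="pull_request"):
    """Copy a JSON Lines file, removing `field` from every top-level object.

    Returns the number of records written.
    """
    written = 0
    with open(src, "r", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            record.pop(field, None)  # no error if the key is absent
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

If the stripped file loads cleanly, that would point at something inside `pull_request` (rather than elsewhere in those records) as the source of the type mismatch.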