Closed by dlwh 1 year ago
Hi, you're in the right place! How long does it hang? I'll try to replicate your issue today
I've let it go for 10 minutes I think? Nothing's using CPU.
Ok yes I'm getting issues on my mac too, but they're different than yours 😅.
I did a fresh install of datasets and mwparserfromhell with python 3.10 on my GCP linux machine and it only took an instant to process. Investigating further. In the meantime, I've uploaded the wikipedia for language="kl", date="20221101" here for you: https://huggingface.co/datasets/Tristan/olm-wikipedia-20221101-kl-language
>>> import datasets
>>> dataset = datasets.load_dataset("olm/wikipedia", language="kl", date="20221101")
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35.8k/35.8k [00:00<00:00, 1.28MB/s]
Downloading metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30.4k/30.4k [00:00<00:00, 1.04MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.7k/12.7k [00:00<00:00, 20.0MB/s]
Using custom data configuration 20221101.kl-date=20221101,language=kl
Downloading and preparing dataset wikipedia/20221101.kl to /home/tristan_huggingface_co/.cache/huggingface/datasets/olm___wikipedia/20221101.kl-date=20221101,language=kl/2.0.0/9bbfd033392e5e52a02fe0514dcf55c0da4ba51a3ed0c390a8fe4ace3f1bd02f...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.5k/11.5k [00:00<00:00, 20.3MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.58it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3015.32it/s]
Dowloading Wikipedia dump
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 568k/568k [00:00<00:00, 4.99MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.05it/s]
Finished downloading Wikipedia dump
Generating train split: 0 examples [00:00, ? examples/s]
Parsing and cleaning Wikipedia examples
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 298/298 [00:00<00:00, 730.62it/s]
Parsed and cleaned Wikipedia examples
Dataset wikipedia downloaded and prepared to /home/tristan_huggingface_co/.cache/huggingface/datasets/olm___wikipedia/20221101.kl-date=20221101,language=kl/2.0.0/9bbfd033392e5e52a02fe0514dcf55c0da4ba51a3ed0c390a8fe4ace3f1bd02f. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 981.12it/s]
>>>
Hi @TristanThrush, I appreciate it! I didn't actually need kl; I just looked for a small-but-not-too-small wikipedia to test this out on. Good to know it works on gcp. That's where I actually need to do work, but I was testing locally first.
Ok cool, let me know if you hit problems on gcp. I'll still dig into the mac thing.
Alrighty, re the mac thing: the issue for me was that multiprocessing was not working, because pickle does not work on functions that are not at the top level of a module. It was failing because of this: https://huggingface.co/datasets/olm/wikipedia/blob/main/wikipedia.py#L1036. I have no idea why it was working on linux, actually!
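The failure is easy to reproduce in isolation with just the standard library (the `make_worker`/`worker` names below are hypothetical illustrations, not the actual functions in wikipedia.py):

```python
import pickle

def make_worker():
    # A function defined inside another function has no importable
    # module-level name, so pickle (which serializes functions by
    # reference, i.e. module + qualified name) refuses it.
    def worker(x):
        return x * 2
    return worker

worker = make_worker()

try:
    pickle.dumps(worker)
    print("pickled ok")
except (pickle.PicklingError, AttributeError) as err:
    # CPython reports something like "Can't pickle local object ..."
    print("pickle failed:", err)
```

multiprocess sidesteps this because dill serializes the function by value (its code object) rather than by reference to an importable name.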
There was a very easy fix: just use multiprocess, which uses dill, instead of multiprocessing, which uses pickle. I made the change here: https://huggingface.co/datasets/olm/wikipedia/blob/main/wikipedia.py#L27
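Since multiprocess mirrors the multiprocessing API, the fix amounts to an import swap. A minimal sketch of the pattern (the try/except fallback here is illustrative, not necessarily what wikipedia.py does):

```python
# multiprocess mirrors the stdlib multiprocessing API but serializes
# tasks with dill instead of pickle, so locally-defined functions
# survive the trip to worker processes.
# Note: the fallback below is an illustration for this sketch; the
# actual file presumably imports multiprocess directly.
try:
    import multiprocess as multiprocessing  # dill-based drop-in
except ImportError:
    import multiprocessing  # stdlib fallback (pickle-based)

# Downstream code keeps the same names: Pool, Process, Queue, ...
```

Because the two packages expose the same names, no other call sites need to change.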
Does it work for you now? Or maybe you had a different issue? I actually got a pickle error; it wasn't just hanging for me. For larger datasets, it does hang for a little while before you see a tqdm bar, but 10 minutes for kl seems like way too much.
Thanks for helping to improve this dataset
Can confirm it appears to be working now! (I had to upgrade my multiprocess install, for any future people coming here from google)
Thanks for the quick fix!
Ok sounds good. I just added the latest multiprocess version to the requirements. I'm closing this issue right now, but feel free to reopen/open a new issue if there are things I've missed
Hi,
Please let me know if I'm in the wrong place.
I was trying out the olm/wikipedia dataset on the promise that it would be a lot faster and not need Apache Beam. However, it just hangs, even for a tiny language like kl:
Running this gives:
I see two forked python processes (forked beneath another), but they're using 0% cpu. I'm on macos with Python 3.10. The main process appears to be waiting for these processes.
For comparison, the plain "wikipedia" dataset finishes in ~3 seconds if the data is already downloaded.