huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

olm/wikipedia hangs on tiny wikipedia language #3

Closed dlwh closed 1 year ago

dlwh commented 1 year ago

Hi,

Please let me know if I'm in the wrong place.

I was trying out the olm/wikipedia dataset on the promise that it would be a lot faster and not need Apache Beam. However, it just hangs, even for a tiny language like kl:

import datasets
import json

lang = "kl"
dataset = datasets.load_dataset("olm/wikipedia", language=lang, date="20221101")
with open(f"wiki_{lang}_sample.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps(dataset["train"][i]) + "\n")
print("All done!")

Running this gives:

Using custom data configuration 20221101.kl-date=20221101,language=kl
Downloading and preparing dataset wikipedia/20221101.kl to /Users/dlwh/.cache/huggingface/datasets/olm___wikipedia/20221101.kl-date=20221101,language=kl/2.0.0/9bbfd033392e5e52a02fe0514dcf55c0da4ba51a3ed0c390a8fe4ace3f1bd02f...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 2832.08it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1715.46it/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 8924.05it/s]
Generating train split: 0 examples [00:00, ? examples/s]Dowloading Wikipedia dump
Finished downloading Wikipedia dump
Parsing and cleaning Wikipedia examples
Using custom data configuration 20221101.kl-date=20221101,language=kl

I see two forked Python processes (one forked beneath the other), but they're using 0% CPU. I'm on macOS with Python 3.10. The main process appears to be waiting for them.

For comparison, the plain "wikipedia" dataset finishes in ~3 seconds if the data is already downloaded.

TristanThrush commented 1 year ago

Hi, you're in the right place! How long does it hang? I'll try to replicate your issue today.

dlwh commented 1 year ago

I've let it go for 10 minutes, I think? Nothing's using CPU.

TristanThrush commented 1 year ago

Ok yes, I'm getting issues on my Mac too, but they're different from yours 😅.

I did a fresh install of datasets and mwparserfromhell with Python 3.10 on my GCP Linux machine, and it only took an instant to process. Investigating further. In the meantime, I've uploaded the Wikipedia dataset for language="kl", date="20221101" here for you: https://huggingface.co/datasets/Tristan/olm-wikipedia-20221101-kl-language

>>> import datasets
>>> dataset = datasets.load_dataset("olm/wikipedia", language="kl", date="20221101")
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35.8k/35.8k [00:00<00:00, 1.28MB/s]
Downloading metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30.4k/30.4k [00:00<00:00, 1.04MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.7k/12.7k [00:00<00:00, 20.0MB/s]
Using custom data configuration 20221101.kl-date=20221101,language=kl
Downloading and preparing dataset wikipedia/20221101.kl to /home/tristan_huggingface_co/.cache/huggingface/datasets/olm___wikipedia/20221101.kl-date=20221101,language=kl/2.0.0/9bbfd033392e5e52a02fe0514dcf55c0da4ba51a3ed0c390a8fe4ace3f1bd02f...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.5k/11.5k [00:00<00:00, 20.3MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.58it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3015.32it/s]
Dowloading Wikipedia dump
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 568k/568k [00:00<00:00, 4.99MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.05it/s]
Finished downloading Wikipedia dump
Generating train split: 0 examples [00:00, ? examples/s]Parsing and cleaning Wikipedia examples
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 298/298 [00:00<00:00, 730.62it/s]
Parsed and cleaned Wikipedia examples█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                 | 269/298 [00:00<00:00, 816.63it/s]
Dataset wikipedia downloaded and prepared to /home/tristan_huggingface_co/.cache/huggingface/datasets/olm___wikipedia/20221101.kl-date=20221101,language=kl/2.0.0/9bbfd033392e5e52a02fe0514dcf55c0da4ba51a3ed0c390a8fe4ace3f1bd02f. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 981.12it/s]
>>> 
dlwh commented 1 year ago

Hi @TristanThrush, I appreciate it! I didn't actually need kl; I just looked for a small but not too small Wikipedia to test this out on. Good to know it works on GCP. That's where I actually need to do the work, but I was testing locally first.

TristanThrush commented 1 year ago

Ok cool, let me know if you run into problems on GCP. I'll still dig into the Mac thing.

TristanThrush commented 1 year ago

Alrighty, re the Mac thing: the issue for me was this:

multiprocessing was not working, because pickle can't serialize functions that aren't defined at the top level of a module. It was failing because of this: https://huggingface.co/datasets/olm/wikipedia/blob/main/wikipedia.py#L1036. I have no idea why it was working on Linux, actually!
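A minimal stdlib-only sketch of that failure mode (the `make_worker`/`worker` names here are hypothetical, not from the dataset script): pickle serializes a function by its module and qualified name, so a function defined inside another function can't be round-tripped. This is also the likely reason the same code worked on Linux: there, `multiprocessing` defaults to the `fork` start method, which doesn't need to pickle the worker, while macOS has defaulted to `spawn` (which does) since Python 3.8.

```python
import multiprocessing
import pickle

def make_worker():
    # Nested function: pickle can't serialize it, because it isn't
    # reachable as a top-level attribute of any importable module.
    def worker(x):
        return x * 2
    return worker

# "fork" on Linux, "spawn" on macOS since Python 3.8
print(multiprocessing.get_start_method())

try:
    pickle.dumps(make_worker())
except (pickle.PicklingError, AttributeError) as err:
    print("pickling failed:", err)
```

Under `spawn`, the worker function has to survive a pickle round-trip to reach the child process, so the failure surfaces only on platforms using that start method.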

There was a very easy fix: just use multiprocess, which serializes with dill, instead of multiprocessing, which uses pickle. I made the change here: https://huggingface.co/datasets/olm/wikipedia/blob/main/wikipedia.py#L27
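Since multiprocess mirrors the stdlib multiprocessing API, the swap amounts to a one-line import change. A hedged sketch (the `square` helper is hypothetical; `multiprocess` is a third-party package, `pip install --upgrade multiprocess`):

```python
# The multiprocess fork serializes with dill, which handles nested
# functions, lambdas, and closures; stdlib multiprocessing uses pickle.
try:
    import multiprocess as mp  # pip install --upgrade multiprocess
except ImportError:
    import multiprocessing as mp  # stdlib fallback, same API

def square(x):  # top-level, so either backend can serialize it
    return x * x

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Because the two packages expose the same interface, the rest of the dataset script is unchanged by the swap.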

Does it work for you now? Or maybe you had a different issue? I actually got a pickle error; it wasn't just hanging for me. For larger datasets, it does hang for a little while before you see a tqdm bar, but 10 minutes for kl seems like way too much.

Thanks for helping to improve this dataset.

dlwh commented 1 year ago

Can confirm it appears to be working now! (For any future people coming here from Google: I had to upgrade my multiprocess install.)

Thanks for the quick fix!

TristanThrush commented 1 year ago

Ok, sounds good. I just added the latest multiprocess version to the requirements. I'm closing this issue now, but feel free to reopen or open a new issue if there's anything I've missed.