huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

ModuleNotFoundError: No module named 'apache_beam' when loading specific languages #1790

Open miyamonz opened 3 years ago

miyamonz commented 3 years ago
import datasets
wiki = datasets.load_dataset('wikipedia', '20200501.ja', cache_dir='./datasets')

Then ModuleNotFoundError: No module named 'apache_beam' happened.

The error doesn't appear when I use '20200501.en'. I'm not familiar with Apache Beam, but according to #498 it isn't necessary when the dataset is already saved locally. Is that correct?

lhoestq commented 3 years ago

Hi !

Apache Beam is a framework used to define data transformation pipelines. These pipelines can then be run on many runtimes: Dataflow, Spark, Flink, etc. There is also a local runner called the DirectRunner. Wikipedia is a dataset that requires some parsing, so to allow the processing to be run on these kinds of runtimes we're using Apache Beam.

At Hugging Face we've already processed certain versions of wikipedia (the 20200501.en one for example) so that users can directly download the processed version instead of using Apache Beam to process it. However, we haven't processed the Japanese version, so you'll have to run the processing on your side. So you do need Apache Beam to process 20200501.ja.

You can install Apache Beam with

pip install apache-beam

I think we can probably improve the error message to let users know of this subtlety. What #498 implied is that Apache Beam is not needed when you process a dataset that doesn't use Apache Beam.
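
For reference, once apache-beam is installed, running the processing locally should look roughly like the snippet below. This is only a sketch: it reuses the config from your snippet, passes beam_runner to select the DirectRunner, and assumes the dump for that date is still available on dumps.wikimedia.org.

import datasets

# Process the Japanese Wikipedia dump locally with Beam's DirectRunner.
# Note: this processing is heavy and can take a lot of time and memory.
wiki = datasets.load_dataset(
    "wikipedia",
    "20200501.ja",
    cache_dir="./datasets",
    beam_runner="DirectRunner",
)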

miyamonz commented 3 years ago

Thanks for your reply! I understand now.

I tried again after installing apache-beam and adding beam_runner="DirectRunner". Another package, mwparserfromhell, is also required, so I installed it too. But it still failed: the process exited with code 1 and no error message.

import datasets
# BTW, 20200501.ja doesn't exist on the Wikipedia dump server anymore, so I specified the date argument
wiki = datasets.load_dataset("wikipedia", language="ja", date="20210120", cache_dir="./datasets", beam_runner="DirectRunner")
print(wiki)

and its log is below

Using custom data configuration 20210120.ja
Downloading and preparing dataset wikipedia/20210120.ja-date=20210120,language=ja (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to ./datasets/wikipedia/20210120.ja-date=20210120,language=ja/0.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63...
Killed

I also tried on another machine because it might have been caused by insufficient resources.

$ python main.py
Using custom data configuration 20210120.ja
Downloading and preparing dataset wikipedia/20210120.ja-date=20210120,language=ja (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to ./datasets/wikipedia/20210120.ja-date=20210120,language=ja/0.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63...

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    wiki = datasets.load_dataset("wikipedia", language="ja", date="20210120", cache_dir="./datasets", beam_runner="DirectRunner")
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/load.py", line 609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/builder.py", line 526, in download_and_prepare
    self._download_and_prepare(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/builder.py", line 1069, in _download_and_prepare
    pipeline_results = pipeline.run()
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 561, in run
    return self.runner.run_pipeline(self, self._options)
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 126, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 182, in run_pipeline
    self._latest_run_result = self.run_via_runner_api(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 193, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 358, in run_stages
    stage_results = self._run_stage(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 549, in _run_stage
    last_result, deferred_inputs, fired_timers = self._run_bundle(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 595, in _run_bundle
    result, splits = bundle_manager.process_bundle(
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 888, in process_bundle
    self._send_input_to_worker(process_bundle_id, transform_id, elements)
  File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 765, in _send_input_to_worker
    data_out.write(byte_stream)
  File "apache_beam/coders/stream.pyx", line 42, in apache_beam.coders.stream.OutputStream.write
  File "apache_beam/coders/stream.pyx", line 47, in apache_beam.coders.stream.OutputStream.write
  File "apache_beam/coders/stream.pyx", line 109, in apache_beam.coders.stream.OutputStream.extend
AssertionError: OutputStream realloc failed.

gchhablani commented 3 years ago

Hi @miyamonz,

I tried replicating this issue using the same snippet you used. I was able to download the dataset without any issues, although I stopped it in the middle because the dataset is huge.

Based on a similar issue here, it could be related to your environment setup, although I am just guessing here. Can you share these details?

miyamonz commented 3 years ago

Thanks for your reply, and sorry for my late response.

Environment

My local machine environment info:

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

GPU: RTX 2070 Super. Inside WSL there is no nvidia-smi command; I don't know why. But torch.cuda.is_available() returns True, and GPU usage goes up when I run ML training code, so I think it works.

From PowerShell, nvidia-smi.exe is available and its output is below.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.05       Driver Version: 470.05       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:09:00.0  On |                  N/A |
|  0%   30C    P8    19W / 175W |    523MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1728    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      3672    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A      6304    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A      8648    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A      9536    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     10668    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     10948    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     11988    C+G   ...8wekyb3d8bbwe\Cortana.exe    N/A      |
|    0   N/A  N/A     12464    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     13280    C+G   ...upport\CEF\Max Helper.exe    N/A      |
|    0   N/A  N/A     15948    C+G   ...t\GoogleIMEJaRenderer.exe    N/A      |
|    0   N/A  N/A     16128    C+G   ...ram Files\Slack\Slack.exe    N/A      |
|    0   N/A  N/A     19096    C+G   ...8bbwe\WindowsTerminal.exe    N/A      |
+-----------------------------------------------------------------------------+

I don't know what else I should show in a case like this. If this isn't enough, please tell me which commands to run.


What I did

I investigated further and found two issues.

About the first one, I wrote it as a new issue. https://github.com/huggingface/datasets/issues/2031

The error I mentioned in the previous comment, which occurred on my local machine, no longer occurs.

But it still fails. In the previous comment I wrote that AssertionError: OutputStream realloc failed. happened on another machine; now it also happens on my local machine.

Here's what I've tried.

wikipedia.py downloads xml.bz2 files based on dumpstatus.json. For the Japanese Wikipedia dataset that I specified, it will download these 6 files.

Here is https://dumps.wikimedia.org/jawiki/20210120/dumpstatus.json filtered down to what wikipedia.py actually uses:

 {
   "jobs": {
     "articlesmultistreamdump": {
       "files": {
         "jawiki-20210120-pages-articles-multistream1.xml-p1p114794.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream1.xml-p1p114794.bz2"
         },
         "jawiki-20210120-pages-articles-multistream2.xml-p114795p390428.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream2.xml-p114795p390428.bz2"
         },
         "jawiki-20210120-pages-articles-multistream3.xml-p390429p902407.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream3.xml-p390429p902407.bz2"
         },
         "jawiki-20210120-pages-articles-multistream4.xml-p902408p1721646.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream4.xml-p902408p1721646.bz2"
         },
         "jawiki-20210120-pages-articles-multistream5.xml-p1721647p2807947.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream5.xml-p1721647p2807947.bz2"
         },
         "jawiki-20210120-pages-articles-multistream6.xml-p2807948p4290013.bz2": {
           "url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream6.xml-p2807948p4290013.bz2"
         }
       }
     }
   }
 }
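
(As a side note, here is a rough sketch of how these files can be listed from dumpstatus.json. The job name comes from the JSON above; wikipedia.py applies some extra filtering on top of this, and old dump dates eventually disappear from the server.)

import json
from urllib.request import urlopen

# Rough sketch: list the multistream dump files referenced by dumpstatus.json.
# Old dump dates are eventually removed from dumps.wikimedia.org.
url = "https://dumps.wikimedia.org/jawiki/20210120/dumpstatus.json"
status = json.load(urlopen(url))
files = status["jobs"]["articlesmultistreamdump"]["files"]
for name, info in sorted(files.items()):
    print(name, "->", "https://dumps.wikimedia.org" + info["url"])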

So I tried running with fewer resources by modifying this line: https://github.com/huggingface/datasets/blob/13a5b7db992ad5cf77895e4c0f76595314390418/datasets/wikipedia/wikipedia.py#L524 I changed it like this, just truncating the filepaths list: | "Initialize" >> beam.Create(filepaths[:1])

I also added a print line inside the loop of _extract_content, like this: if i % 100000 == 0: print(i)
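
Putting those two changes together, a simplified stand-in for the pipeline I'm running looks roughly like this (not the actual wikipedia.py code; the extraction step here is just a fake placeholder):

import apache_beam as beam

# Simplified stand-in for wikipedia.py's pipeline, showing the two debugging
# changes: truncating the file list fed to beam.Create and printing progress
# inside the per-file extraction step.
filepaths = ["multistream1.xml.bz2", "multistream2.xml.bz2"]  # placeholder names

def extract_content(filepath):
    # Placeholder for _extract_content: pretend each file yields a few articles.
    for i in range(5):
        if i % 100000 == 0:
            print(filepath, i)  # progress print added for debugging
        yield f"{filepath}:article-{i}"

with beam.Pipeline(runner="DirectRunner") as pipeline:
    (
        pipeline
        | "Initialize" >> beam.Create(filepaths[:1])  # only the first dump file
        | "Extract content" >> beam.FlatMap(extract_content)
    )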

First, without the modification, it always stops after all of the _extract_content work is done.

My opinion

It seems to succeed when the total file size is small.

So at least it isn't an issue with a specific file.

I don't know whether this is true, but I think that when the beam writer writes to a file, it consumes memory proportional to the entire file size. Is that the expected Apache Beam behavior? I'm not familiar with this library.

lhoestq commented 3 years ago

I don't know if this is related, but there is the issue on the Wikipedia processing that you reported at #2031 (the open PR is at #2037). Does the fix you proposed in #2037 help in your case?

And for your information, the DirectRunner of Apache Beam is not optimized for memory-intensive tasks, so you must be right that it uses memory proportional to the entire file.

miyamonz commented 3 years ago

#2037 doesn't solve my problem directly, but I found the culprit!

https://github.com/huggingface/datasets/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/datasets/wikipedia/wikipedia.py#L523 This beam.transforms.Reshuffle() causes the memory error.

It makes sense when I consider what the shuffle does: Beam's reshuffle seems to need to hold all the data in memory. Previously I suspected that this line caused the error, but at the time the other bug fixed in #2037 was also raising errors, so I couldn't isolate it.

Anyway, I commented out this line and ran load_dataset, and it works!

wiki = datasets.load_dataset(
    "./wikipedia.py",
    cache_dir="./datasets",
    beam_runner="DirectRunner",
    language="ja",
    date="20210120",
)["train"]

Dataset already has a shuffle function: https://github.com/huggingface/datasets/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/src/datasets/arrow_dataset.py#L2069 So, though I don't fully understand the difference, I think Beam's reshuffle isn't needed here. What do you think?
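
For example, continuing from the snippet above, it seems the shuffling can be done after loading instead (a small sketch; the seed value is arbitrary):

# Shuffle after loading with datasets' own Dataset.shuffle instead of relying
# on Beam's Reshuffle during preprocessing (seed chosen arbitrarily).
shuffled_wiki = wiki.shuffle(seed=42)
print(shuffled_wiki[0]["title"])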

lhoestq commented 3 years ago

The reshuffle is needed when you use parallelism. The objective is to redistribute the articles evenly across the workers, since the _extract_content step generates many articles per file. By using reshuffle, we can split the processing of the articles of one file across several workers. Without reshuffle, all the articles of one file would be processed on the same worker that read the file, making the whole process take a very long time.
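
Schematically, the idea is something like this toy pipeline (not the actual wikipedia.py code):

import apache_beam as beam

# Toy pipeline illustrating the role of Reshuffle: each input file fans out
# into many articles, and Reshuffle lets the per-article steps be redistributed
# across workers instead of staying fused to the worker that read the file.
with beam.Pipeline() as p:
    (
        p
        | "Files" >> beam.Create(["file_a", "file_b"])
        | "Extract content" >> beam.FlatMap(
            lambda f: (f"{f}:article-{i}" for i in range(1000))
        )
        | "Redistribute" >> beam.transforms.Reshuffle()  # rebalance across workers
        | "Clean articles" >> beam.Map(str.upper)  # stand-in for per-article work
    )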

lhoestq commented 3 years ago

Maybe the reshuffle step could be added only if the runner is not a DirectRunner?