miyamonz opened this issue 3 years ago
Hi !
Apache Beam is a framework used to define data transformation pipelines. These pipelines can then be run on many runtimes: Dataflow, Spark, Flink, etc. There is also a local runner called the DirectRunner. Wikipedia is a dataset that requires some parsing, so to allow the processing to be run on this kind of runtime we're using Apache Beam.
At Hugging Face we've already processed certain versions of Wikipedia (the 20200501.en one for example) so that users can directly download the processed version instead of using Apache Beam to process it.
However, for the Japanese language we haven't processed it, so you'll have to run the processing on your side.
So you do need Apache Beam to process 20200501.ja.
You can install Apache Beam with
pip install apache-beam
I think we can probably improve the error message to let users know of this subtlety. What #498 implied is that Apache Beam is not needed when you process a dataset that doesn't use Apache Beam.
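For example, something like this should kick off the processing locally with the DirectRunner (a sketch only; the language and date values here are just an illustration):

import datasets

# Requires: pip install apache-beam (plus mwparserfromhell for the wiki markup parsing)
wiki = datasets.load_dataset(
    "wikipedia",
    language="ja",
    date="20200501",
    beam_runner="DirectRunner",  # run the Beam pipeline on the local machine
)
print(wiki)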
Thanks for your reply! I understand now.
I tried again after installing apache-beam and adding beam_runner="DirectRunner".
Another package, mwparserfromhell, is also required, so I installed that too.
But it still failed: the process exited with code 1 and no error message.
import datasets
# BTW, the 20200501.ja config doesn't exist for wikipedia, so I specified the date argument instead
wiki = datasets.load_dataset("wikipedia", language="ja", date="20210120", cache_dir="./datasets", beam_runner="DirectRunner")
print(wiki)
and its log is below
Using custom data configuration 20210120.ja
Downloading and preparing dataset wikipedia/20210120.ja-date=20210120,language=ja (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to ./datasets/wikipedia/20210120.ja-date=20210120,language=ja/0.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63...
Killed
I also tried on another machine, because the failure might have been caused by insufficient resources.
$ python main.py
Using custom data configuration 20210120.ja
Downloading and preparing dataset wikipedia/20210120.ja-date=20210120,language=ja (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to ./datasets/wikipedia/20210120.ja-date=20210120,language=ja/0.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63...
Traceback (most recent call last):
File "main.py", line 3, in <module>
wiki = datasets.load_dataset("wikipedia", language="ja", date="20210120", cache_dir="./datasets", beam_runner="DirectRunner")
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/load.py", line 609, in load_dataset
builder_instance.download_and_prepare(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/builder.py", line 526, in download_and_prepare
self._download_and_prepare(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/datasets/builder.py", line 1069, in _download_and_prepare
pipeline_results = pipeline.run()
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 561, in run
return self.runner.run_pipeline(self, self._options)
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 126, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 182, in run_pipeline
self._latest_run_result = self.run_via_runner_api(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 193, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 358, in run_stages
stage_results = self._run_stage(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 549, in _run_stage
last_result, deferred_inputs, fired_timers = self._run_bundle(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 595, in _run_bundle
result, splits = bundle_manager.process_bundle(
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 888, in process_bundle
self._send_input_to_worker(process_bundle_id, transform_id, elements)
File "/home/miyamonz/.cache/pypoetry/virtualenvs/try-datasets-4t4JWXxu-py3.8/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 765, in _send_input_to_worker
data_out.write(byte_stream)
File "apache_beam/coders/stream.pyx", line 42, in apache_beam.coders.stream.OutputStream.write
File "apache_beam/coders/stream.pyx", line 47, in apache_beam.coders.stream.OutputStream.write
File "apache_beam/coders/stream.pyx", line 109, in apache_beam.coders.stream.OutputStream.extend
AssertionError: OutputStream realloc failed.
Hi @miyamonz,
I tried replicating this issue using the same snippet used by you. I am able to download the dataset without any issues, although I stopped it in the middle because the dataset is huge.
Based on a similar issue here, it could be related to your environment setup, although I am just guessing here. Can you share these details?
Thanks for your reply, and sorry for my late response.
Here is my local machine's environment info:
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
GPU: RTX 2070 Super
Inside WSL there is no nvidia-smi command; I don't know why.
But torch.cuda.is_available() is True, and GPU usage goes up when I run ML training code, so I think it works.
From PowerShell, nvidia-smi.exe is available and its output is below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.05 Driver Version: 470.05 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:09:00.0 On | N/A |
| 0% 30C P8 19W / 175W | 523MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1728 C+G Insufficient Permissions N/A |
| 0 N/A N/A 3672 C+G ...ekyb3d8bbwe\YourPhone.exe N/A |
| 0 N/A N/A 6304 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 8648 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 9536 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 10668 C+G ...5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 10948 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 11988 C+G ...8wekyb3d8bbwe\Cortana.exe N/A |
| 0 N/A N/A 12464 C+G ...cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 13280 C+G ...upport\CEF\Max Helper.exe N/A |
| 0 N/A N/A 15948 C+G ...t\GoogleIMEJaRenderer.exe N/A |
| 0 N/A N/A 16128 C+G ...ram Files\Slack\Slack.exe N/A |
| 0 N/A N/A 19096 C+G ...8bbwe\WindowsTerminal.exe N/A |
+-----------------------------------------------------------------------------+
I'm not sure what information I should share in a case like this. If this isn't enough, please tell me which commands to run.
I investigated further and found two issues.
I filed the first one as a new issue: https://github.com/huggingface/datasets/issues/2031
The error I mentioned in my previous comment, which occurred on my local machine, is no longer occurring.
But it still fails. In the previous comment I wrote that AssertionError: OutputStream realloc failed happened on another machine; it also happens on my local machine.
Here's what I've tried.
wikipedia.py downloads xml.bz2 files based on dumpstatus.json. For the Japanese Wikipedia dump I specified, it will download these 6 files:
https://dumps.wikimedia.org/jawiki/20210120/dumpstatus.json
The JSON, filtered the way wikipedia.py filters it, is below (a sketch of how to reproduce this filtering follows the JSON).
{
"jobs": {
"articlesmultistreamdump": {
"files": {
"jawiki-20210120-pages-articles-multistream1.xml-p1p114794.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream1.xml-p1p114794.bz2"
},
"jawiki-20210120-pages-articles-multistream2.xml-p114795p390428.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream2.xml-p114795p390428.bz2"
},
"jawiki-20210120-pages-articles-multistream3.xml-p390429p902407.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream3.xml-p390429p902407.bz2"
},
"jawiki-20210120-pages-articles-multistream4.xml-p902408p1721646.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream4.xml-p902408p1721646.bz2"
},
"jawiki-20210120-pages-articles-multistream5.xml-p1721647p2807947.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream5.xml-p1721647p2807947.bz2"
},
"jawiki-20210120-pages-articles-multistream6.xml-p2807948p4290013.bz2": {
"url": "/jawiki/20210120/jawiki-20210120-pages-articles-multistream6.xml-p2807948p4290013.bz2"
}
}
}
}
}
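For reference, here's a rough sketch of how this filtered view can be reproduced. It only mirrors the description above; the .xml/.bz2 name check is my approximation, not the exact filtering code in wikipedia.py.

import json
from urllib.request import urlopen

# Dump status for the Japanese Wikipedia dump specified above.
DUMP_STATUS_URL = "https://dumps.wikimedia.org/jawiki/20210120/dumpstatus.json"

with urlopen(DUMP_STATUS_URL) as resp:
    status = json.load(resp)

# Keep only the multistream article dumps (the 6 .xml.bz2 files listed above);
# the index files contain ".txt" in their names, so they are skipped.
files = status["jobs"]["articlesmultistreamdump"]["files"]
xml_urls = [
    "https://dumps.wikimedia.org" + info["url"]
    for name, info in files.items()
    if ".xml" in name and name.endswith(".bz2")
]
print("\n".join(sorted(xml_urls)))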
So, I tried running it with less input by modifying this line:
https://github.com/huggingface/datasets/blob/13a5b7db992ad5cf77895e4c0f76595314390418/datasets/wikipedia/wikipedia.py#L524
I changed it like this, just slicing the filepaths list:
| "Initialize" >> beam.Create(filepaths[:1])
I also added a print line inside the loop of _extract_content, like this:
if i % 100000 == 0: print(i)
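To make that second modification concrete, the pattern is just a coarse progress counter around whatever _extract_content iterates over (a self-contained illustration, not the actual wikipedia.py code):

def iter_with_progress(items, every=100_000):
    # Yield items unchanged, printing a counter every `every` items.
    for i, item in enumerate(items):
        if i % every == 0:
            print(i)  # coarse progress indicator
        yield item

# Usage sketch: wrap the element stream that _extract_content loops over.
for _ in iter_with_progress(range(250_000)):
    pass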
First, without any modification, it always stops after all of _extract_content is done.
With filepaths[:1] it succeeded.
With filepaths[:2] it failed.
I didn't try every combination because each run takes a long time, but it seems to succeed when the total file size is small.
So, at least, it isn't an issue specific to one file.
I don't know if this is true, but I think that when the beam writer writes to a file, its memory consumption depends on the size of the entire file. Is that the expected Apache Beam behavior? I'm not familiar with this library.
I don't know if this is related, but there is the issue with the wikipedia processing that you reported at #2031 (the open PR is at #2037). Does the fix you proposed in #2037 help in your case?
And for information, the DirectRunner of Apache Beam is not optimized for memory-intensive tasks, so you must be right when you say that it uses memory for the entire file.
#2037 doesn't solve my problem directly, but I found the culprit!
https://github.com/huggingface/datasets/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/datasets/wikipedia/wikipedia.py#L523
This beam.transforms.Reshuffle() causes the memory error.
It makes sense when I think about what a shuffle means: Beam's reshuffle seems to need to hold all the data in memory. I had suspected this line before, but at that time the other bug reported in #2037 was also producing errors, so I couldn't isolate it.
Anyway, I commented out this line and ran load_dataset, and it works!
wiki = datasets.load_dataset(
    "./wikipedia.py",
    cache_dir="./datasets",
    beam_runner="DirectRunner",
    language="ja",
    date="20210120",
)["train"]
Dataset already has a shuffle function: https://github.com/huggingface/datasets/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/src/datasets/arrow_dataset.py#L2069
So, although I may not understand the difference correctly, I think Beam's reshuffle isn't needed. What do you think?
The reshuffle is needed when you use parallelism.
The objective is to redistribute the articles evenly across the workers, since the _extract_content step generates many articles per file. By using reshuffle, we can split the processing of the articles of one file across several workers. Without reshuffle, all the articles of one file would be processed on the same worker that read the file, making the whole process take a very long time.
Maybe the reshuffle step can be added only if the runner is not a DirectRunner?
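Roughly, that suggestion could look something like this (a sketch only, not the actual wikipedia.py code; the beam_runner argument and the helper name are made up for illustration):

import apache_beam as beam

def build_wiki_collection(pipeline, filepaths, extract_fn, beam_runner=None):
    # Build the extraction part of the pipeline; extract_fn stands in for
    # the _extract_content function defined in wikipedia.py.
    pcoll = (
        pipeline
        | "Initialize" >> beam.Create(filepaths)
        | "Extract content" >> beam.FlatMap(extract_fn)
    )
    if beam_runner != "DirectRunner":
        # Redistribute articles across workers; skipped for the single-machine DirectRunner.
        pcoll = pcoll | "Reshuffle" >> beam.transforms.Reshuffle()
    return pcoll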
Then ModuleNotFoundError: No module named 'apache_beam' happened. The error doesn't appear when it's '20200501.en'. I don't know Apache Beam, but according to #498 it isn't necessary when the dataset is saved locally. Is that correct?