tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Need help building LVIS locally #5113

Open JKelle opened 11 months ago

JKelle commented 11 months ago

What I need help with / What I was wondering
I need help downloading the LVIS dataset to my EC2 instance.

What I've tried so far
First, I copied the changes from https://github.com/tensorflow/datasets/pull/5094. Then I tried using the SDK to download_and_prepare the dataset as follows:

import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
builder.download_and_prepare()

I also tried passing additional options to the DirectRunner:

import apache_beam as beam
import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)

After roughly 10 minutes I can see 4 CPUs at near 100% utilization, so I think the builder is working. It runs for a while (30 minutes to a couple of hours, depending on how many workers I specify), then either hits an error or runs out of memory and gets killed. If I remember correctly, this dataset is about 25 GB in size. My machine has 64 GB of RAM.

It would be nice if...
It would be most convenient for me if I could just download an already-built version of the dataset, so I could avoid building it myself. I don't really understand what goes on during the build; I just need this dataset locally in TFDS format so I can train a model that has been written to consume this dataset in this format. I'd rather not have to learn Apache Beam and set up Google Cloud infrastructure just to get a 25 GB dataset.

If that's not possible, then it would be nice if I could build the LVIS dataset locally more easily.

Environment information (if applicable)

marcenacp commented 11 months ago

Thanks for reaching out. Unfortunately, we cannot host prepared datasets ourselves, because some datasets have specific licensing terms.

As far as your issue is concerned, Beam usually allows building much bigger datasets, so the problem seems specific to this one dataset and the code in its builder. We'd like to understand why it runs out of memory. Would it be possible for you to run a heap inspection [1] before it crashes and report your findings?

[1] Python has a few built-in tools and third-party libraries for this, e.g. the standard-library tracemalloc module, guppy3, or memory_profiler.
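
As a concrete starting point, here is a minimal sketch of such a heap inspection using the standard-library tracemalloc (the builder call mirrors the snippets above; the frame depth and top-10 cutoff are arbitrary choices). Note that a hard OOM kill (SIGKILL) would skip the finally block, so it may help to lower the worker count first so the failure surfaces as a Python error:

import tracemalloc

import tensorflow_datasets as tfds

tracemalloc.start(25)  # record allocations with up to 25 stack frames each

builder = tfds.builder("lvis")
try:
    builder.download_and_prepare()
finally:
    # Print the ten call sites holding the most memory at exit/crash time.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)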

Rahulraj0308 commented 8 months ago

@marcenacp, is the dataset hosted on a cloud service like AWS or Google Cloud Storage? If it is, we could download it directly from there. Also, since you're running into memory limitations, you might consider using an EC2 instance with more memory, such as an r6g.4xlarge or larger, to accommodate the dataset processing.
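
On that note, TFDS does host prepared copies of some datasets on its public GCS bucket; whether LVIS is among them is an assumption to verify, but the check is a one-liner:

import tensorflow_datasets as tfds

# try_gcs=True loads a prepared copy from the public TFDS bucket
# (gs://tfds-data/datasets) when one exists; otherwise TFDS falls back
# to downloading and building locally as before.
ds = tfds.load("lvis", try_gcs=True)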

phamnhuvu-dev commented 7 months ago

I have the same problem. My machine: WSL2, 64 GB RAM, RTX 4090 GPU.

(owl_vit) phamnhuvu@PhamNhuVu:~$ python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_datasets as tfds
>>> ds = tfds.load('lvis')
2024-02-28 19:21:30.443798: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 19:21:30.464700: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-28 19:21:30.464747: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-28 19:21:30.465379: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-28 19:21:30.468686: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-28 19:21:30.832186: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-28 19:21:31.503426: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 25.35 GiB (download: 25.35 GiB, generated: 23.04 GiB, total: 48.39 GiB) to /home/phamnhuvu/tensorflow_datasets/lvis/1.3.0...
Extraction completed...: 0 file [00:00, ? file/s]██████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 1404.12 url/s]
Dl Size...: 100%|█████████████████████████████████████████████████████████████| 27215681797/27215681797 [00:00<00:00, 5140540530662.18 MiB/s]
Dl Completed...: 100%|█████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 1285.92 url/s]
Generating splits...:   0%|                                                                                       | 0/4 [00:00<?, ? splits/s]WARNING:absl:**************************** WARNING *********************************
Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset
**********************************************************************
2024-02-28 19:29:01.039584: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.184540: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.184598: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.186879: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.186919: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.186943: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.354102: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.354154: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.354169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-02-28 19:29:01.354211: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 19:29:01.354253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21458 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
Killed
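
For what it's worth, the WARNING in the log above is about exactly this: tfds.load was called without any Beam configuration. A sketch of passing one through tfds.load (using its download_and_prepare_kwargs passthrough; the flag values are copied from the first comment and are not a confirmed fix for the OOM):

import apache_beam as beam
import tensorflow_datasets as tfds

# tfds.load forwards download_and_prepare_kwargs to
# builder.download_and_prepare(), so the DirectRunner config from the
# first comment can be reused here.
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
ds = tfds.load(
    "lvis",
    download_and_prepare_kwargs={
        "download_config": tfds.download.DownloadConfig(
            beam_runner="DirectRunner",
            beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
        )
    },
)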
rishabh-akridata commented 7 months ago

Hello, has anyone found a solution for this? I am facing the same issue: the process gets killed after getting stuck in Apache Beam for some time.

phamnhuvu-dev commented 7 months ago

Killed after running for 81 minutes: https://github.com/tensorflow/datasets/assets/22906656/427e2934-11ed-40bf-8e7e-8dd599f66005

phamnhuvu-dev commented 7 months ago

I don't have any problems with the coco dataset.

2024-02-28 18:20:46.938048: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 18:20:46.960985: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
2024-02-28 18:20:48.154348: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 19.57 GiB (download: 19.57 GiB, generated: Unknown size, total: 19.57 GiB) to /root/tensorflow_datasets/coco/2017_panoptic/1.1.0...
Extraction completed...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [06:41<00:00, 100.48s/ file]
Dl Size...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20152447948/20152447948 [06:41<00:00, 50141956.99 MiB/s]
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [06:41<00:00, 133.97s/ url]
Extraction completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 118287/118287 [00:50<00:00, 2351.85 file/s]
Extraction completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2260.43 file/s]
Dataset coco downloaded and prepared to /root/tensorflow_datasets/coco/2017_panoptic/1.1.0. Subsequent calls will reuse this data.                                                           
2024-02-28 18:31:43.882797: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node      
Your kernel may have been built without NUMA support.
2024-02-28 18:31:43.887190: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:43.887240: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:43.889959: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:43.889994: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:43.890015: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:44.073422: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:44.073472: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:44.073487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1725] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-02-28 18:31:44.073511: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-28 18:31:44.073544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1638] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21194 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
rishabh-akridata commented 7 months ago

@phamnhuvu-dev Okay, but I want to replicate the results on the LVIS dataset only. Any other workaround for this to bypass this issue?

phamnhuvu-dev commented 6 months ago

@rishabh-akridata I use the COCO dataset instead of the LVIS dataset.

rohit901 commented 6 months ago

It would be super helpful if we could download a pre-built LVIS val split in TFDS format. Does anyone have a link for it?

rohit901 commented 6 months ago

I tried increasing the number of workers on my machine, since it has around 256 cores and 200 GB of memory, but I am still not able to build the TFDS dataset for the LVIS val split.

Can you please guide me? I need this TFDS dataset of the LVIS val split.

2024-03-06 11:24:53.330928: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 25.35 GiB (download: 25.35 GiB, generated: 23.04 GiB, total: 48.39 GiB) to /l/users/rohit.bharadwaj/RNCDL_extras/owl_vit/data/lvis/1.3.0...
Extraction completed...: 0 file [00:00, ? file/s]
Dl Size...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27215681797/27215681797 [00:00<00:00, 441532968804.31 MiB/s]
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 112.66 url/s]
WARNING:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1709710635   nanos: 954519271 } message: "No semi_persistent_directory found: Functions defined in __main__ (interactive session) may fail." log_location: "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py:361" thread: "MainThread"
WARNING:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1709710635   nanos: 956524848 } message: "Discarding unparseable args: ['--direct_runner_use_stacked_bundle']" log_location: "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/options/pipeline_options.py:372" thread: "MainThread"

Traceback (most recent call last):
  File "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/runners/worker/data_plane.py", line 669, in _read_inputs
  File "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/grpc/_channel.py", line 542, in __next__
    return self._next()
  File "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/grpc/_channel.py", line 968, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:34189 {created_time:"2024-03-06T11:45:33.860667744+04:00", grpc_status:14, grpc_message:"Socket closed"}"
>

[The same StatusCode.UNAVAILABLE / "Socket closed" error is raised a dozen more times from other Beam worker threads (peers :34189 and :37189); the duplicated, interleaved tracebacks are omitted.]
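
One sketch worth trying here (an assumption, not a confirmed fix): the "Socket closed" errors come from the gRPC channels between DirectRunner worker processes, and the DirectRunner also supports a multi_threading mode that keeps all workers in one process and avoids those cross-process channels:

import apache_beam as beam
import tensorflow_datasets as tfds

# multi_threading runs all DirectRunner workers inside a single process,
# avoiding the cross-process gRPC data channel that is reporting
# "Socket closed" above. Whether this also avoids the OOM is untested;
# it trades process isolation for a single shared heap.
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_threading"]
builder = tfds.builder("lvis")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)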
phamnhuvu-dev commented 6 months ago

https://github.com/tensorflow/datasets/assets/22906656/d5dee028-1baa-453d-a61a-9297e465169a

@marcenacp There is a problem at the extraction step.

MattLiutt commented 4 months ago

> @rishabh-akridata I use the COCO dataset instead of the LVIS dataset.

May I know if you tried training/fine-tuning on your own dataset?