In the client's words: Reading a dataframe from ArcticDB opens a number of files on the host operating system. These files are not closed once the dataframe has been read. Sequentially reading dataframes leads to the operating system's maximum-open-files limit being breached and the process being killed.
Steps/Code to Reproduce
1. Setup Data Script
from arcticdb import Arctic
import numpy as np
import pandas as pd

# Initialize Arctic connection
arctic = Arctic("lmdb://tmp/arcticdb_test")
symbol = "test_symbol"

for x in range(100):
    library_name = f"tmp.test_library_{x}_"
    library = arctic.get_library(library_name, create_if_missing=True)
    # Create a random dataframe
    df = pd.DataFrame(np.random.rand(20, 60))
    # Write the dataframe to the library
    library.write(symbol, df)
    if x % 10 == 0:
        print(f"Iteration {x} has DataFrame shape: {df.shape}")

list_of_libraries = arctic.list_libraries()
print(f"Number of libraries: {len(list_of_libraries)}")
2. Read Data Script
from arcticdb import Arctic
import subprocess

# Initialize Arctic connection
arctic = Arctic("lmdb://tmp/arcticdb_test")
symbol = "test_symbol"

def run_command(command: str) -> str:
    """Runs a shell command and returns the output as a string."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

def get_max_open_files() -> int:
    """Returns the maximum number of open files allowed by the system."""
    file_max_output = run_command("sysctl fs.file-max")
    return int(file_max_output.split(" ")[-1])

def get_current_open_files() -> int:
    """Returns the current number of open files."""
    return int(run_command("lsof | wc -l"))

# Get system limits and initialize tracking variables
max_open_files = get_max_open_files()
open_files_per_iteration = [get_current_open_files()]
data_frames_read_count = [0]

# Frequency of checking open files
check_files_every_n_iterations = 5

# Initialize a counter for successfully read dataframes
successful_reads = 0

# Iterate over libraries and read data
list_of_libraries = arctic.list_libraries()
for library_name in list_of_libraries:
    try:
        library = arctic.get_library(library_name)
        df = library.read(symbol=symbol).data
        successful_reads += 1  # Increment the counter for successful reads
    except Exception as e:
        print(f"Failed to read {library_name} for {symbol}: {e}")
        continue
    if successful_reads % check_files_every_n_iterations == 0:
        print(f"Dataframe #{successful_reads} has DataFrame shape: {df.shape}")
        current_open_files = get_current_open_files()
        open_files_per_iteration.append(current_open_files)
        data_frames_read_count.append(successful_reads)
        print(f"  Current number of open files: {current_open_files}")
        print(f"  Used resources: {current_open_files / max_open_files:.2%}")
    del df
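One caveat on the measurement itself: `lsof | wc -l` counts open-file records system-wide (one per output line, including the header), so it overstates what this single process holds. On Linux, the process's own descriptors can be counted directly from `/proc/self/fd`. A minimal sketch of that alternative check (not part of the client's script; Linux-only):

```python
import os

def get_process_open_fds() -> int:
    """Count file descriptors currently held by this process (Linux-only)."""
    # Each entry under /proc/self/fd is one open descriptor of this process.
    return len(os.listdir("/proc/self/fd"))

before = get_process_open_fds()
handle = open("/dev/null")  # opening a file raises the count by one
after = get_process_open_fds()
handle.close()
```

Swapping this in for `get_current_open_files()` makes the per-process growth easier to see against the `ulimit -n` soft limit rather than the system-wide `fs.file-max`.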
@G-D-Petrov has investigated this and found that the cause is that library connections are cached. Reading many dataframes from a few libraries (which is the more typical use case) does not keep many files open.
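If the unbounded cache of library connections is the culprit, a bounded cache that closes connections on eviction would cap the number of simultaneously open files. The sketch below is a generic LRU illustration of that idea, not ArcticDB's actual implementation; `BoundedLibraryCache` and `_DummyConn` are hypothetical names:

```python
from collections import OrderedDict

class BoundedLibraryCache:
    """Hypothetical LRU cache that closes connections it evicts."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, name, open_fn):
        if name in self._cache:
            # Cache hit: mark as most recently used.
            self._cache.move_to_end(name)
            return self._cache[name]
        conn = open_fn(name)
        self._cache[name] = conn
        if len(self._cache) > self.max_entries:
            # Evict the least recently used entry and release its file handles.
            _, evicted = self._cache.popitem(last=False)
            evicted.close()
        return conn

class _DummyConn:
    """Stand-in for a library connection; records whether close() was called."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True

cache = BoundedLibraryCache(max_entries=2)
conns = [cache.get(n, _DummyConn) for n in ("a", "b", "c")]
# "a" has been evicted and closed; "b" and "c" remain cached and open.
```

With such a bound in place, reading from many libraries sequentially would hold at most `max_entries` connections' worth of open files at any time.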
Describe the bug
(Reported by a client on 2024-10-09)
Expected Results
Charts of open files from running the script
OS, Python Version and ArcticDB Version
Linux; ArcticDB 4.4.2
Backend storage used
AWS S3, LMDB
Additional Context
The repro code opens a large number of libraries, each with its own cached connection.
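As a stopgap until the caching behaviour changes, a process that must touch many libraries can raise its soft descriptor limit toward the hard limit. This does not fix the leak, only delays the breach; a sketch assuming Linux, where raising soft up to hard needs no extra privileges:

```python
import resource

# Query the current soft and hard open-file limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit to the hard limit (allowed without privileges).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

The same effect can be had from the shell with `ulimit -n` before launching the process.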