man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Opening a library leaves a file open on the host OS #1904

Open · DrNickClarke opened 4 days ago

DrNickClarke commented 4 days ago

Describe the bug

(Reported by a client on 2024-10-09)

In the client's words: Reading a dataframe from ArcticDB opens a number of files on the host operating system. These files are not closed once the dataframe has been read. Sequentially reading dataframes leads to the operating system's limit on open files being breached and the process being killed.

Steps/Code to Reproduce

1. Setup Data Script

import numpy as np
import pandas as pd

from arcticdb import Arctic

# Initialize Arctic connection (LMDB storage under ./tmp/arcticdb_test)
arctic = Arctic("lmdb://tmp/arcticdb_test")
symbol = "test_symbol"

for x in range(0, 100):
    library_name = f"tmp.test_library_{x}_"
    library = arctic.get_library(library_name, create_if_missing=True)

    # create a random dataframe
    df = pd.DataFrame(np.random.rand(20, 60))

    # write the dataframe to the library
    library.write(symbol, df)

    if x % 10 == 0:
        print(f"Iteration {x} has DataFrame shape: {df.shape}")

list_of_libraries = arctic.list_libraries()
print(f"Number of libraries: {len(list_of_libraries)}")

2. Read Data Script

import subprocess

from arcticdb import Arctic

# Initialize Arctic connection
arctic = Arctic("lmdb://tmp/arcticdb_test")
symbol = "test_symbol"

def run_command(command: str) -> str:
    """Runs a shell command and returns the output as a string."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

def get_max_open_files() -> int:
    """Returns the maximum number of open files allowed by the system."""
    file_max_output = run_command("sysctl fs.file-max")
    return int(file_max_output.split(" ")[-1])

def get_current_open_files() -> int:
    """Returns the current number of open files."""
    return int(run_command("lsof | wc -l"))

# Get system limits and initialize tracking variables
max_open_files = get_max_open_files()
open_files_per_iteration = [get_current_open_files()]
data_frames_read_count = [0]

# Frequency of checking open files
check_files_every_n_iterations = 5

# Initialize a counter for successfully read dataframes
successful_reads = 0

# Iterate over libraries and read data
list_of_libraries = arctic.list_libraries()
for library_name in list_of_libraries:
    try:
        library = arctic.get_library(library_name)
        df = library.read(symbol=symbol).data
        successful_reads += 1  # Increment the counter for successful reads
    except Exception as e:
        print(f"Failed to read {library_name} for {symbol}: {e}")
        continue

    if successful_reads % check_files_every_n_iterations == 0:
        print(f"Dataframe #{successful_reads} has DataFrame shape: {df.shape}")

        current_open_files = get_current_open_files()
        open_files_per_iteration.append(current_open_files)
        data_frames_read_count.append(successful_reads)

        print(f"    Current number of open files: {current_open_files}")
        print(f"    Used resources: {current_open_files / max_open_files:.2%}")

    del df
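
As an aside: lsof | wc -l counts open files system-wide, whereas the limit a dying process usually hits first is its own per-process cap (RLIMIT_NOFILE). A minimal Linux-only sketch of a per-process measurement, which could replace get_current_open_files above:

import os
import resource

def get_process_open_files() -> int:
    """Count file descriptors currently open in this process (Linux /proc only)."""
    return len(os.listdir("/proc/self/fd"))

# Soft limit on open file descriptors for this process
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Open fds: {get_process_open_files()} / {soft_limit}")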

Expected Results

Charts of the open-file count from running the script:

[two chart images attached]

OS, Python Version and ArcticDB Version

Linux, ArcticDB 4.4.2

Backend storage used

AWS S3, LMDB

Additional Context

Note that the repro code opens a large number of distinct libraries.

DrNickClarke commented 4 days ago

@G-D-Petrov has investigated this and found the cause: library connections are cached, so each opened library keeps its files open. Reading many dataframes from a few libraries (which is the more typical use case) does not keep many files open.
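
For reference, a minimal sketch of that more typical pattern, reusing a single library handle rather than opening a new library per read (names taken from the repro scripts above):

from arcticdb import Arctic

arctic = Arctic("lmdb://tmp/arcticdb_test")

# One library handle, reused for every read: only one cached connection,
# so the open-file count stays roughly constant across reads.
library = arctic.get_library("tmp.test_library_0_")
for _ in range(100):
    df = library.read("test_symbol").data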

DrNickClarke commented 3 days ago

Further details: https://arcticdb.slack.com/archives/C064NA7BK5H/p1728482205597269