truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License
2.07k stars 181 forks

TruLens on the cluster crashes and doesn't start; initialization leads to memory spike and pod restart #1260

Open LatVAlY opened 3 months ago

LatVAlY commented 3 months ago

Bug Description: After adding logs following the TruLens initialization (tru = Tru()), TruLens initializes, but memory spikes to a level where the pod runs out of memory. With TruLens disabled, the pod works fine.

I have increased the memory limit to 2G, but that didn't resolve the issue either. Environment: trulens-eval = "^0.31.0"

I don't know what other information would help resolve or debug the issue, but if you need more, I'm happy to provide it.

dosubot[bot] commented 3 months ago

Hey @LatVAlY! 🌟 I'm here to assist you with any bugs, questions, or contributions you have for the repository. Let's work together to resolve this issue!

To resolve the issue with TruLens causing a memory spike and pod restart on your cluster after initialization, you can use the memory_suggestions context manager to catch out-of-memory issues and provide suggestions for reducing memory usage. Here is an example of how you can use it:

from trulens.nn.backend import memory_suggestions

# Example usage
with memory_suggestions("batch size=1000"):
    # Your memory-intensive code here
    pass

Additionally, consider reducing memory-impactful parameters such as batch sizes, intervention sizes, and data types. The rebatch function can help you split your data into smaller batches:

from trulens.nn.backend import rebatch

# Example usage
for batch in rebatch(your_data, batch_size=500):
    # Process each batch
    pass
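The thread doesn't show rebatch's exact signature, so as a minimal sketch of the idea (fixed-size slicing; the function below is an assumption, not the library's implementation):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def rebatch(data: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    # Yield consecutive slices of at most batch_size items each;
    # the final slice may be shorter.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

batches = list(rebatch(list(range(1200)), batch_size=500))
print([len(b) for b in batches])  # [500, 500, 200]
```

Because it yields views one at a time rather than materializing all batches, only one slice plus the source sequence needs to be resident at once.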

Moreover, you can use custom memory management techniques as demonstrated in the CustomMemory class from the trulens_eval library. This class allows you to control memory allocation and delay, which can help in managing memory usage more effectively:

import sys
import time

class CustomMemory:

    def __init__(self, delay: float = 0.0, alloc: int = 1024 * 1024):
        self.alloc = alloc
        self.delay = delay

        # keep a chunk of data allocated permanently:
        self.temporary = [0x42] * self.alloc

        self.messages = []

    def remember(self, data: str):
        if self.delay > 0.0:
            time.sleep(self.delay)

        self.messages.append(
            data +
            f" and I'm keeping around {sys.getsizeof(self.temporary)} bytes"
        )
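For context on the f-string in remember above: sys.getsizeof reports only the shallow size of the retained list (header plus pointer array, roughly 8 bytes per slot, assuming 64-bit CPython), not the integers it references:

```python
import sys

# Same allocation as CustomMemory's default: a list of 1024 * 1024 ints.
alloc = 1024 * 1024
temporary = [0x42] * alloc

# Measures the list object only, not the objects it points to.
print(sys.getsizeof(temporary))
```

So the bytes figure CustomMemory logs understates true process memory; it is a lower bound useful for relative comparisons rather than an exact footprint.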

By using these tools and techniques, you can manage memory usage more effectively and prevent the memory spikes that lead to pod restarts [1][2].

To continue talking to Dosu, mention @dosu.

sfc-gh-jreini commented 2 weeks ago

Hi @LatVAlY - we aimed to address the root of this problem with TruLens 1.0. Was this ever resolved? If not, can you let us know if it still occurs with TruLens 1.0?