raphaelsty / neural-cherche

Neural Search
https://raphaelsty.github.io/neural-cherche/
MIT License

Load the triplets input incrementally #8

Closed: nimitpattanasri closed this issue 11 months ago

nimitpattanasri commented 11 months ago

Hi, I have triplets that do not fit in memory. Is there a way to avoid OOM, say, by loading "X" incrementally from a file instead?

raphaelsty commented 11 months ago

Hi @nimitpattanasri,

Of course, you can build such a function with any tool you want. Here is a function that reads sample triplets from disk. The idea is to read `chunksize` rows at a time, yield them to train the model, and repeat this `total` times so the whole file is read exactly once.

import tarfile

from neural_cherche import utils

def iter_triplets(batch_size: int = 64, chunksize: int = 300, total: int = 1000):
    """Iterate over the tar file, yielding batches of triplets chunk by chunk."""
    tar = tarfile.open("./drive/MyDrive/GPU/data/triplets.tar")
    file = tar.extractfile("data/msmarco/triplets/raw.tsv")

    for _ in range(total):
        # Keep at most `chunksize` triplets in memory at a time.
        X = []

        for _ in range(chunksize):
            # Each line is a tab-separated (anchor, positive, negative) triplet.
            row = file.readline().decode().rstrip("\n").split("\t")
            X.append((row[0], row[1], row[2]))

        # Shuffle and batch the current chunk before reading the next one.
        yield from utils.iter(X=X, epochs=1, batch_size=batch_size, shuffle=True)
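
If it helps, here is how the generator above can feed a training loop. This is a minimal sketch assuming the ColBERT training loop from the README; the model name and the hyper-parameters are illustrative:

import torch

from neural_cherche import models, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",  # illustrative checkpoint
    device="cuda" if torch.cuda.is_available() else "cpu",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

for step, (anchor, positive, negative) in enumerate(iter_triplets()):
    # One forward/backward pass on the current batch of triplets.
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )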

You can also use pandas to read your file in chunks, as in the sketch below.
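
A minimal sketch of the pandas approach: the path and the headerless three-column TSV layout are assumptions, and `iter_triplets_pandas` is just an illustrative name:

import pandas as pd

from neural_cherche import utils

def iter_triplets_pandas(path: str, batch_size: int = 64, chunksize: int = 300):
    """Yield batches of triplets from a headerless TSV, one chunk at a time."""
    for chunk in pd.read_csv(path, sep="\t", header=None, chunksize=chunksize):
        # Convert each row to a plain (anchor, positive, negative) tuple.
        X = list(chunk.itertuples(index=False, name=None))
        yield from utils.iter(X=X, epochs=1, batch_size=batch_size, shuffle=True)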

nimitpattanasri commented 11 months ago

Thank you for the detailed explanation and the helpful code example.