nimitpattanasri closed this issue 11 months ago.
Hi @nimitpattanasri,
Of course, you can build such a function with any tool you want. Here is a function that reads sample triplets from disk: the idea is to read chunksize rows at a time, yield them to train the model, and repeat this total times so that the whole file is read once.
import tarfile

from neural_cherche import utils


def iter(batch_size: int = 64, chunksize: int = 300, total: int = 1000):
    """Iterate over the triplets stored in the tar archive, chunk by chunk."""
    tar = tarfile.open("./drive/MyDrive/GPU/data/triplets.tar")
    file = tar.extractfile("data/msmarco/triplets/raw.tsv")
    for _ in range(total):
        # Read the next chunksize (anchor, positive, negative) rows.
        X = []
        for _ in range(chunksize):
            row = file.readline().decode().split("\t")
            X.append((row[0], row[1], row[2]))
        # Stream the chunk as training batches.
        yield from utils.iter(X=X, epochs=1, batch_size=batch_size, shuffle=True)
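For context, here is a minimal sketch of how the generator above can be consumed in a training loop. The model setup, the checkpoint name and the train.train_colbert keyword arguments are assumptions based on the library's documented pattern, not something taken from this thread; check the exact signatures for your installed version of neural_cherche.

import torch

from neural_cherche import models, train

# Assumed setup: any neural_cherche model trained on (anchor, positive, negative)
# batches should follow the same pattern. The checkpoint name is illustrative.
model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

# iter() is the generator defined above: it streams batches without
# ever loading the full triplets file into memory.
for anchor, positive, negative in iter(batch_size=64, chunksize=300, total=1000):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
    )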
You can also use pandas to read your file in chunks.
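A minimal sketch of the pandas variant, assuming the triplets live in a headerless TSV with anchor, positive and negative columns (the path and the function name iter_pandas are illustrative):

import pandas as pd

from neural_cherche import utils


def iter_pandas(
    path: str = "data/msmarco/triplets/raw.tsv",
    batch_size: int = 64,
    chunksize: int = 300,
):
    """Yield training batches by reading the TSV file chunk by chunk with pandas."""
    for chunk in pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["anchor", "positive", "negative"],
        chunksize=chunksize,
    ):
        # Convert the chunk to a list of (anchor, positive, negative) tuples
        # and stream it as training batches.
        X = list(chunk.itertuples(index=False, name=None))
        yield from utils.iter(X=X, epochs=1, batch_size=batch_size, shuffle=True)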
Thank you for the detailed explanation and the helpful code example.
Hi, I have triplets that do not fit in memory. Is there a way to avoid an OOM error, say, by loading "X" incrementally from a file instead?