Closed: @Atakey closed this 7 months ago
Hey @Atakey, when you say that the above part is generated by gpt-4, which part do you refer to? This otherwise sounds interesting; could you share a full reproducer? (In this case the sentences are not actually created.)
I found this issue in `encode_batch`, but since I am not familiar with the Rust language, I discussed it with gpt-4 and let it assist in writing up the issue.
You can randomly generate sentences; you shouldn't use something like `['Here is an example.'] * 20_000_000`. Note that random generation will be slower than loading sentences from a file.
```python
import numpy as np

def generate_random_sentences(size: int, length: int = 10, word_length: int = 3):
    """
    Generates a specified number of random sentences.

    Parameters:
        size (int): The number of random sentences.
        length (int): The number of words per sentence, defaults to 10.
        word_length (int): The length of each word, defaults to 3.

    Returns:
        list[str]: The generated random sentences.
    """
    # Draw lowercase ASCII letter codes (97-122) for every character.
    sentence_chars = np.random.randint(97, 123, (size, length, word_length))
    return [
        " ".join("".join(chr(char) for char in word) for word in sentence)
        for sentence in sentence_chars
    ]
```
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I also ran into this. When batching more than one input, memory leaks. I worked around it by using multiprocessing to call many instances of the tokenizer on single inputs, which isn't much slower than the intended batching.
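A minimal sketch of that workaround: fan single-input encode calls out across worker processes instead of one large batch call. Since the thread doesn't show the actual `Tokenizer` setup, `fake_encode` below is a stand-in for `tokenizer.encode`; a real worker would construct or inherit its own tokenizer instance.

```python
from multiprocessing import Pool

def fake_encode(sentence: str) -> list:
    # Stand-in for tokenizer.encode(sentence); replace with a real
    # per-worker Tokenizer call in practice.
    return sentence.split()

def encode_many(sentences: list, workers: int = 4) -> list:
    # Each process handles single inputs, so no batch ever accumulates
    # inside one tokenizer call.
    with Pool(workers) as pool:
        return pool.map(fake_encode, sentences, chunksize=256)

if __name__ == "__main__":
    print(encode_many(["here is an example", "another one"]))
```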
I'm confused. The example doesn't seem to show any memory leak; in fact, the "leak" version uses less memory.
Also please complete the script all the way to make it reproducible.
Problem
I have encountered a memory leak in the `encode_batch` function: https://github.com/huggingface/tokenizers/blob/11462596d11d886e501091e28acbe1174385087a/bindings/python/src/tokenizer.rs#L1015. The leak becomes more apparent as the number of text inputs increases, suggesting a correlation between the size of the leak and the number of texts processed. Maybe this should not be called a memory leak, but it is indeed a problem.
Steps to Reproduce
1. Prepare a large batch of sentences, then use the `encode_batch` function to process this dataset.
2. When calling `encode_batch(batch_sentences)`, there is a noticeable increase in memory usage, which does not happen when using `encode_batch(["".join(sentence) for sentence in batch_sentences])`.

Simple code to reproduce
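The full reproduction script is not included in the thread. The following is a hedged sketch of what it presumably looked like; the measurement part uses only the standard library, while `reproduce` assumes the `tokenizers` package, a serialized tokenizer at a hypothetical `tokenizer.json` path, and the `generate_random_sentences` helper from the snippet above.

```python
import resource

def rss_kb() -> int:
    # Peak resident set size of this process; ru_maxrss is in kB on Linux
    # (bytes on macOS) and only ever grows, so deltas show net growth.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def rss_delta_kb(fn) -> int:
    # Run fn and report how much the peak RSS grew while it executed.
    before = rss_kb()
    fn()
    return rss_kb() - before

def reproduce():
    # Not executed here: assumes tokenizers is installed, a tokenizer file
    # exists at "tokenizer.json" (hypothetical), and generate_random_sentences
    # is in scope.
    from tokenizers import Tokenizer
    tokenizer = Tokenizer.from_file("tokenizer.json")
    batch_sentences = generate_random_sentences(1_000_000)
    direct = rss_delta_kb(lambda: tokenizer.encode_batch(batch_sentences))
    copied = rss_delta_kb(
        lambda: tokenizer.encode_batch(["".join(s) for s in batch_sentences])
    )
    print(f"direct: +{direct} kB, joined copies: +{copied} kB")
```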
Potential Cause
It appears that the issue might be related to how the memory backing the text data is managed when passed between Python and Rust. When passing `batch_text` directly to `encode_batch`, the Rust functions seem to receive Python-allocated memory directly, which they might not manage correctly. In contrast, using `["".join(text) for text in batch_text]` creates new string objects in Python, which are then passed to Rust, possibly allowing Rust to manage these memory locations more effectively.
Suggested Solution
A possible solution could involve ensuring that Rust takes full control of the memory management of the strings it processes. This might require modifying the `encode_batch` function to deep copy the string data from Python before processing, allowing Rust to handle allocation and deallocation independently.
Note: The above content was generated by gpt-4.