Closed ShukantPal closed 3 months ago
auto tokenizer = Tokenizer::FromBlobJSON(*blob);
int i = 0;
while (q->try_pop(next)) {
  if (next.empty()) return;
  i++;
  std::vector<int> tokens(tokenizer->Encode(next));
  std::cout << count_processed++ << ", tokens count " << tokens.size() << std::endl;
  if (i >= 1000) {
    tokenizer = Tokenizer::FromBlobJSON(*blob);
    i = 0;
  }
}
Hey there, for some reason it seems like we're hitting a memory leak somewhere in the tokenizer. We're trying to process 100k pieces of text across multiple threads and count the number of tokens.
However, it seems like the token data is not freed unless the entire tokenizer is destructed. We assume it's happening here:
Is there a way to individually free the "encode ids" returned from tokenizer->Encode so we can use a single tokenizer instance per thread? Right now, we are re-creating the tokenizer after 1000 samples.
Hi, I've encountered the same issue, have you resolved it?
Thanks for reporting it. As of now the tokenizer helper stores a local Vec; the token_ids field gets overridden on each Encode call, so my guess is that this currently results in a de-allocation and then re-allocation of the vector.
https://github.com/mlc-ai/tokenizers-cpp/pull/18 might alleviate the situation by reducing possible re-allocations (and, as a result, possible fragmentation). Note that tokens are still preserved within the Tokenizer, but only the last encoded tokens.
If you want to remove that retained copy as well (which I think is unnecessary unless you have a lot of tokenizers), then we would need some extra command to clear the internal token state.
@tqchen I met another problem when using the HuggingFace tokenizer. After loading it with a unique_ptr and doing Encode and Decode, I destruct the current object. However, it seems that memory is not completely released, causing memory usage to increase continuously (this becomes noticeable in multi-threaded scenarios, with a significant memory increase). A new unique_ptr is created in the same thread, and the previous one is not properly released, I guess? Only when the process ends is the memory completely freed. I constructed the HuggingFace tokenizer following the usage in example.cc. Is there any additional destruct operation needed?
As of now we don't have temp memory in the wrapper except for the temp token ids, which should be released once the tokenizer is freed. So I don't have an idea as of now. If you have a way to profile a bit and see what gets retained, that would be helpful.
void TokenizerDecoder::Decode(std::string sentence) {
  auto tokenizer = tokenizers::Tokenizer::FromBlobJSON(LoadBytesFromFile("json_path"));
  std::vector<int> ids = tokenizer->Encode(sentence);
  std::vector<int> result = process(ids);
  std::string final = tokenizer->Decode(result);
}
As shown in the code above, I construct a TokenizerDecoder object for each thread separately. When the current sentence completes, the TokenizerDecoder is destructed, and a new object is constructed for the next processing. However, the memory usage increases rapidly. On the contrary, if I load a global tokenizer and share it among threads, using std::move to convert it from a unique_ptr into a shared_ptr, the memory increase becomes quite slow.
TokenizerDecoder::TokenizerDecoder(std::shared_ptr<tokenizers::Tokenizer> g_tokenizer)
    : m_tokenizer(g_tokenizer) {}

void TokenizerDecoder::Decode(std::string sentence) {
  std::vector<int> ids = m_tokenizer->Encode(sentence);
  std::vector<int> result = process(ids);
  std::string final = m_tokenizer->Decode(result);
}
It seems that the second way uses only one tokenizer's resources, while the first does not fully release the memory. So, I would like to confirm whether it's an issue with how I'm using the tokenizer, possibly missing certain operations (e.g., free)? I'm currently following the usage in example.cc.
I think free is likely already called when the tokenizer destructs, https://github.com/mlc-ai/tokenizers-cpp/blob/main/rust/src/lib.rs#L182 but you can double-check. I am not sure if there are memory resources retained related to loading these files though.
I'll check it again. Thanks!