mlc-ai / tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece
Apache License 2.0

How to free encode ids without destructing tokenizer? #17

Closed: ShukantPal closed this issue 3 months ago

ShukantPal commented 8 months ago
    auto tokenizer = Tokenizer::FromBlobJSON(*blob);
    int i = 0;

    while (q->try_pop(next)) {
        if (next.empty())
            return;

        i++;

        // Encode returns the token ids for the current piece of text.
        std::vector<int32_t> tokens = tokenizer->Encode(next);
        std::cout << count_processed++ << ", tokens count " << tokens.size() << std::endl;

        // Workaround: re-create the tokenizer every 1000 samples so the
        // memory it retains is released.
        if (i >= 1000) {
            tokenizer = Tokenizer::FromBlobJSON(*blob);
            i = 0;
        }
    }

Hey there, for some reason it seems like we're hitting a memory leak somewhere in the tokenizer. We're trying to process 100k pieces of text across multiple threads and count the number of tokens.

However, it seems like the token data is not freed unless the entire tokenizer is destructed. We assume it's happening here:

[Screenshot: memory profiler output, 2023-10-12]

Is there a way to individually free the "encode ids" returned from tokenizer->Encode so we can use a single tokenizer instance per thread? Right now, we are re-creating the tokenizer after 1000 samples.

TTTdas commented 8 months ago
(quoting ShukantPal's report above)

Hi, I've encountered the same issue. Have you resolved it?

tqchen commented 8 months ago

Thanks for reporting it. As of now the tokenizer helper stores a local Vec; the token_ids field gets overwritten on each call, so my guess is that this currently results in a de-allocation and then re-allocation of the vector.
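
To make the mechanism concrete, here is a minimal C++ sketch of that retention pattern, with hypothetical names (the real wrapper lives in rust/src/lib.rs):

    // Hypothetical sketch (not the actual wrapper): the handle keeps one
    // buffer alive and overwrites it on every Encode call.
    #include <cstdint>
    #include <string>
    #include <vector>

    class TokenizerHandle {
     public:
      const std::vector<int32_t>& Encode(const std::string& text) {
        // Assigning a fresh vector frees the previous buffer and
        // allocates a new one; the last encoding stays alive until the
        // next call, or until the handle itself is destroyed.
        token_ids_ = BackendEncode(text);
        return token_ids_;
      }

     private:
      // Stand-in for the real backend call.
      std::vector<int32_t> BackendEncode(const std::string& text) {
        return std::vector<int32_t>(text.begin(), text.end());
      }

      std::vector<int32_t> token_ids_;  // retained between calls
    };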

tqchen commented 8 months ago

https://github.com/mlc-ai/tokenizers-cpp/pull/18 might alleviate the situation by reducing possible re-allocation (and, as a result, possible fragmentation). Note that the tokens are still preserved within the Tokenizer, but only the last encoded tokens.

If you want to remove even that (which I think is unnecessary unless you have a lot of tokenizers), then we need some extra command to clear the internal token state.
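
For illustration, continuing the hypothetical sketch above, the two ideas could look roughly like this (a sketch only, not the actual PR #18 change):

    // Buffer reuse: if the backend exposes a pointer/length view of the
    // result, assign() reuses the member buffer's capacity whenever the
    // new encoding fits, avoiding free/re-allocate churn.
    const std::vector<int32_t>& Encode(const int32_t* data, std::size_t len) {
      token_ids_.assign(data, data + len);  // capacity is preserved
      return token_ids_;
    }

    // An explicit command to fully release the retained tokens without
    // destroying the tokenizer (the "extra command" mentioned above):
    void ClearTokenStates() {
      std::vector<int32_t>().swap(token_ids_);  // releases capacity too
    }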

TTTdas commented 8 months ago

@tqchen I ran into another problem when using the huggingface tokenizer. After loading it into a unique_ptr and doing Encode and Decode, I destruct the current object. However, it seems that memory is not completely released, causing memory usage to increase continuously (this becomes noticeable in multi-threaded scenarios, where the increase is significant). A new unique_ptr is created in the same thread, and the previous one is not properly released, I guess? Only when the process ends is the memory completely freed. I constructed the huggingface tokenizer following the usage in example.cc. Is there any additional destruct operation needed?

tqchen commented 8 months ago

As of now we don't have temp memory in the wrapper except for the temp token ids, which should be released when the tokenizer is freed. So I don't have an idea as of now. If you have a way to profile a bit and see what gets retained, that would be helpful.

TTTdas commented 8 months ago
    void TokenizerDecoder::Decode(std::string sentence) {
        auto tokenizer = tokenizers::Tokenizer::FromBlobJSON(LoadBytesFromFile("json_path"));
        std::vector<int> ids = tokenizer->Encode(sentence);
        std::vector<int> result = process(ids);
        std::string final = tokenizer->Decode(result);
    }

As shown in the code above, I construct a TokenizerDecoder object for each thread separately. When the current sentence completes, the TokenizerDecoder is destructed, and a new object is constructed for the next sentence. However, memory usage increases rapidly.

On the contrary, if I load a global tokenizer and share it among threads, using std::move to convert it from a unique_ptr into a shared_ptr, the memory growth becomes quite slow.

    TokenizerDecoder::TokenizerDecoder(std::shared_ptr<tokenizers::Tokenizer> g_tokenizer)
        : m_tokenizer(g_tokenizer) {}

    void TokenizerDecoder::Decode(std::string sentence) {
        std::vector<int> ids = m_tokenizer->Encode(sentence);
        std::vector<int> result = process(ids);
        std::string final = m_tokenizer->Decode(result);
    }
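
For reference, a minimal sketch of that shared setup, assuming FromBlobJSON returns a std::unique_ptr<tokenizers::Tokenizer> and reusing the author's LoadBytesFromFile and TokenizerDecoder helpers (whether concurrent calls on one instance are safe is not established in this thread):

    #include <memory>

    // A std::unique_ptr rvalue converts directly into a shared_ptr, so
    // the global tokenizer can be created once and handed to every
    // thread's decoder.
    std::shared_ptr<tokenizers::Tokenizer> g_tokenizer =
        tokenizers::Tokenizer::FromBlobJSON(LoadBytesFromFile("json_path"));

    // Each thread then constructs its decoder from the same instance:
    //   TokenizerDecoder decoder(g_tokenizer);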

It seems that the second approach uses only one tokenizer resource, while the first does not fully release memory. So I would like to confirm whether this is an issue with how I'm using the tokenizer, possibly missing certain operations (e.g., an explicit free)? I'm currently following the usage in example.cc.

tqchen commented 8 months ago

I think free is likely called already when the tokenizer destructs: https://github.com/mlc-ai/tokenizers-cpp/blob/main/rust/src/lib.rs#L182

But you can double-check. I am not sure whether there are memory resources retained related to loading these files, though.

TTTdas commented 8 months ago

I'll check it again. Thanks!