openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Python memory usage #242

Closed logan-markewich closed 8 months ago

logan-markewich commented 8 months ago

Am I measuring this properly? Is tiktoken using 500MB of memory?

If so, is there any way to control this?

[screenshot of Colab memory usage omitted]

GaurangTandon commented 8 months ago

Hi @logan-markewich did you find any workaround for this?

The increased memory usage tends to crash small Docker containers, so this is a real bottleneck to adopting tiktoken. Is there an alternative for token counting that uses 10x less memory, even at slightly reduced accuracy?
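One crude workaround, if exact counts are not required, is a character-based estimate. The ~4 characters per token figure below is a common rule of thumb for English text, not anything tiktoken guarantees, so treat this as a sketch rather than a drop-in replacement:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token-count estimate using the common ~4-chars-per-token
    rule of thumb for English text. No BPE rank files are loaded, so
    memory overhead is negligible -- at the cost of accuracy."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 13 chars -> estimate of 3
```

The ratio drifts a lot for code, non-English text, and whitespace-heavy input, so calibrate `chars_per_token` against real tiktoken counts for your workload before relying on it.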

hauntsaninja commented 8 months ago

I can't reproduce:

λ cat x.py
import psutil
process = psutil.Process()

def get_memory():
    print(process.memory_info().rss / 1000000)

get_memory()
import tiktoken
get_memory()
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
get_memory()
λ python x.py
15.097856
18.997248
78.41792

I tried on a couple of different systems; all gave roughly similar numbers.

Could you provide more information about your system configuration? (Also please don't post screenshots, it makes reproducing your code harder)
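For anyone reproducing this without psutil installed, a similar check can be done with only the standard library. This is a sketch: the `resource` module is Unix-only, and `ru_maxrss` is peak (not current) RSS, reported in kilobytes on Linux but bytes on macOS:

```python
import resource
import sys

def get_peak_memory_mb() -> float:
    """Peak RSS of the current process in MB, via the stdlib.
    ru_maxrss is in kilobytes on Linux but bytes on macOS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / 1_000_000
    return peak / 1_000

print(f"{get_peak_memory_mb():.1f} MB")
```

Because this reports the peak rather than the instantaneous value, it only ever increases, which is fine for spotting a large jump after importing or loading an encoding.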

GaurangTandon commented 8 months ago

Thanks @hauntsaninja, you're right. My comment was based on an older version of tiktoken and a different setting than a plain Python script. I will try to reproduce the issue again and post here (perhaps in a separate issue).

logan-markewich commented 8 months ago

@hauntsaninja sorry, I should have mentioned that the screenshot above was from Colab. Let me try again on a few other machines.

logan-markewich commented 8 months ago

@hauntsaninja here is a link to the colab, this reproduces every time I run it: https://colab.research.google.com/drive/1JnEHblVIPz534yRDcJFD-lTWspXyUKpy?usp=sharing

I also ran locally (macOS, M2 Pro Max, python3.11) and got a similar output

159.744
193.00352
914.06336 <--- !!!

logan-markewich commented 8 months ago

@hauntsaninja wow, I was missing a zero 😢 Turns out it's not a huge deal then, tbh. Closing out for now.