weiyuan-c closed this issue 10 months ago.
Sorry, I haven't encountered this issue in my own runs. You could try using short sentences as data and observing the tokenization process for debugging; see the sketch below.
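A minimal debugging sketch along those lines, assuming a local LLaMA checkpoint (the path below is a placeholder, not from this repo):

```python
from transformers import AutoTokenizer

# Placeholder path; point this at your own LLaMA checkpoint directory.
tok = AutoTokenizer.from_pretrained("path/to/llama-checkpoint", use_fast=False)

sentence = "USER: What is happening in the video? ASSISTANT: A person waves."
ids = tok(sentence, add_special_tokens=False).input_ids

# Decode each id on its own so unexpected splits are easy to spot.
for i in ids:
    print(i, repr(tok.decode([i])))
```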
I resolved this issue by switching the data preprocessing version from v1 to llama_2. I plan to spend some time understanding the differences between these two functions. Thank you for your response!
I encountered the same issue. How did you resolve it? What would be a possible solution?
I would like to share my observations here, though I cannot say with certainty that this is the root cause. A likely explanation is tokenization: the encoding of 'USER' can change when it is preceded by another special token, for example a leading newline ("\nUSER"). The sketch below illustrates this.
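A minimal sketch of that behavior, again assuming a local LLaMA checkpoint at a placeholder path. With a SentencePiece-based tokenizer, the same word can map to different ids depending on its left context:

```python
from transformers import AutoTokenizer

# Placeholder path; point this at your own LLaMA checkpoint directory.
tok = AutoTokenizer.from_pretrained("path/to/llama-checkpoint", use_fast=False)

# The ids emitted for "USER" differ depending on what precedes it.
for text in ["USER", "\nUSER", "ASSISTANT: hi</s>USER"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(repr(text), "->", ids)
```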
Can you please check the released code again, @huangb23?
@itruonghai @weiyuan-c I have identified the root cause of this issue, which seems to be related to recent updates in the Transformers library. As a temporary workaround, you can downgrade Transformers to an earlier version (4.31.0) with the following command: pip install transformers==4.31.0
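If it helps, here is a small guard one could add (my own suggestion, not part of the released code) to catch a mismatched Transformers version early rather than failing later with silently wrong tokenization:

```python
import transformers
from packaging import version

# Known-good version per the workaround above; adjust if the repo updates.
TESTED = "4.31.0"

if version.parse(transformers.__version__) != version.parse(TESTED):
    print(f"Warning: transformers {transformers.__version__} installed; "
          f"training was verified against {TESTED}.")
```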
Thank you for releasing the code. I have followed the instructions in the README file and downloaded the provided files, organizing them as suggested. However, I encountered some issues and have been unable to pinpoint the exact cause. Did you experience any similar bugs during implementation? I would greatly appreciate it if you could suggest any potential solutions or troubleshooting steps.