weiyuan-c closed this issue 10 months ago.
Sorry, I haven't encountered this issue in my own runs. You could try using short sentences as data and observing the tokenization process for debugging; see the sketch below.
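A minimal debugging sketch along those lines, assuming a local LLaMA checkpoint (the path below is a placeholder, not from this repo):

```python
from transformers import AutoTokenizer

# Placeholder path; point this at your own LLaMA checkpoint directory.
tok = AutoTokenizer.from_pretrained("path/to/llama-checkpoint", use_fast=False)

sentence = "USER: What is happening in the video? ASSISTANT: A person waves."
ids = tok(sentence, add_special_tokens=False).input_ids

# Decode each id on its own so unexpected splits are easy to spot.
for i in ids:
    print(i, repr(tok.decode([i])))
```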
I resolved this issue by switching the data preprocessing version from v1 to llama_2. I plan to spend some time understanding the differences between these two functions. Thank you for your response!
I encountered the same issue. How did you resolve it? What would be a possible solution?
I would like to share my observations here, though I cannot say with certainty that this is the root cause. A likely explanation is tokenization: the encoding of 'USER' can change when it is preceded by another special token, for example a leading newline ("\nUSER"). The sketch below illustrates this.
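A minimal sketch of that behavior, again assuming a local LLaMA checkpoint at a placeholder path. With a SentencePiece-based tokenizer, the same word can map to different ids depending on its left context:

```python
from transformers import AutoTokenizer

# Placeholder path; point this at your own LLaMA checkpoint directory.
tok = AutoTokenizer.from_pretrained("path/to/llama-checkpoint", use_fast=False)

# The ids emitted for "USER" differ depending on what precedes it.
for text in ["USER", "\nUSER", "ASSISTANT: hi</s>USER"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(repr(text), "->", ids)
```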
Can you please check the released code again, @huangb23?
@itruonghai @weiyuan-c I have identified the root cause of this issue, which seems to be related to recent updates in the Transformers library. As a temporary workaround, you can downgrade Transformers to an earlier version (4.31.0) with the following command: pip install transformers==4.31.0
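If it helps, here is a small guard one could add (my own suggestion, not part of the released code) to catch a mismatched Transformers version early rather than failing later with silently wrong tokenization:

```python
import transformers
from packaging import version

# Known-good version per the workaround above; adjust if the repo updates.
TESTED = "4.31.0"

if version.parse(transformers.__version__) != version.parse(TESTED):
    print(f"Warning: transformers {transformers.__version__} installed; "
          f"training was verified against {TESTED}.")
```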
Thank you for releasing the code. I have followed the instructions in the README file and downloaded the provided files, organizing them as suggested. However, I encountered some issues and have been unable to pinpoint the exact cause. Did you experience any similar bugs during implementation? I would greatly appreciate it if you could suggest any potential solutions or troubleshooting steps.