p3nGu1nZz / Tau

Tau LLM made with Unity 6 ML Agents
MIT License

Incorrect Pruning of Valid Messages During Data Cleaning Process #5

Status: Closed (p3nGu1nZz closed this issue 1 month ago)

p3nGu1nZz commented 1 month ago

Describe the bug: The data cleaning process prunes messages that should not be removed. As a result, valid messages are excluded from the training and evaluation datasets.

To Reproduce:

  1. Load the database using the command: database load
  2. Execute the data pruning command: data prune data.json
  3. Observe the logs and the resulting pruned data file.

Expected behavior: The pruning process should remove only messages that have no corresponding embedding in the database. Valid messages should remain in the training and evaluation datasets.
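
The expected behavior can be sketched roughly as follows. This is a minimal Python illustration, not the project's C# code: the function name `prune_messages`, the dictionary-based embedding store, and the sample messages are all assumptions made for the example.

```python
def prune_messages(messages, embeddings):
    """Split messages into (kept, pruned) by whether an embedding exists.

    `embeddings` is a hypothetical dict mapping message text to a vector;
    the real database layout in the project may differ.
    """
    kept, pruned = [], []
    for text in messages:
        # Only messages with no stored embedding should be pruned.
        (kept if text in embeddings else pruned).append(text)
    return kept, pruned


# Hypothetical sample data for illustration only.
embeddings = {"Hello, world!": [0.1, 0.2], "How are you?": [0.3, 0.4]}
messages = ["Hello, world!", "How are you?", "No embedding here."]

kept, pruned = prune_messages(messages, embeddings)
print(kept)
print(pruned)
```

Under this sketch, a message is pruned only when its exact text is missing from the embedding store, so any step that alters the text before the lookup can turn a valid message into a false "missing" result.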

Additional context: The issue seems to be related to the regex pattern used in the CleanPunctuationSpaces method, which modifies the middle of messages. FindEmbedding then fails to find the correct embeddings for the altered text, which leads to incorrect pruning.
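
To illustrate the kind of failure described here, the following Python sketch shows how an over-eager punctuation regex can break an exact-match embedding lookup. The actual pattern in CleanPunctuationSpaces is not shown in this issue, so both regexes below, the function names, and the sample data are hypothetical.

```python
import re


def clean_punctuation_spaces_overeager(text):
    # Hypothetical pattern: normalizes spacing around EVERY punctuation
    # mark, including ones in the middle of the message (e.g. "3.14").
    return re.sub(r"\s*([.,!?])\s*", r"\1 ", text).strip()


def clean_punctuation_spaces_trailing_only(text):
    # Hypothetical alternative: only removes whitespace before the
    # punctuation that ends the message, leaving the middle untouched.
    return re.sub(r"\s+([.,!?]+)$", r"\1", text.strip())


# Hypothetical embedding store keyed by the original message text.
embeddings = {"The value is 3.14 today.": [0.5, 0.6]}
message = "The value is 3.14 today."

overeager = clean_punctuation_spaces_overeager(message)
trailing_only = clean_punctuation_spaces_trailing_only(message)

print(repr(overeager), overeager in embeddings)
print(repr(trailing_only), trailing_only in embeddings)
```

In this sketch the over-eager version rewrites "3.14" as "3. 14", so the cleaned text no longer matches the stored key and the message would be pruned even though its embedding exists. Restricting the cleanup to the end of the message is one possible way to avoid that; whether it matches the fix applied in the project is not stated in the issue.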

p3nGu1nZz commented 1 month ago

removed from source