p3nGu1nZz / Tau

Tau LLM made with Unity 6 ML Agents
MIT License
11 stars, 4 forks

Missing Embeddings During Training Due to Out-of-Sync Data and Database #12

Closed. p3nGu1nZz closed 1 month ago

p3nGu1nZz commented 1 month ago

Describe the bug
Running the training agent tau {filename} command fails with errors caused by missing embeddings in the database. This happens because data.json is out of sync with the database.bin generated by the data load {filename} command. Until now we have worked around it by manually removing the problematic messages from data.json, which is not a sustainable solution.
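One way to make the mismatch visible is to compare the token strings referenced by the messages against the keys that actually have an embedding. The sketch below is illustrative only: the message layout (a list of records with a "tokens" list) and the in-memory dict standing in for database.bin are assumptions, not the project's real formats, and find_missing_tokens is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Set


def find_missing_tokens(
    messages: Iterable[dict],
    embeddings: Dict[str, List[float]],
) -> Set[str]:
    """Return token strings used by messages that have no stored embedding.

    Assumes (hypothetically) that each message is a dict with a "tokens" list.
    """
    referenced = {
        token
        for message in messages
        for token in message.get("tokens", [])
    }
    return referenced - set(embeddings)


# Hypothetical data standing in for data.json and database.bin.
messages = [
    {"tokens": ["hello", "world"]},
    {"tokens": ["hello", "tau"]},
]
embeddings = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}

missing = find_missing_tokens(messages, embeddings)
print(sorted(missing))
```

A non-empty result means the two files have drifted apart and training would hit a missing embedding for each listed token.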

To Reproduce
Steps to reproduce the behavior:

  1. Run the data load {filename} command to generate database.bin.
  2. Execute the training agent tau {filename} command.
  3. Observe the errors related to missing embeddings.

Expected behavior
The system should handle missing embeddings automatically by attempting to regenerate them for the affected token strings, rather than requiring manual edits to data.json.

Screenshots
N/A

Additional context
We have implemented prune and trim commands to clean the strings used in messages and token names in the database tables, but embeddings are still missing during training. We propose catching the error and, if it is caused by a missing embedding, repairing the table by regenerating the embeddings for the missing token strings.
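The proposed catch-and-repair flow could be sketched roughly as follows. Everything here is hypothetical: MissingEmbeddingError, lookup_with_repair, fake_embed, and the dict-backed table are placeholders for illustration and do not reflect Tau's actual classes or the database.bin format.

```python
from typing import Callable, Dict, List

Vector = List[float]


class MissingEmbeddingError(KeyError):
    """Hypothetical error raised when a token has no stored embedding."""


def lookup(table: Dict[str, Vector], token: str) -> Vector:
    """Return the stored embedding or raise MissingEmbeddingError."""
    try:
        return table[token]
    except KeyError:
        raise MissingEmbeddingError(token) from None


def lookup_with_repair(
    table: Dict[str, Vector],
    token: str,
    embed_token: Callable[[str], Vector],
) -> Vector:
    """Look up a token; on a miss, regenerate its embedding and store it."""
    try:
        return lookup(table, token)
    except MissingEmbeddingError:
        # Repair the table in place so later lookups succeed.
        table[token] = embed_token(token)
        return table[token]


def fake_embed(token: str) -> Vector:
    # Placeholder for the real embedding model: a toy two-number vector.
    return [float(len(token)), float(sum(map(ord, token)) % 7)]


table = {"hello": [0.1, 0.2]}
vector = lookup_with_repair(table, "tau", fake_embed)
```

Only the missing-embedding case is caught, so unrelated failures still surface instead of being silently "repaired".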

p3nGu1nZz commented 1 month ago

Making a DataAuditor static class to handle this.

p3nGu1nZz commented 1 month ago

Fixed with the DataAuditor.