Describe the bug
The data cleaning process prunes messages that should be kept. As a result, valid messages are excluded from the training and evaluation datasets.
To Reproduce
Steps to reproduce the behavior:
1. Load the database using the command: database load
2. Execute the data pruning command: data prune data.json
3. Observe the logs and the resulting pruned data file.
Expected behavior
The pruning process should only remove messages that do not have corresponding embeddings in the database. Valid messages should remain in the training and evaluation datasets.
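To make the expected behavior concrete, here is a minimal Python sketch of the contract described above. It is not the project's code: `prune` and `has_embedding` are hypothetical names, and the embedding check is modeled as a simple membership test.

```python
def prune(messages, has_embedding):
    """Split messages into (kept, removed).

    A message is removed only when has_embedding(message) is false.
    Both names are hypothetical and used for illustration only.
    """
    kept, removed = [], []
    for msg in messages:
        if has_embedding(msg):
            kept.append(msg)
        else:
            removed.append(msg)
    return kept, removed


# Assumed toy data: two messages have embeddings, one does not.
known = {"hello there", "how are you?"}
kept, removed = prune(
    ["hello there", "orphan message", "how are you?"],
    lambda m: m in known,
)
print(kept)
print(removed)
```

Under this contract, only "orphan message" should land in the removed list.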
Desktop (please complete the following information):
OS: Windows 11
Version: 0.1.0
Additional context
The issue appears to come from the regex pattern used in the CleanPunctuationSpaces method, which modifies text in the middle of messages. The modified text no longer matches what FindEmbedding looks up, so the method fails to find the correct embeddings and the messages are pruned incorrectly.
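The suspected failure mode can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the real regex is not shown in this report, the functions are hypothetical stand-ins for CleanPunctuationSpaces and FindEmbedding, and the lookup is assumed to be an exact match on message text.

```python
import re


def clean_everywhere(text):
    # Hypothetical stand-in for CleanPunctuationSpaces: the unanchored
    # pattern removes whitespace before punctuation anywhere in the text.
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


def clean_trailing_only(text):
    # Assumed alternative: only touch whitespace before punctuation that
    # ends the message, leaving the middle of the message unchanged.
    return re.sub(r"\s+([.,!?;:]+)$", r"\1", text)


def find_embedding(store, text):
    # Hypothetical stand-in for FindEmbedding: exact-match lookup by text.
    return store.get(text)


message = "wait , what happened ?"

# Assumption: the embedding was stored under a key whose middle was left
# untouched, so only the trailing punctuation was tidied.
store = {"wait , what happened?": [0.1, 0.2]}

miss = find_embedding(store, clean_everywhere(message))
hit = find_embedding(store, clean_trailing_only(message))
print(miss)  # None: the middle of the message changed, so the key differs
print(hit)   # [0.1, 0.2]
```

If the real code behaves like `clean_everywhere`, any message with a space before punctuation in its middle would miss its embedding and be pruned, which matches the symptom described here.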