**Is your feature request related to a problem? Please describe.**
We need a script that performs PCA on our embeddings to reduce their dimensionality. This will help us manage our high-dimensional vector space more efficiently.
**Describe the solution you'd like**
A Python script named `reduce.py` that does the following (a rough sketch is included below):
- Takes a JSON file containing the vocabulary and embeddings as input.
- Performs PCA to reduce the dimensionality of the embeddings.
- Saves the reduced embeddings to a new JSON file.
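As a starting point only: the input/output JSON layout (`vocab` and `embeddings` keys), the CLI arguments, and the default target dimensionality in the sketch below are assumptions, not decided details. Something along these lines using scikit-learn's `PCA`:

```python
# Hypothetical sketch of reduce.py -- key names, flags, and defaults are assumptions.
import argparse
import json

from sklearn.decomposition import PCA


def main() -> None:
    parser = argparse.ArgumentParser(description="Reduce embedding dimensionality with PCA.")
    parser.add_argument("input_json", help="JSON file with 'vocab' and 'embeddings' keys (assumed layout)")
    parser.add_argument("output_json", help="Where to write the reduced embeddings")
    parser.add_argument("--dimensions", type=int, default=50, help="Target dimensionality (placeholder default)")
    args = parser.parse_args()

    # Load the vocabulary and the embedding matrix from the input file.
    with open(args.input_json, "r", encoding="utf-8") as f:
        data = json.load(f)
    vocab = data["vocab"]            # list of tokens (assumed key name)
    embeddings = data["embeddings"]  # list of vectors, one per token (assumed key name)

    # Fit PCA on the full matrix and project it down to the requested dimensionality.
    pca = PCA(n_components=args.dimensions)
    reduced = pca.fit_transform(embeddings)

    # Write the reduced embeddings alongside the unchanged vocabulary.
    with open(args.output_json, "w", encoding="utf-8") as f:
        json.dump({"vocab": vocab, "embeddings": reduced.tolist()}, f)


if __name__ == "__main__":
    main()
```

It would then be invoked from inside the venv with something like `python reduce.py embeddings.json reduced.json --dimensions 50` (file names illustrative).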
Additionally, we need a batch script that calls `reduce.py` in the correct virtual environment, similar to how we handle `encoder.py`. The batch script should ensure the environment is set up correctly and that `reduce.py` is executed with the appropriate arguments.
**Describe alternatives you've considered**
- Implementing PCA directly in C#, but using Python with scikit-learn is more efficient and easier to manage.
- Using incremental PCA, but our dataset size allows us to process it in one go (a sketch of this fallback follows the list).
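For completeness, the incremental route we are not taking would look roughly like the sketch below, using chunked `partial_fit` with scikit-learn's `IncrementalPCA`; the batch size and component count are illustrative assumptions. It only becomes worth the extra complexity if the embedding matrix stops fitting in memory.

```python
# Sketch of the incremental alternative (not planned for now): fit PCA on
# chunks so the full embedding matrix never has to be processed at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA


def reduce_incrementally(embeddings, n_components=50, batch_size=1000):
    data = np.asarray(embeddings, dtype=np.float64)
    ipca = IncrementalPCA(n_components=n_components)

    # First pass: update the PCA model one chunk at a time.
    # Each chunk must contain at least n_components rows, so a short tail chunk is skipped.
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        if len(chunk) >= n_components:
            ipca.partial_fit(chunk)

    # Second pass: project every chunk and stack the results.
    return np.vstack([
        ipca.transform(data[start:start + batch_size])
        for start in range(0, len(data), batch_size)
    ])
```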
**Additional context**
See `encoder.py` for an example of how our Python scripts work. We call our Python scripts via batch files (`encoder.bat`, `setenv.bat`, `activate.bat`); the environment setup for the encoder looks like this:
```bat
@echo off

:: Project root (one level above this script) and temp directory
set "PROJECT_ROOT=%~dp0\.."
set "TEMP_DIR=%PROJECT_ROOT%\.temp"

:: Python installer and install location
set "PYTHON_INSTALLER=python-3.10.11-amd64.exe"
set "PYTHON_DIR=%USERPROFILE%\.python\Python310"

:: Virtual environment
set "VENV_NAME=ml-agents"
set "VENV_DIR=%PROJECT_ROOT%\venv\%VENV_NAME%"

:: Helper batch scripts
set "ACTIVATE_SCRIPT=%~dp0activate.bat"
set "DEACTIVATE_SCRIPT=%~dp0deactivate.bat"
set "CLEAN_SCRIPT=%~dp0clean.bat"
set "UTILITIES_SCRIPT=%~dp0utilities.bat"

:: ml-agents repo and package paths
set "ML_AGENTS_DIR=%PROJECT_ROOT%\ml-agents"
set "ML_AGENTS_ENVS_INSTALL=%ML_AGENTS_DIR%\ml-agents-envs"
set "ML_AGENTS_INSTALL=%ML_AGENTS_DIR%\ml-agents"
```
We also need to update our `data load` terminal command in our runtime. The reduction step will run after we extract the vocab but before we build the training and evaluation data tables.