**Is your feature request related to a problem? Please describe.**
We need a script that performs PCA on our embeddings to reduce their dimensionality. This will help us manage our high-dimensional vector space more efficiently.
**Describe the solution you'd like**
A Python script named `reduce.py` that does the following (a rough sketch is included below):
- Takes a JSON file containing the vocabulary and embeddings as input.
- Performs PCA to reduce the dimensionality of the embeddings.
- Saves the reduced embeddings to a new JSON file.
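As a starting point only: the input/output JSON layout (`vocab` and `embeddings` keys), the CLI arguments, and the default target dimensionality in the sketch below are assumptions, not decided details. Something along these lines using scikit-learn's `PCA`:

```python
# Hypothetical sketch of reduce.py -- key names, flags, and defaults are assumptions.
import argparse
import json

from sklearn.decomposition import PCA


def main() -> None:
    parser = argparse.ArgumentParser(description="Reduce embedding dimensionality with PCA.")
    parser.add_argument("input_json", help="JSON file with 'vocab' and 'embeddings' keys (assumed layout)")
    parser.add_argument("output_json", help="Where to write the reduced embeddings")
    parser.add_argument("--dimensions", type=int, default=50, help="Target dimensionality (placeholder default)")
    args = parser.parse_args()

    # Load the vocabulary and the embedding matrix from the input file.
    with open(args.input_json, "r", encoding="utf-8") as f:
        data = json.load(f)
    vocab = data["vocab"]            # list of tokens (assumed key name)
    embeddings = data["embeddings"]  # list of vectors, one per token (assumed key name)

    # Fit PCA on the full matrix and project it down to the requested dimensionality.
    pca = PCA(n_components=args.dimensions)
    reduced = pca.fit_transform(embeddings)

    # Write the reduced embeddings alongside the unchanged vocabulary.
    with open(args.output_json, "w", encoding="utf-8") as f:
        json.dump({"vocab": vocab, "embeddings": reduced.tolist()}, f)


if __name__ == "__main__":
    main()
```

It would then be invoked from inside the venv with something like `python reduce.py embeddings.json reduced.json --dimensions 50` (file names illustrative).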
Additionally, we need a batch script that calls `reduce.py` in the correct virtual environment, similar to how we handle `encoder.py`. The batch script should ensure the environment is set up correctly and that `reduce.py` is executed with the appropriate arguments.
**Describe alternatives you've considered**
- Implementing PCA directly in C#, but using Python with scikit-learn is more efficient and easier to manage.
- Using incremental PCA, but our dataset size allows us to process it in one go (a sketch of this fallback follows the list).
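For completeness, the incremental route we are not taking would look roughly like the sketch below, using chunked `partial_fit` with scikit-learn's `IncrementalPCA`; the batch size and component count are illustrative assumptions. It only becomes worth the extra complexity if the embedding matrix stops fitting in memory.

```python
# Sketch of the incremental alternative (not planned for now): fit PCA on
# chunks so the full embedding matrix never has to be processed at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA


def reduce_incrementally(embeddings, n_components=50, batch_size=1000):
    data = np.asarray(embeddings, dtype=np.float64)
    ipca = IncrementalPCA(n_components=n_components)

    # First pass: update the PCA model one chunk at a time.
    # Each chunk must contain at least n_components rows, so a short tail chunk is skipped.
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        if len(chunk) >= n_components:
            ipca.partial_fit(chunk)

    # Second pass: project every chunk and stack the results.
    return np.vstack([
        ipca.transform(data[start:start + batch_size])
        for start in range(0, len(data), batch_size)
    ])
```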
**Additional context**
See `encoder.py` for an example of how our Python scripts work. We call our Python scripts via batch files (`encoder.bat`, `setenv.bat`, `activate.bat`); the environment setup for the encoder looks like this:
```bat
@echo off

:: Project root (one level above this script) and temp directory
set "PROJECT_ROOT=%~dp0\.."
set "TEMP_DIR=%PROJECT_ROOT%\.temp"

:: Python installer and install location
set "PYTHON_INSTALLER=python-3.10.11-amd64.exe"
set "PYTHON_DIR=%USERPROFILE%\.python\Python310"

:: Virtual environment
set "VENV_NAME=ml-agents"
set "VENV_DIR=%PROJECT_ROOT%\venv\%VENV_NAME%"

:: Helper batch scripts
set "ACTIVATE_SCRIPT=%~dp0activate.bat"
set "DEACTIVATE_SCRIPT=%~dp0deactivate.bat"
set "CLEAN_SCRIPT=%~dp0clean.bat"
set "UTILITIES_SCRIPT=%~dp0utilities.bat"

:: ml-agents repo and package paths
set "ML_AGENTS_DIR=%PROJECT_ROOT%\ml-agents"
set "ML_AGENTS_ENVS_INSTALL=%ML_AGENTS_DIR%\ml-agents-envs"
set "ML_AGENTS_INSTALL=%ML_AGENTS_DIR%\ml-agents"
```
We also need to update our `data load` terminal command in our runtime. The reduction step will run after we extract the vocab but before we build the training and evaluation data tables.