ngtban/wavenet_de_data_prep

What does this repository do?

This repository extracts asset data, namely voice clips and their corresponding transcriptions, from the game Disco Elysium (specifically the "Final Cut" version) and reformats the extracted data into a format that ESPnet understands and can use to train a vocoder.

My goal is, at the very least, to 1) have a WaveNet-based vocoder with the characteristics of the narrator in the "Final Cut" version of the game, and 2) package and publish the vocoder as a mobile app, as the open-source ones I have found so far are not really great.

To these ends, I intend to have three repositories:

One to extract the dialogue data and audio clips and match them together in a format understood by ESPnet for training, which is this repository.
One dedicated to problems that arise when training the vocoder.
One (or maybe two, one for each of the currently dominant mobile platforms) for packaging and publishing the vocoder on mobile.

I put the code for preparing the data into Mix tasks. You can check them under mix/tasks to see the details.

Why?

I love the game and the voice of its narrator, and perhaps out of vanity I think I could do better than the open-source text-to-speech solutions currently available on mobile platforms.

Cautions

This project is written with the Final Cut version of the game in mind, specifically version 2832f901, released on 2021-04-19. I cannot ensure that the code works correctly for earlier or later versions. In fact, I have tried using this repository on a later version and things no longer work. For now, to use this repository you will need a program that can download version 2832f901 of the game, for example DepotDownloader.

Please also note that you will need around 65GB of free disk space to store the extracted audio clips.

Note on work in progress

So far I have only completed two mix tasks doing the following:

Extracting conversation, dialogue entry, actor, and item data from the dialogue bundle.
Matching the extracted audio clips with the extracted dialogue entries.

I still need to implement two other mix tasks doing the following:

Converting the matches into a CSV file following the LJ Speech (LJSpeech) format for training.
Putting everything into a single place so that with a single invocation we can generate the csv file needed for training.
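
As a rough illustration of the target file, and assuming the format meant here is the one used by the LJ Speech dataset, each line of its metadata.csv holds three pipe-separated fields: the clip id (the audio file name without its extension), the transcript, and a normalized transcript. The function name, clip id, and transcript below are hypothetical and are not part of the unfinished mix task:

```shell
# Print one metadata line: <clip id>|<transcript>|<normalized transcript>.
# No text normalization is attempted here, so the transcript is repeated.
metadata_line() {
  clip_id=$1
  text=$2
  printf '%s|%s|%s\n' "$clip_id" "$text" "$text"
}

# Hypothetical clip id and transcript, for illustration only.
metadata_line "clip_0001" "An example line of dialogue."
```

A transcript that itself contains a `|` character would break this layout, so a real conversion would need to strip or replace it first.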

If you still want to check out the finished mix tasks, please follow the instructions in the sections below for setting up the repository and running those tasks.

Getting the project up and running

Should you wish to try out the code in this repo, please follow the instructions in the sections below:

Prerequisites

You should have these installed:

Elixir 1.14.0
Erlang OTP 25.1
PostgreSQL 13.3

I cannot guarantee that the code works with older versions of the tools listed above.
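
A quick way to see which of these tools are on your PATH, and in which versions, is sketched below. The helper name have_tool is made up for this example. The sketch only reports what it finds and does not enforce the versions listed above:

```shell
# Report whether a command is available on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in elixir erl psql; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# `elixir --version` prints both the Erlang/OTP and the Elixir version.
if have_tool elixir; then
  elixir --version
fi
if have_tool psql; then
  psql --version
fi
```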

Please also make sure that you have around 65GB of free disk space for the audio clips.
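
The free space can be checked up front with a small sketch like the one below. The function name has_free_space is hypothetical, and the check runs against the current directory, so run it from the drive where you plan to export the clips:

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
has_free_space() {
  dir=$1
  needed_gb=$2
  # `df -Pk` prints POSIX-format output in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

if has_free_space . 65; then
  echo "Enough free space for the audio clips."
else
  echo "Less than 65 GB free on this filesystem."
fi
```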

Fetching dependencies

Create a database.exs file under the folder config of the repository. The content of the file should look like this:

import Config

config :data_prepration, Elysium.Repo,
  database: "elysium",
  username: "<Your Database Username Here>",
  password: "<Your Database Password Here>",
  hostname: "localhost",
  log: :info # Change this to false to mute ecto debug logs. Keep it otherwise.

Then run mix deps.get to install dependencies of the project. Note that the file database.exs is necessary for setting up the database as well.

Setting up a database connection

Make sure that you have created a user within PostgreSQL using the credentials in the file database.exs. Then run these commands to set up the database:

mix ecto.create
mix ecto.migrate
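
If the PostgreSQL user does not exist yet, it has to be created before the two commands above are run. A minimal sketch follows. The role name and password are placeholders, and the superuser name postgres is an assumption about your installation:

```shell
# Hypothetical credentials: use the username and password from config/database.exs.
DB_USER="elysium_user"
DB_PASS="change-me"

# CREATEDB allows `mix ecto.create` to create the database as this role.
SQL="CREATE ROLE \"$DB_USER\" WITH LOGIN CREATEDB PASSWORD '$DB_PASS';"
echo "$SQL"

# To apply it, run the statement as a PostgreSQL superuser
# (assumed here to be named `postgres`):
#   psql -U postgres -c "$SQL"
```

The sketch only prints the statement, so you can review it before running it against your server.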

Using `mix` tasks to prepare the extracted data for training

Extracting dialogue data and audio clips from the asset files

You will need to use Asset Studio to extract data from the asset files. Please purchase a copy of the game. I can give you a copy of the extracted data and the generated database as well if you cannot buy the game for some reason.

Extracting the dialogue bundle from the assets of the game

Locate your local installation of the game.
Open Asset Studio.
Load the file at <game root>/disco_Data/StreamingAssets/aa/StandaloneWindows64/dialoguebundle_assets_all_<some hash>.bundle.
Export all the assets you see in Asset Studio. There should only be one asset containing the bundled dialogue data.

You should see the folder MonoBehaviour within the location you chose in step #4.

Extracting audio clips from the assets of the game

Please make sure that you have the free disk space needed to store the audio clips. You should have around 65GB free.
Open Asset Studio.
Load the folder at <game root>/disco_Data/StreamingAssets/aa/StandaloneWindows64/.
Filter the assets by type, making sure that only AudioClip is checked.
Export the files to a folder of your choice. It will take a while.
You should see a new folder AudioClip within the folder you chose that contains all of the audio clips.
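
To sanity-check the export, you can count the exported files and measure the folder size. In the sketch below the helper name count_files is hypothetical, and the path is only an example location that you should replace with the folder you chose:

```shell
# Count regular files directly or indirectly under the given folder.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Example export location; replace with the folder chosen in Asset Studio.
EXPORT_DIR="/extracted_assets/AudioClip"
if [ -d "$EXPORT_DIR" ]; then
  echo "Exported files: $(count_files "$EXPORT_DIR")"
  du -sh "$EXPORT_DIR"
else
  echo "Folder not found: $EXPORT_DIR"
fi
```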

Extracting conversation, dialogue entry, actor, and item data from the dialogue bundle

Run this command:

mix prepare_bundle <path to the dialogue bundle json file>

For example:

mix prepare_bundle '/extracted_assets/MonoBehaviour/Disco Elysium.json'

After running this task, you should see that the database configured in the file database.exs is populated with conversation, dialogue entry, actor, and item data.

Matching the extracted audio clips with the extracted dialogue entries

Run this command:

mix label_audio_clips <path to the folder containing the audio clips>

For example:

mix label_audio_clips '/extracted_assets/AudioClip'

After running this task, you should see the configured database is populated with audio clip metadata, in the table audio_clips.

Feedback

If you are interested in contributing or reporting bugs, please check the issue list. Constructive feedback is appreciated.