Build a database of the audio clips for later packaging into the Kaldi data structure used to train the vocoder

ngtban commented 3 years ago

I need to build a database of the audio clips and figure out the speaker of each clip.

Generally the name of each asset follows this format:

alternative-(Alternative Number)-(Character Name/Skill/Object under Examination)-(Location of Conversation) (Point of Focus) -- (Alternative Marking)-(Conversation Node)

The "alternative" prefix, the alternative marking, and the location of conversation are all optional.
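A minimal sketch of how such names could be split into fields, assuming the speaker name itself never contains a hyphen and the conversation node is always a trailing number. The regular expression and the function name `parse_asset_name` are my own, not something taken from the game data. The location, point of focus and alternative marking are left together in one `conversation` field, since I have not pinned down exactly how they are delimited.

```python
import re
from typing import Optional

# Assumed pattern: [alternative-<n>-]<speaker>[-<location etc.>]-<node>
ASSET_NAME = re.compile(
    r"^(?:alternative-(?P<alt_number>\d+)-)?"  # optional "alternative" prefix
    r"(?P<speaker>[^-]+)"                      # character / skill / object
    r"(?:-(?P<conversation>.+))?"              # optional location, focus, marking
    r"-(?P<node>\d+)$"                         # conversation node
)


def parse_asset_name(name: str) -> Optional[dict]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    match = ASSET_NAME.match(name.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["node"] = int(fields["node"])
    if fields["alt_number"] is not None:
        fields["alt_number"] = int(fields["alt_number"])
    return fields


parsed = parse_asset_name("Kim Kitsuragi-YARD TRASH-390")
print(parsed["speaker"], "|", parsed["conversation"], "|", parsed["node"])
```

Names that do not end in a node number (sound effects, for instance) come back as `None`, so they can be collected separately and inspected by hand.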

ngtban commented 3 years ago

Some observations about the dataset:

Sometimes there is no location indicator; this happens when the speaker stays in a single place.
Speech is recorded at 48 kHz, while sound tracks seem to be at 44.1 kHz? There are only two tracks in the mix, and their names both start with a number.
There are clips that do not correspond to any speaker; those are sound effects. Their names all seem to be in lower case.
One problem is that the dialogue pieces corresponding to the narrator are spread among locations and objects under examination.
I need to stitch the dialogue text to the dialogue node somehow. The Disco Reader source might be of help.
However, the audio clips extracted with Asset Studio lose all metadata other than the name.
Asset Studio does not seem to offer any kind of API?
Sometimes the devs break their naming convention, and a piece using the narrator's voice ends up under a character's dialogue. One example is "Dolores Dei-DREAM SEAFORT DOLORES DEI-102".
There is around 61.2 GB of audio data. I wonder how much of that is the narrator's clips.
Some of the clips are silent, for example "Dolores Dei-DREAM SEAFORT DOLORES DEI-475" and "Dolores Dei-DREAM SEAFORT DOLORES DEI-477".
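Silent clips like these could be filtered out automatically. Below is a rough sketch that assumes the clips have been exported as 16-bit PCM WAV files, which I have not verified for every asset. It uses only the standard library; `peak_amplitude`, `is_silent` and the `threshold` parameter are names I made up for illustration.

```python
import array
import sys
import wave


def peak_amplitude(path: str) -> int:
    """Return the largest absolute sample value in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return max((abs(sample) for sample in samples), default=0)


def is_silent(path: str, threshold: int = 0) -> bool:
    """Treat a clip as silent if no sample exceeds the threshold."""
    return peak_amplitude(path) <= threshold
```

A threshold of 0 only catches exact digital silence; a small positive value may be needed if the silent clips contain low-level noise.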

ngtban commented 3 years ago

More observations after extracting data from the dialogue bundle

Dialogue entries:

Those with actor being 0 do not have dialogue text, so I don't have to care about them.
I can tell who the actual speaker is based on whether the dialogue text is put in quotes or not.
Some dialogue entries belonging to a speaker other than Harry or the narrator are not wrapped in quotes, however.
Entries seem to always have the correct actor labelled.
Dialogue entries with a book, a skill, or an inanimate object as their actor are always narrated.
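A small sketch of the quote check. My reading is that quoted text is spoken in the character's own voice and unquoted text is narrated, but as noted above the rule has exceptions, so treat the result as a hint only. The function names and the set of quote characters are assumptions; I only test for an opening quote, so a line with a missing closing quote is still counted as quoted.

```python
# Straight and curly double quotes; which ones the data uses is an assumption.
QUOTE_CHARS = ('"', "\u201c", "\u201d")


def starts_with_quote(text: str) -> bool:
    """True if the dialogue text opens with a double quote."""
    return text.strip().startswith(QUOTE_CHARS)


def has_unbalanced_quotes(text: str) -> bool:
    """True if the text has an odd number of straight double quotes."""
    return text.count('"') % 2 == 1


for line in ['"Hello."', "The door creaks.", '"Hello.']:
    print(repr(line), starts_with_quote(line), has_unbalanced_quotes(line))
```

Entries flagged by `has_unbalanced_quotes` are good candidates for manual review rather than automatic labelling.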

Actor ids

1 to 145 are humans
146 to 153 are inanimate objects
154 to 204 are books
205 to 386 are again inanimate objects
387 is Harry
388 is branch marker?
389 to 420 are skills
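The ranges above can be turned into a lookup. This is only a sketch: the range boundaries come from my notes, while the kind labels and the function names `classify_actor` and `is_narrated` are my own. The narration rule encodes the earlier observation that entries whose actor is a book, a skill, or an inanimate object are always narrated.

```python
from typing import Optional

# (first id, last id, kind), taken from the ranges listed above.
ACTOR_RANGES = [
    (1, 145, "human"),
    (146, 153, "object"),
    (154, 204, "book"),
    (205, 386, "object"),
    (387, 387, "harry"),
    (388, 388, "branch_marker"),  # still unsure what this actor is
    (389, 420, "skill"),
]

NARRATED_KINDS = {"book", "skill", "object"}


def classify_actor(actor_id: int) -> Optional[str]:
    """Return the kind of actor, or None for ids outside the known ranges."""
    for first, last, kind in ACTOR_RANGES:
        if first <= actor_id <= last:
            return kind
    return None


def is_narrated(actor_id: int) -> bool:
    """True if entries for this actor are expected to use the narrator's voice."""
    return classify_actor(actor_id) in NARRATED_KINDS


print(classify_actor(215), is_narrated(215))
```

Actor id 0 falls outside every range and returns `None`, which matches the entries that carry no dialogue text.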

The audio clips themselves

There are audio clips that have a corresponding dialogue entry, but the dialogue text is empty. Those are probably legacy entries. One example is the dialogue entry with id = 981, conversation_id = 995.
Some audio files do not correspond to any dialogue entry. One example is "Inland Empire-WHIRLING F2 DREAM 2 INTRO-40".
The transcriptions for thoughts actually lie in the description column of the items table.
I can't find the transcription for the joke (newspaper) endings anywhere in the dialogue data.

ngtban commented 3 years ago

Audio clips whose corresponding dialogue entries have no dialogue text

There are around 640 dialogue entries that should have dialogue text, as each of them has an audio file associated with it. To get the transcription for those entries, I would need to navigate the conversation graph and follow the condition-checking logic for each node, which I think is rather complicated and not worth the time. I believe the rest of the audio clips are enough to train the vocoder.

ngtban commented 3 years ago

I'm seeing cases where audio clips are labelled with incorrect actors. One example is the audio clip named "Kim Kitsuragi-YARD TRASH-390". Node 390 in that conversation has actor id 215 (the Trash Container), but the actual speaker is Kim!
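Mismatches like this could be listed automatically by comparing the speaker in the clip name with the actor name of the matching dialogue entry. The sketch below assumes the clips have already been joined to their entries; `entry_actor_names` (clip name to actor name) is a hypothetical structure, not something the extracted data provides directly. The speaker is taken as everything before the first hyphen, so clips with the "alternative" prefix would need that prefix stripped first.

```python
from typing import Dict, Iterable, List, Tuple


def find_actor_mismatches(
    clip_names: Iterable[str],
    entry_actor_names: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (clip name, speaker from name, labelled actor) for each disagreement."""
    mismatches = []
    for clip_name in clip_names:
        speaker = clip_name.split("-", 1)[0].strip()
        labelled = entry_actor_names.get(clip_name)
        if labelled is None:
            continue  # no dialogue entry known for this clip
        if labelled.casefold() != speaker.casefold():
            mismatches.append((clip_name, speaker, labelled))
    return mismatches


clips = ["Kim Kitsuragi-YARD TRASH-390"]
actors = {"Kim Kitsuragi-YARD TRASH-390": "Trash Container"}
for clip, speaker, labelled in find_actor_mismatches(clips, actors):
    print(f"{clip}: name says {speaker}, entry says {labelled}")
```

When the two disagree, the clip name looks like the more trustworthy source for the voice, at least in the Kim example, but I have not checked whether that holds in general.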

ngtban commented 2 years ago

Looks like the writers forgot to add matching quotes in some dialogue entries. One example is the dialogue entry with conversation id 23, dialogue entry id 662.

ngtban commented 2 years ago

I found where the joke endings are stored: they are all within the conversation with id 1427. None of its dialogue entries have text; the content is stored within scripts instead. I might need a mini parser.

ngtban / wavenet_de_data_prep