Thanks, Anselm. I'm not against splitting things up in principle. However, for reusability and modularity, wouldn't it be better to try to split the class into separate reusable classes (using OOP principles)?
In this PR, individual properties (templates) have been separated from their related functionality (factories), but the templates themselves are not really reusable as they depend on factories that are able to encode them and generate the actual animation sequence or pose. Of course, besides reusability, there might be some other reasons to separate the templates, for example, to replace them with one's own, but in the most likely scenario the app wants to keep the existing templates and just add its own custom poses and moods.
That said, I'm not sure how many reusable classes it would be possible to extract from the TalkingHead class. When the system to be modeled is complex, the model needs to reflect that complexity. Often, this means that you need to add more interdependencies and connections, which makes the splitting and reuse more difficult. There is already a lot of overlap between moods, poses and animations, and for the avatar to act more realistically, this overlap should be increased, not decreased.
Before this, I didn't have any plans to divide the class, so I don't have a specific class/component diagram in mind. I need to think about this some more. If you have any ideas about how to split the class into several classes, I would very much like to hear your thoughts.
I should also point out that the TalkingHead class doesn't use any LLMs. And if you would like to use a different TTS, the class has an interface for that integration. There are already projects/apps using the class with Google, Microsoft, OpenAI, and ElevenLabs TTS. Sure, Google TTS support is there as a default for simple web projects, but it doesn't have to be used, so no need to swap anything out. Replacing or abstracting away Three.js would be difficult because function calls to it cover such a big part of the code.
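For example, feeding the class audio from an external TTS looks roughly like this (a simplified, untested sketch; please check the README for the exact option and field names):

```js
import { TalkingHead } from "talkinghead";

// Simplified sketch; option and field names are from memory, see the README.
const head = new TalkingHead(document.getElementById("avatar"), {
  ttsEndpoint: "/gtts/",        // Google TTS proxy, only used by speakText()
  lipsyncModules: ["en"]
});
await head.showAvatar({ url: "avatar.glb" });

// External TTS integration: synthesize the audio elsewhere and pass it in
// together with word timings; the class handles lip-sync and playback.
head.speakAudio({
  audio: myAudioBuffer,         // decoded AudioBuffer from your TTS of choice
  words: ["Hello", "there"],
  wtimes: [0, 420],             // word start times in milliseconds
  wdurations: [400, 380]
});
```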
The 'TalkingHead' class should represent one talking avatar. For multiple avatars to share a scene, there should be several instances of the TalkingHead class sharing a scene. In some sense, if you ignore the naming, this PR moves in that direction by separating the camera and lights from the actor, but I don't think inheritance (extension) is the right way to relate the scene and the actors as the relationship is one-to-many. I'm not even convinced that the scene should be in the scope of the project (except in the same sense as Google TTS).
I really appreciate the time you have spent on this, and I hope that my comments don't discourage you. This is just a hobby project for me, but I used to work as a software architect for many years, and old habits die hard.
Thanks for your incredibly detailed and thoughtful response. I'm definitely open to other ways to split this up. Note that I'm not at all attached to subclassing as a pattern; it was just a form of 'origami', reorganizing the code without breaking any of the existing outside code. I do agree it is reasonable to concentrate moods, animations, and poses into one 'module'.
It does feel like it is worth trying to find the right approach, even if it turns out to be a different approach, because I do see this library as best of breed. In particular, the phoneme-to-viseme work feels accurate and visually satisfying. I could see this becoming a de facto foundation or community resource that everybody supports, even commercial organizations. I just wanted to try to get there stepwise, rather than breaking everything or massively refactoring in one step.
I'm imagining there is a larger community that may join in and have strong opinions as well. There are probably many indie game developers, hobby programmers, and even commercial developers who would love to have a drop-in embodied agent, so there is a large potential community. These people typically have their own startup logic for building a scene and often (at least aspirationally) are looking to build multiplayer experiences, where a digital agent like this has to have its actions multicast across a network.
Generally speaking, right now index.html makes an instance of TalkingHead, and that instance then acts as an interface for many capabilities. It's actually nice for index.html to have a single handle on a single object that it can use to drive the scene, but that concept isn't a TalkingHead; it's a superset of a talking head that includes a renderer and a few other things. What I moved to TalkingBase is more focused on driving the actual animation engine, but even then it's not exactly the approach I would take if I were refactoring.
In any case, from my point of view, I see these parts (if I were going to try a larger refactor in one go):
1) Reasoning 2) TTS 3) Playback engine (TalkingHead) 4) Renderer
Here's my thinking on this:
1) A 'reasoning' set of capabilities. These are largely outside of the library itself (as you have done, in fact) and focus on the usual topics such as providing UX for a user text prompt, piping it to some kind of LLM or reasoning module, and getting an 'utterance' to speak. It is nice to include some interfaces to talk to various services. This is probably OK as is (mostly embodied in your index, minimal, and mp3 HTML files). I would probably move these out of index.html into some set of extras or features, or some kind of pulled-in resource that people can reuse.
2) A 'TTS' set of capabilities. TTS is special for a couple of reasons. One key reason is that it has to break up the utterance into sentences and then pass each fragment to the playback engine as fast as possible. That means that what was a single flow of activity arriving at the TTS is now multiple events coming out of the TTS. This queue should be built as fast as possible, accumulating before the first sentence has finished playing back.
I do feel that the first-pass fragmentation of an utterance into sentences, and passing each sentence to a TTS module, should be done outside of the TalkingHead component itself, in some kind of server-side or wrapper code. I think the TalkingHead component should be fed packets/datagrams or messages which it can queue for playback (as you do now), but the TTS itself should be separated, code-wise, from TalkingHead. It would live in a 'server-side' wrapper of some kind that is tightly coupled to the playback engine but isn't the playback engine itself.
Note that for TTS I've been using a customized version of Coqui which can output phonemes at timestamps as well as a wav file for the audio (https://github.com/coqui-ai/TTS), and I've also tried Amazon Polly, Google TTS, Deepgram, and others. In a multiplayer game these operations are typically done on the server and then multicast to all instances.
The output of the TTS should be a series of datagrams, messages, or method calls passed to the playback engine. I imagine each one includes: sentences with timestamps per word (or alternatively phonemes at timestamps), a chunk of audio to play back, the text of the sentence being uttered, an interaction identifier, a sequence identifier, and, as a nice-to-have, the total audio duration (see the sketch after this list).
As a lower priority it might also make sense to support the built-in TTS in the browser (although it has a couple of serious defects: only one voice can be played at a time, which precludes having multiple digital agents speaking at once, and it cannot return an estimate of total duration). In that light, however, it might make sense to also include the utterance text of the sentence in the datagram.
3) A 'playback' set of capabilities. In my mind a client is a playback service; it is driven by datagrams/events/messages arriving from the outside world. Right now the playback engine itself calls out to the TTS, and I feel that should be separated out.
4) Renderer setup and orchestration. Right now the TalkingHead module itself starts up Three.js. While it would be trivial for any third party to move this out, we do see that the core 'TalkingHead' service would like to know where the viewer is (the camera, in this case), and it also would like to know the scene boundaries (for fitting the puppet to the screen). It might make sense to move fitting out of the core, and it might make sense to explicitly tell TalkingHead where the viewer is (although I think it's harmless to pass it a volatile object with a constantly changing location). In a multiplayer game the viewer won't be the camera, of course.
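To make the hand-off from 2) to 3) concrete, here's a rough sketch of the kind of datagram and wrapper loop I have in mind. This is illustrative only; the field names and the synthesize/send helpers are placeholders, not a proposal for the actual API:

```js
// Illustrative only: a hypothetical TTS-to-playback message format,
// not the existing TalkingHead API.
function makeSpeechPacket({ interactionId, seq, sentence, tts }) {
  return {
    interaction: interactionId,   // groups all sentences of one utterance
    seq,                          // playback order within the interaction
    text: sentence,               // useful for subtitles or a browser TTS fallback
    audio: tts.audio,             // encoded audio chunk (e.g. wav/mp3 bytes)
    words: tts.words,             // words with start times / durations
    phonemes: tts.phonemes,       // alternative: phonemes at timestamps
    duration: tts.duration        // total audio duration, nice to have
  };
}

// The wrapper splits the utterance into sentences and emits packets as soon
// as each one is synthesized, so the playback queue fills ahead of playback.
async function speak(utterance, interactionId, synthesize, send) {
  const sentences = utterance.match(/[^.!?]+[.!?]*/g) ?? [utterance];
  let seq = 0;
  for (const sentence of sentences) {
    const tts = await synthesize(sentence.trim());   // your TTS of choice
    send(makeSpeechPacket({ interactionId, seq: seq++, sentence, tts }));
  }
}
```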
Very interesting. I have no real experience in multiplayer game development, but it seems to me that there are many different ways in which one might implement a game engine and many ways to split the responsibilities between the web clients and the server. It would be a mistake, I think, to make any changes to the TalkingHead class without first having the game engine and knowing its architecture.
I know that Ready Player Me supports game engines such as Unity and UE, but do you know of any existing multiplayer game engines supporting Three.js?
On their site, Ready Player Me seems to have several multiplayer games that run on a browser. They are using a game engine/SDK called Hiber3D. Any thoughts about it?
P.S. The index.html is just a test app for the class. The TalkingHead module/class is what the project is all about.
I like RPM for sure; we use it in Ethereal Engine, which is in fact a Three.js multiplayer metaverse game world with a data-driven architecture and ECS. I haven't tried Hiber myself, but I do know of several of these engines, many of which are quite interesting.
The layer cake approach I used to split your code is really just an attempt at a minimally invasive change, and the large single file seemed to split surprisingly well into those separate concerns. My thinking was that a more elegant refactoring could be done later. Also, index.html and minimal.html and so on don't need to change at all, which is nice; effectively the API surface is untouched.
As for the best refactoring: although I've written several video games, I don't know that there's a canonical or best practice, to be honest. I am leaning away from OOP entirely and have more recently become fond of the idea of reactivity, or data-driven architectures, where there's a largely declarative framing for the work and code modules tend to be message handlers rather than forward imperative programming. I'm also fond of ECS patterns to some degree. I like Erlang/Elixir and I like the ideas in this book: https://www.dataorienteddesign.com/dodbook ... but honestly there are new ideas and new approaches all the time. It would take a group analysis by several people to really figure out the best patterns. I agree it would take more thought.
I basically made a couple of changes:
1) In the layer cake approach I moved the Three.js startup into one file. Now the main talkinghead.mjs file in my version of your repo is the only thing concerned with starting up the Three.js machinery itself. That makes it easier to drop into a game or application (as I have done).
2) I also saw a reasonable separation between lower-level animation driving (body and face physical movement) and 'articulation' (speech synchronization). In fact, I just went a bit further and committed a change that separates audio itself out into a separate layer.
As a result, I can now use your work more easily in a few different projects. I'm using it in a couple of them now:
In Orbital (an agent-driven framework) I use it here: https://github.com/orbitalfoundation/orbital/blob/main/puppet/observer-performance.js . On my server side I deal with OpenAI and TTS, and also run the audio back through Whisper (STT) to get word timing information (I suspect OpenAI will offer a single-shot solution here at some point). I then send those packets down to the clients and they play them back. In this project I am building a multiplayer world where people can talk to each other, so synchronization between all clients was important.
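The server-side flow is roughly this (a pseudocode-level sketch; askLLM, splitIntoSentences, synthesize, transcribeWordTimings, and multicast are placeholders for the actual OpenAI, TTS, Whisper, and transport calls):

```js
// Rough shape of the server-side pipeline; all helper names are placeholders.
async function handlePrompt(prompt, interactionId) {
  const reply = await askLLM(prompt);                  // e.g. an OpenAI chat completion
  let seq = 0;
  for (const sentence of splitIntoSentences(reply)) {
    const audio = await synthesize(sentence);          // TTS of choice
    const words = await transcribeWordTimings(audio);  // Whisper run "backwards" for word timings
    multicast({ interaction: interactionId, seq: seq++, text: sentence, audio, words });
  }
}
```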
And in Ethereal Engine (a Three.js multiplayer game world) I use it here: github.com/anselm/ee-npc ... I'm trying it out there instead of my own earlier pipeline (which used Coqui for TTS with viseme generation and a simple viseme player on the client side). In this use case the body animation system you've built cannot be used (since they have their own system); also, their engine uses Reallusion rigs that are further processed via VRM, so their lower-level rigging and morph target format is totally alien. At the same time, it isn't hard to retarget to VRM, and I've successfully driven those rigs as well. Ultimately I will have to better separate out driving only the head and face visemes/morph targets, but I haven't gotten there yet.
Note that for these last changes I haven't done regression tests against your original functionality, so I'm not sure my version of the repo is functionally identical to yours. I still need to really test it down to the metal to make sure I didn't break anything. The bottom parts of the stack work, but I haven't fully verified the top parts.
I browsed through all your changes, and I'm glad you have found some parts useful.
The big issue here, I think, is that talking heads and multiplayer games are different use cases and have different functional and non-functional requirements. For example, they involve one versus multiple avatars, one versus multiple personalities, different camera angles, framing, and viewing distances, interaction with the user versus avatar-to-avatar, different user dynamics, etc. Furthermore, game engines impose various limitations on implementation, timing, concurrency, use of computational resources, etc.
Unlike Ready Player Me, which focuses on games, the TalkingHead project, as the name suggests, focuses on the "talking head" niche — a form of presentation where an individual's head and shoulders are displayed. The only reason I ended up using a full-body avatar was to simulate lower body movements (e.g., shifting weight from one leg to another) to make upper body movement seem more natural.
That said, I have no intention of extending the scope of the TalkingHead project to include the use cases of multiplayer games, as that would introduce more functional and non-functional requirements, add complexity and interdependencies, and lead to compromises. This means the TalkingHead project will not address all the functional and non-functional requirements of the multiplayer games you are probably aiming at. From your perspective, if multiplayer games are your focus, that's a problem.
Now, you can continue to develop your own branch, but I suspect it will move further and further away from the main branch. Since you clearly have a strong vision and the needed skills, my suggestion is that you take what is useful and start a new project that aligns better with multiplayer game use cases.
I think the above is also related to how and why you have split the class. It may serve your purposes, and that's great, but as it is now, I don't think it adds value to the TalkingHead project. I also think the way you have done the split mixes inheritance (is-a relationship) and composition (has-a relationship) in a mistaken manner, but, as you said, there are different design philosophies and preferences.
I also briefly looked at the Ethereal Engine (thanks for the link). I should, again, point out that I'm not a game developer, and didn't look at the source code, so I might be mistaken, but based on the examples, the engine already has most of the things in place: animations, audio, etc. What is missing - related to TalkingHead class functionality - is basically just the body language in interactions (including lip-sync and facial expressions).
Body language is just another "language" like English and French. In the long run, GPT-like models will output animations just as they output any other language; I have already tried such products. In the short term, however, OpenAI's new GPT-4o probably has the best voice quality but offers no timestamps, body language, blendshapes, or visemes, so you'll probably need Whisper, a personal body language script for each game character (similar to poseTemplates and animMoods), and "factories" that can turn templates/text/audio into the animation format the game engine supports (similar to the lip-sync modules, poseFactory, and animFactory).
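To illustrate what I mean by a body language script and a factory (a hypothetical shape only, not the actual poseTemplates/animMoods format):

```js
// Hypothetical per-character body language script, not the actual
// poseTemplates/animMoods data structures.
const bodyLanguage = {
  neutral: { gestures: ["weightShift", "blink"], intervalMs: [3000, 8000] },
  happy:   { gestures: ["smile", "nod"], intensity: 0.7 },
  angry:   { gestures: ["frown", "leanForward"], intensity: 0.9 }
};

// Factory sketch: timed visemes in, engine-specific animation tracks out.
// mapVisemeToMorphTarget is a placeholder for the retargeting step (e.g. to VRM).
function toEngineTracks(visemes, vtimes, targetFormat) {
  return visemes.map((viseme, i) => ({
    target: mapVisemeToMorphTarget(viseme, targetFormat),
    time: vtimes[i],
    weight: 1.0
  }));
}
```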
Closed PR due to inactivity.
At the moment the TalkingHead constructor initializes Three.js and also talks to third-party services. This reduces the reusability and modularity of the code. A more modular pattern would, for example, make it easier to have multiple avatars in one scene, or to swap out different LLM or TTS capabilities. If these parts can be separated to some degree, it may become possible to use TalkingHead in third-party experiences such as 3D games, or in situations where you want multiple AI-driven puppets. This PR is a first step in that direction, to test the waters and see what Mika thinks of the idea of splitting things up.
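For example, in a more modular setup the app could own the scene and just add avatars to it, something like this (an aspirational sketch; TalkingHeadActor does not exist, it just illustrates the direction):

```js
// Aspirational sketch of a more modular split, not the current API.
import * as THREE from "three";

const scene = new THREE.Scene();          // the app owns scene, camera, and renderer
const camera = new THREE.PerspectiveCamera(35, innerWidth / innerHeight, 0.1, 100);

// Hypothetical actors that only drive their own rig and speech playback.
const alice = new TalkingHeadActor({ url: "alice.glb" });
const bob   = new TalkingHeadActor({ url: "bob.glb" });
scene.add(alice.object3D, bob.object3D);

// Fed datagrams from outside; no built-in TTS or LLM calls.
alice.speakAudio(packetFromServer);
```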