jitsi / jitsi-meet

Jitsi Meet - Secure, Simple and Scalable Video Conferences that you use as a standalone app or embed in your web application.
https://jitsi.org/meet
Apache License 2.0

VR mode #5269

Closed Giszmo closed 3 years ago

Giszmo commented 4 years ago

Is your feature request related to a problem you are facing? With bigger groups of people who just want to casually hang out, coordinating who gets to talk and whom you have to listen to becomes awkward, and you quickly need moderation. The bigger the group, the less time each individual gets to talk. Yet humans have a remarkable capacity to process multiple sources of sound, provided the sources are directional and focusing on one of them is possible.

Describe the solution you'd like Just like the optional tile mode, provide a Virtual Reality mode that lets each participant control their orientation and distance relative to other participants.

Describe alternatives you've considered There are alternative products focused on VR, but I'm not aware of any that combines a VR environment with video streaming. Nor am I aware of a good open source project even if video streaming were not a requirement.

cianoscatolo commented 4 years ago

Let me remind you: this is run through a browser in most cases. Jitsi is supposed to be a lightweight videoconference tool. If you want VR, I'd suggest VRChat; your idea looks quite unfitting to me.

Giszmo commented 4 years ago

I want VR mostly for the audio positioning, not to explore beautiful interactive landscapes together. The most basic version, which would simply attenuate volume based on the distance to other participants, would already be a great feature and, I assume, simple to implement. The orientation of each participant could then further enhance the audible separation of sound sources, though I agree it would not be an obvious feature for most people currently using and enjoying Jitsi.

These two parameters, position and orientation of participants, could be kept completely separate from the video stream and used only to control which audio stream is heard, at which volume, on which ear.

Without the 3D room it would feel a bit like a geeky hack, but for the web I agree the 3D part might be a bit too much.
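For what it's worth, the "most basic version" could probably be built with the Web Audio API alone: one GainNode per remote participant whose gain falls off with distance. A minimal sketch, assuming the participant positions and the MediaStream plumbing exist elsewhere (none of this is an existing Jitsi API):

    // Minimal sketch: distance-based attenuation with the Web Audio API.
    // attachSpatialGain() and getDistance() are assumed helpers, not Jitsi APIs.
    const ctx = new AudioContext();

    function attachSpatialGain(stream, getDistance) {
        const source = ctx.createMediaStreamSource(stream);
        const gain = ctx.createGain();

        source.connect(gain).connect(ctx.destination);

        // Re-evaluate the distance a few times per second.
        setInterval(() => {
            const d = getDistance(); // distance between me and this speaker
            gain.gain.value = 1 / Math.max(1, d); // simple inverse falloff
        }, 250);
    }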

jamesodhunt commented 4 years ago

I came across this issue while looking for collaborative, high-quality wideband audio for singers and instrumentalists. Jitsi looks like the best option for many reasons, not least of which is the Opus codec. In fact, it would be perfect... if it were possible for the meeting host to adjust the volume levels of individual participants and their stereo placement.

Basically, I think what I'm after is combining Jitsi Meet with SoundScape Renderer. These pictures show exactly what I'd like to achieve with Jitsi: arbitrary placement of individual meeting participants, with the ability to adjust their volume and/or mute individuals (or, ideally, apply mute/volume levels to groups of participants):


jamesodhunt commented 4 years ago

... and yes, the reason for this is the current global situation, so if this were somehow possible, it would make a lot of musicians very happy :smile:

Giszmo commented 4 years ago

While I'm happy to see support for this issue, I doubt Jitsi or any remote collaboration tool will be suited for a chorus or orchestra playing together in real time, due to latency. Voice over IP has come a long way, but it's still notorious for latency issues where people talk over each other because they hear each other half a second late. For a chorus, even 30 ms of latency would lead to everyone slowing down to adapt to the delayed perception of the others, who in turn slow down too ... In fact, I was fascinated last week when we tried to sing "Happy Birthday" in a group chat: no matter how slowly we in my room sang, the others were even slower. Our latency was probably more like 200 ms, but the same effect would play out with any double-digit-millisecond delay.

Giszmo commented 4 years ago

The interface your links show is close to what I had in mind though, except that each participant should be able to move themselves, which from their perspective means moving all others as a group.

jamesodhunt commented 4 years ago

Hi @Giszmo - you are right, latency is going to be the biggest issue for performers. In fact, I've just heard a recording of a musical group trying to perform together remotely. Unfortunately it sounded truly terrible due to latency issues. I'm planning to experiment on a very fast network, but even then I suspect it will still be "bad" (however, I'd like to hear exactly how bad :smile:)

Musicians aside, "3D placement" of individual meeting participants would be a very useful feature.

lalalune commented 4 years ago

Hi all, @jamesodhunt, @Giszmo --

So there's a thing called Aframe. It's an entity-component system built on Three.js. There is a networking library on top of it called Networked-Aframe (NAF), which abstracts a basic messaging system into Aframe's entity-component system. There are multiple adapters for NAF, including one for Janus by the Mozilla Hubs team (who use it for exactly what you are all talking about), as well as ones for socket.io and vanilla peer-to-peer WebRTC, and so on.

There isn't an adapter yet for Jitsi -- but it would be great if there was. We're trying to build it now.

Why does this matter? Because it doesn't just work in VR. It works on your phone, on almost any device, and you can set it up for an overhead view. Any way you slice it, you can decide how loud people are and who to send packets to based on their position in space, a position they can control with arbitrary input -- and I think this is a big deal. It lets us build virtual conferences, even flat ones, because we can decide who should be connected and who should send (or, ideally, not send) packets to whom, based on heuristics that are easy to understand for both user and developer.
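As a sketch of that packet-level idea: given everyone's position, a client could subscribe only to peers within earshot. subscribeTo() below is a made-up placeholder for whatever the conferencing layer offers; nothing here is an existing Jitsi or NAF API:

    // Hypothetical sketch: only receive media from peers within earshot.
    // positions: Map<peerId, {x, y}>; subscribeTo() is a made-up helper.
    const EARSHOT = 10; // world units

    function updateSubscriptions(myPos, positions, subscribeTo) {
        const audible = [];

        for (const [peerId, pos] of positions) {
            const d = Math.hypot(pos.x - myPos.x, pos.y - myPos.y);

            if (d <= EARSHOT) {
                audible.push(peerId);
            }
        }

        subscribeTo(audible); // drop everyone else's packets
    }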

(And not to be crass, but we would pay a reasonable hourly rate to a capable developer who wants to help us build this and release it for free to the community!)

I've seen some successful projects get developed via GitHub threads. If y'all want to start working on this, I'll get started on a repo :)

udexon commented 4 years ago

I am new to Jitsi.

I have started a project on 3D VR AR Conferencing recently:

https://github.com/udexon/Phoom

Excuse the name; I am sure you notice the similarity to a famous app.

I believe the problem we need to solve is quite simply the following (see the sketch after this list for step i):

i. Separate the foreground (human face or body) image from the (static) background.

ii. Most video encoders already do something like this for P-frames, B-frames and related algorithms, so what we need to do is identify the code in Jitsi that does it.

iii. At the display/rendering end, render the extracted foreground image as an avatar in a 3D view.

iv. In my repo, I have included links to an OpenCV WASM implementation, which I think MAY help.
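For step i, a rough sketch of background subtraction with OpenCV.js (assuming an opencv.js build that includes BackgroundSubtractorMOG2; the video element and canvas id are placeholders):

    // Sketch: separating a moving foreground from a static background.
    // Assumes opencv.js is loaded; 'outputCanvas' is a placeholder canvas id.
    const video = document.querySelector('video'); // any playing video element
    const cap = new cv.VideoCapture(video);
    const frame = new cv.Mat(video.height, video.width, cv.CV_8UC4);
    const fgMask = new cv.Mat(video.height, video.width, cv.CV_8UC1);
    const fgbg = new cv.BackgroundSubtractorMOG2(500, 16, true);

    function processFrame() {
        cap.read(frame);           // grab the current frame
        fgbg.apply(frame, fgMask); // mask is white where the foreground is
        cv.imshow('outputCanvas', fgMask);
        requestAnimationFrame(processFrame);
    }
    processFrame();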

Collaborators are welcome.

What I need help with from you guys is this:

Thank you very much.

Giszmo commented 4 years ago

@udexon I guess you are shooting way beyond the goal of this issue. As described above, it's primarily about introducing audio projection; all the UI/visual changes are merely to help with orientation in a 2D space.

udexon commented 4 years ago

> @udexon I guess you are shooting way beyond the goal of this issue. As described above, it's primarily about introducing audio projection; all the UI/visual changes are merely to help with orientation in a 2D space.

Thank you very much for your response.

I'm not sure what you mean by "beyond the goal".

1. My understanding is that no one seems to be working on a "VR mode", either within Jitsi or independently.

2. Whatever you mean by "VR mode", it should comprise both visual and audio changes.

3. I am working on the visual part of VR because, ultimately, it is the low-level foundation that audio interaction will need -- e.g. stereo sound, direction of sound, loudness changing with distance.

4. If you are primarily interested in audio, perhaps you can suggest collaborators for the visual part. Let us solve the visual problem first, or independently.

Do you agree with the above?

Giszmo commented 4 years ago

@udexon to my understanding, the Jitsi team is not excited about changing much for a VR mode. Therefore I see a bigger chance of acceptance if it just adds a map of the "room" to the screen and a way for each participant to position themselves in that room, with audio attenuated and positioned according to the position and orientation of both the source and the sink.

Your feature of removing the static background from the video stream could be nice on its own, so I wouldn't need to clean up my room before joining a call :D but it's not specific to the issue here. By that logic you could also bring up that faces should be scaled to equal size, that skin tones should be compensated for different lighting conditions, that the ... those are very advanced topics that I don't care about.

I want a way to have meetings on Jitsi where I can be in the same room with 40 people but with maybe 6 talking at any given time without disturbing each other, just like at a party in my living room. I want to position myself next to the people I care most about but still faintly hear that others are also talking in the room.

In its simplest implementation, that "room" would start with all participants in the center; by clicking somewhere, I would jump there and hear all others attenuated according to distance. Then I would add a control for my own orientation and map the stereo attenuation to it. Then I would attenuate speakers who are facing away from me. Then, depending on what the audio library is capable of, I would add phase shifting to better project sound (audio waves from the left hit the left ear not only louder but also first).

That alone would already be an awesome feature for me, allowing Jitsi parties with more participants than is enjoyable currently.
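Incidentally, the Web Audio API's PannerNode covers most of that list in a single node: distance attenuation, an orientation-dependent "sound cone", and, with the HRTF panning model, the interaural time differences mentioned above. A hedged sketch, with the position plumbing assumed rather than taken from any Jitsi API:

    // Sketch: spatializing one participant's audio with a PannerNode.
    // The stream, pos and facing inputs are assumed helpers, not Jitsi APIs.
    const ctx = new AudioContext();

    function spatialize(stream, pos, facing) {
        const source = ctx.createMediaStreamSource(stream);
        const panner = new PannerNode(ctx, {
            panningModel: 'HRTF',     // adds interaural time/level cues
            distanceModel: 'inverse', // volume falls off with distance
            refDistance: 1,
            positionX: pos.x, positionZ: pos.y,
            orientationX: facing.x, orientationZ: facing.y,
            coneInnerAngle: 120,      // quieter when facing away from us
            coneOuterAngle: 280,
            coneOuterGain: 0.3
        });

        source.connect(panner).connect(ctx.destination);
    }

    // My own position and orientation go on ctx.listener
    // (e.g. ctx.listener.positionX.value = myPos.x).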

udexon commented 4 years ago

> @udexon to my understanding, the Jitsi team is not excited about changing much for a VR mode. Therefore I see a bigger chance of acceptance if it just adds a map of the "room" to the screen and a way for each participant to position themselves in that room, with audio attenuated and positioned according to the position and orientation of both the source and the sink.

> Your feature of removing the static background from the video stream could be nice on its own, so I wouldn't need to clean up my room before joining a call :D but it's not specific to the issue here.

Free software programmers work independently, but of course we must consider how the modules one decides to work on depend on the others.

"removing the static background from the video stream" is not too complicated if you have seen OpenCV examples.

The difficulties that I am facing now is to identify the code inside Jitsi that does the above. It may be a trivial task for someone who is familiar with Jitsi. Hence I ask the question here, so that I might get a quick answer.

> I want a way to have meetings on Jitsi where I can be in the same room with 40 people but with maybe 6 talking at any given time without disturbing each other, just like at a party in my living room. I want to position myself next to the people I care most about but still faintly hear that others are also talking in the room.

If you have read some of the OpenCV tutorials concerning this problem, you may have a better idea of how much time is required to do this.

> In its simplest implementation, that "room" would start with all participants in the center; by clicking somewhere, I would jump there and hear all others attenuated according to distance. Then I would add a control for my own orientation and map the stereo attenuation to it. Then I would attenuate speakers who are facing away from me. Then, depending on what the audio library is capable of, I would add phase shifting to better project sound (audio waves from the left hit the left ear not only louder but also first).

What you just mentioned above are "high level" functions that are dependent on the "low level" functions (separating foreground avatars from background).

With the considerations above, perhaps you might want to revise some of your expectations, especially about project management -- who should do what, and when, etc.

Again, my apologies if I sounded patronising -- I was just trying to be technically accurate and neutral about what we want to achieve.

Giszmo commented 4 years ago

@udexon you checked my GitHub profile and decided I was not a coder, despite my more than 1000 commits last year. Mhm. Ok. Yes, you are patronizing, and you should probably review the issue you are commenting on. The very first reply was basically "VR? No way", and you are trying to make things more complicated than what I am trying to achieve. Get out of my issue. Open your own issue. Yours is interesting but unrelated.

udexon commented 4 years ago

> @udexon you checked my GitHub profile and decided I was not a coder, despite my more than 1000 commits last year. Mhm. Ok. Yes, you are patronizing, and you should probably review the issue you are commenting on. The very first reply was basically "VR? No way", and you are trying to make things more complicated than what I am trying to achieve. Get out of my issue. Open your own issue. Yours is interesting but unrelated.

My apologies. I will delete my comments concerning your skills.

> you are trying to make things more complicated than what I am trying to achieve.

I am seriously not sure what you mean.

How much experience do you have with OpenCV?

Calling OpenCV "complicated" is a big alarm bell ....

Perhaps you are not familiar with the field, given that your contributions are to cryptocurrency-related repos?

saghul commented 4 years ago

Folks, please keep it about the code. If you get into personal arguments I'll have to lock the issue.

udexon commented 4 years ago

> Folks, please keep it about the code. If you get into personal arguments I'll have to lock the issue.

Noted.

TL;DR: I proposed a solution to OP's problem using OpenCV.

OP and I still do not agree on whether my solution is useful.

Perhaps others may vote on it?

lalalune commented 4 years ago

@udexon what you are looking for is volumetric video. Check out Depthkit -- they have an example somewhere of streaming with Vimeo. If you use a green screen you can get a very good FPS using chroma extraction as a preprocess. https://github.com/vimeo/vimeo-unity-sdk/wiki/Streaming-volumetric-video-captured-with-Depthkit

If you really want to go nuts on something, some friends are optimizing the 4d views player so it'll run a bit better in VR: https://www.youtube.com/watch?v=NXpYeQ_aaTg

@Giszmo : you are looking for positional audio culling. Someone has done exactly what you're looking for, with Jitsi, just in the last few weeks: https://github.com/capnmidnight/Calla

However, if you're interested, we're looking to do the same thing using Aframe and probably Resonance Audio, so we can have a "speakers" mode, where the audio is fairly even across the room for a performer or speaker, and a regular user mode, where people get quieter as you walk away.

A good example of this is Mozilla Hubs. We're kind of doing that, but with a lot more social MMO features on top (and a more modern, ECS data-driven approach with GraphQL).

@Giszmo If you're interested in integrating Jitsi into WebXR with Three.js / Aframe and you're a capable coder, I would absolutely work with you on that, and could even get you a little compensation for your hours, as long as the end result is free and open source :)

lalalune commented 4 years ago

Also @udexon check out my repo here: https://github.com/shawticus/facemesh-threejs -- it's just a basic starter kit with a UV-mapped morphable face model from TensorFlow.js set up for Three.js. The next step is to project the webcam onto the mesh and send the cropped camera image and current mesh pose via WebRTC channels :)
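For reference, the face-mesh half of that pipeline is only a few lines with @tensorflow-models/facemesh; a sketch against its 0.x API, with the video element assumed:

    // Sketch: estimating face landmarks from a webcam with TensorFlow.js.
    // Based on the @tensorflow-models/facemesh 0.x API.
    import * as facemesh from '@tensorflow-models/facemesh';

    async function trackFace(videoEl) {
        const model = await facemesh.load({ maxFaces: 1 });
        const faces = await model.estimateFaces(videoEl);

        // ~468 [x, y, z] landmarks, enough to pose a morphable mesh
        return faces.length ? faces[0].scaledMesh : null;
    }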

udexon commented 4 years ago

> Also @udexon check out my repo here: https://github.com/shawticus/facemesh-threejs -- it's just a basic starter kit with a UV-mapped morphable face model from TensorFlow.js set up for Three.js. The next step is to project the webcam onto the mesh and send the cropped camera image and current mesh pose via WebRTC channels :)

I think we are thinking of the same thing.

Still, Jitsi is a mature platform with LOTS of users and a mature code base. It would be good if we could just get the stream. That is the reason I call my project Phoom -- a better Zoom, with AR.

I have spent the past 24 hours looking through Jitsi code.

I might need another 48 hours doing this alone.

If someone familiar with the Jitsi code base could help, it might cut the search down to 20 minutes!!

Thanks in advance!!

Update:

I hope this is the lowest-level code. Things are starting to make sense now:

https://developer.mozilla.org/en-US/docs/Web/API/MediaStream

https://github.com/jitsi/jitsi-meet/blob/master/react/features/stream-effects/presenter/JitsiStreamPresenterEffect.js

Update 2:

Now I am looking at:

https://github.com/jitsi/jitsi-meet/blob/master/react/features/base/tracks/middleware.js

    MiddlewareRegistry.register(store => next => action => {
        switch (action.type) {
        // ...
        case TRACK_UPDATED:
            // TODO Remove the following calls to APP.UI once components interested
            // in track mute changes are moved into React and/or redux.
            if (typeof APP !== 'undefined') {
                const result = next(action);

                const { jitsiTrack } = action.track;
                const muted = jitsiTrack.isMuted();
                const participantID = jitsiTrack.getParticipantId();
                const isVideoTrack = jitsiTrack.type !== MEDIA_TYPE.AUDIO;

                if (isVideoTrack) {
                    if (jitsiTrack.type === MEDIA_TYPE.PRESENTER) {
                        APP.conference.mutePresenter(muted);
                    }
                    // ... (excerpt continues)
Does anyone care to share ideas or documentation about this part?

Seems like a critical part of the app.

lalalune commented 4 years ago

@udexon this is following a middleware pattern --

APP is your whole app. Think of Express: when you run app.use(someMiddleware), that middleware is added to your app and gets to see every request that passes through.

This function is much the same -- it registers middleware that handles Redux actions as they flow through the store, the way Express middleware handles requests:

    MiddlewareRegistry.register(store => next => action => { // add middleware to the store

The different "cases" are the kinds of action that can occur. If we receive a TRACK_UPDATED action, we run this logic and pass the action on to the next middleware in the chain:

    case TRACK_UPDATED: // handle the TRACK_UPDATED action and do some logic

When you see something like this--

    if (localTrack
            && (jitsiTrack = localTrack.jitsiTrack)
            && jitsiTrack.getCameraFacingMode() !== action.cameraFacingMode) {
        store.dispatch(toggleCameraFacingMode());
    }

note that 'jitsiTrack = localTrack.jitsiTrack' is an assignment, not an equality check, so the jitsiTrack variable gets populated right there.
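So, as a sketch, hooking your own logic into track updates would look something like this (same registration pattern as the excerpt above; handleTrack() is a made-up placeholder):

    // Hypothetical sketch: a custom middleware reacting to track updates.
    // handleTrack() is made up; the action shape follows the excerpt above,
    // and TRACK_UPDATED is assumed to be imported as in that file.
    MiddlewareRegistry.register(store => next => action => {
        if (action.type === TRACK_UPDATED) {
            const { jitsiTrack } = action.track;

            // e.g. route this participant's audio into a spatializer
            handleTrack(jitsiTrack.getParticipantId(), jitsiTrack);
        }

        return next(action); // always pass the action along
    });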

lalalune commented 4 years ago

@udexon send me an email to shawmakesmagic@gmail.com and I'll give you an example of streaming tensorflowjs with sockets + jitsi that mostly works. If you want to work together to clean it up and make it something we can put out into the world, I'm all about it, as long as it's FOSS :)

udexon commented 4 years ago

> @udexon send me an email to shawmakesmagic@gmail.com and I'll give you an example of streaming tensorflowjs with sockets + jitsi that mostly works. If you want to work together to clean it up and make it something we can put out into the world, I'm all about it, as long as it's FOSS :)

Email sent.

Question for ALL:

If you refer to Hu Ningxin's OpenCV WASM demo:

https://codepen.io/huningxin/pen/NvjdeN

    const vc = new cv.VideoCapture(video);
    const src = new cv.Mat(video.height, video.width, cv.CV_8UC4);

    function processVideo() {
        stats.begin();
        vc.read(src);
        // ...
    }

The object vc wraps the video element's stream; vc.read(src) stores the current video frame into src for further processing.

What is the easiest way to do this in Jitsi?
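One browser-generic answer, independent of Jitsi internals: every remote participant ends up rendered in a <video> element, and a frame can be copied out through a canvas. A sketch (the selector is a placeholder, not a stable Jitsi id):

    // Sketch: grabbing the current frame of any <video> element via a canvas.
    function grabFrame(videoEl) {
        const canvas = document.createElement('canvas');
        canvas.width = videoEl.videoWidth;
        canvas.height = videoEl.videoHeight;

        const ctx2d = canvas.getContext('2d');
        ctx2d.drawImage(videoEl, 0, 0);

        return ctx2d.getImageData(0, 0, canvas.width, canvas.height);
    }

    const frame = grabFrame(document.querySelector('video')); // placeholder selector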

udexon commented 4 years ago

Found code for the TFJS BodyPix mask and modified JitsiStreamBlurEffect.js:

https://github.com/udexon/Phoom/blob/master/Jitsi_Meet_BodyPix.md
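The BodyPix side of that is roughly the following (a sketch against the @tensorflow-models/body-pix 2.x API; method names changed between versions):

    // Sketch: person segmentation with @tensorflow-models/body-pix.
    import * as bodyPix from '@tensorflow-models/body-pix';

    async function maskPerson(videoEl, canvasEl) {
        const net = await bodyPix.load();
        const segmentation = await net.segmentPerson(videoEl);

        // Black out everything that is not the person.
        const mask = bodyPix.toMask(segmentation);
        bodyPix.drawMask(canvasEl, videoEl, mask, 1.0, 0, false);
    }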


Update 3 (8 May 2020)

https://github.com/udexon/Phoom/blob/master/R3ML_JM_Fundamentals.md

After several days (much longer than I initially estimated) of combing through the Jitsi Meet code, without any help from anyone whatsoever, I found this command:

APP.store.getState()['features/base/tracks']

So Jitsi Meet allows lots of internal variables to be accessed globally (via the console or otherwise) through the variable APP.
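Building on that, pulling the local video track's raw MediaStream from the console might look like this (a sketch: the exact shape of the tracks state is an assumption, and getOriginalStream() is lib-jitsi-meet's accessor):

    // Sketch: fetching the local video track's MediaStream via the global APP.
    // The tracks state is assumed to be an array of track descriptors.
    const tracks = APP.store.getState()['features/base/tracks'];
    const localVideo = tracks.find(t => t.local && t.mediaType === 'video');

    if (localVideo) {
        const stream = localVideo.jitsiTrack.getOriginalStream();
        console.log(stream); // a regular MediaStream, usable with canvas/OpenCV
    }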

Next step:

https://github.com/jitsi/jitsi-meet/blob/master/react/features/stream-effects/blur/JitsiStreamBlurEffect.js

====

I found a thread on a similar issue:

https://news.ycombinator.com/item?id=22823070

(Update) Following up with the search keyword "blur", I found:

https://github.com/jitsi/jitsi-meet/blob/master/react/features/stream-effects/blur/index.js

(Update 2) I had a look at the package used in the code above. Very impressive demo. It even works in the Firefox mobile browser!!

https://www.npmjs.com/package/@tensorflow-models/body-pix

https://storage.googleapis.com/tfjs-models/demos/body-pix/index.html

Anyway, I have opened a new issue for further discussion:

https://community.jitsi.org/t/how-to-get-current-video-frame-in-jitsi/49250

Cdddo commented 4 years ago

I'm also very interested in the possibility of a 3D mode (on both normal flat displays and VR devices), with people being able to navigate the space and having positional audio. This would be incredibly useful for large meetings such as conferences and webinars, where you want to interact with smaller groups without having to end the whole event. Of course, applications such as VRChat already implement this, but they have a lot of additional options that aren't necessary, and most importantly have too many steps to actually get to an event; it's not at all simple for a new user to just jump into a meeting without an elaborate setup process (avoiding which is a major Jitsi advantage).

I'm personally starting a project to do exactly this (not based on Jitsi), but if the Jitsi team can add this functionality, working straight in the browser, that would be incredible.

@Giszmo if you're interested send me a message.

Giszmo commented 4 years ago

@Cdddo you name it. I want the positional audio of VRChat without having to explain to all participants how to install VRChat. (I haven't installed it myself yet, as it seems not to be trivial on Linux.)

capnmidnight commented 4 years ago

I managed to hack positional audio into the default Jitsi Meet install using the IFrame API. It would probably be easier using lib-jitsi-meet, but I didn't have the time to put together the full meeting UI at the time.

https://github.com/capnmidnight/Calla

Giszmo commented 4 years ago

Another project similar to Calla: https://theonline.town/

capnmidnight commented 4 years ago

Well, yes, but they aren't built on Jitsi and they aren't open source. My only point was that I have code to show that it's possible.

Giszmo commented 4 years ago

@capnmidnight your project is awesome. Both projects do roughly what I had in mind for a simple implementation, but even simpler would do, as I don't think the tile map or background picture is necessary for people to organize into groups. It could just as well be a chess board: let's meet at f3. I would focus more on the audio aspects instead; add stereo, so you hear one person to your right and the other to your left, etc.

capnmidnight commented 4 years ago

Yes, I have full spatialized audio. You hear people to your left/right along the X-axis, and in front of/behind you along the Y-axis.

udexon commented 4 years ago

It has been quiet for a while.

I am back here to report the latest work:

I have done the foundational work for an Augmented Reality conferencing app, as initially proposed.

So we now have a simplified development framework using Phoscript, derived from the Forth programming language, executable from within Jitsi Meet, which in turn interfaces with these state-of-the-art libraries:

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.