Add Log In With Discord

Stages to Establishing a Discord Voice Call

There are several things needed to connect to a Discord call:

A Discord access token for the user, via the HTTP API and OAuth2.
The URI of a gateway server, via the HTTP API.
A secure websocket (wss) connection established to the gateway.
Periodic heartbeats and handling of WSS messages that require action: basically, follow the spec so that the established connection is maintained.
Start/join a Discord call:
1. Get voice server information from the gateway.
  - Request contains the Guild(?) and Channel IDs.
  - A Voice Server Update event contains the details of the voice server we should use.
  - A Voice State Update event contains the details of our voice connection state, including Channel ID and Voice Session ID.
  - Both events must be received before continuing.
2. Create a WSS connection to the voice server, and identify. The server will respond with several things needed for the RTP connection: server port, server IP, supported encryption modes, and the RTP SSRC for the connection (used for multiplexing multiple media sources over a single RTP connection).
3. Send periodic heartbeats and handle WSS messages that require action: basically, follow the spec so that the established connection is maintained.
4. Establish a UDP connection with the voice server:
  1. Connect to the voice server using UDP.
  2. Determine our public IP address and port for the UDP connection (the voice server can do that via IP discovery).
  3. Tell the voice server via Voice WSS the details for our end of the voice connection (UDP, IP, port, encryption mode).
  4. Receive via Voice WSS the encryption mode and encryption key for the voice connection.
  5. Send and receive voice packets.
    - The protocol used is RTP with XSalsa20/Poly1305 encryption.
    - The audio is encoded using Opus - stereo with 48 kHz sample rate.
    - The voice packet is RTP header + Encrypted Opus Audio.
    - The Encrypted Opus Audio uses the encryption key received via Voice WSS and a 24-byte nonce consisting of the 12 bytes of the RTP header followed by 12 NUL bytes.
    - Signal via Voice WSS when the speaking state changes (SSRC, plus a speaking bitmask consisting of active microphone, active soundshare, priority speaker).
    - Send a voice packet containing 5 frames of silence whenever breaking the audio stream. If using PTT, do this when the button is released. If using a form of voice detection, do this when voice is no longer detected.
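To make the packet layout above concrete, here is a minimal sketch of building the 12-byte RTP header and the 24-byte nonce derived from it. The header byte values (0x80, 0x78), the silence frame bytes, and the bit positions of the speaking flags are assumptions taken from my reading of Discord's voice documentation, and the function names are hypothetical. The encryption itself is left out; it would need a libsodium-style secretbox binding.

```python
import struct

# Speaking bitmask flags (bit positions are an assumption, not confirmed here).
SPEAKING_MICROPHONE = 1 << 0
SPEAKING_SOUNDSHARE = 1 << 1
SPEAKING_PRIORITY = 1 << 2

# One Opus silence frame (assumed value); five are sent when the audio stops.
OPUS_SILENCE_FRAME = bytes([0xF8, 0xFF, 0xFE])


def build_rtp_header(sequence: int, timestamp: int, ssrc: int) -> bytes:
    """Pack version/flags, payload type, sequence, timestamp and SSRC (big-endian)."""
    return struct.pack(
        ">BBHII",
        0x80,                    # version + flags (assumed constant)
        0x78,                    # payload type (assumed constant)
        sequence & 0xFFFF,       # 16-bit sequence number, wraps around
        timestamp & 0xFFFFFFFF,  # 32-bit timestamp, wraps around
        ssrc & 0xFFFFFFFF,       # SSRC received from the voice server
    )


def build_nonce(rtp_header: bytes) -> bytes:
    """Return the 24-byte nonce: the RTP header followed by 12 NUL bytes."""
    if len(rtp_header) != 12:
        raise ValueError("RTP header must be exactly 12 bytes")
    return rtp_header + bytes(12)
```

The voice packet would then be the header followed by the Opus frame encrypted with the key from Voice WSS and this nonce.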

If the call is between two users (i.e. it is a DM Call or Group DM Call, rather than joining a Voice Channel), I'm not sure at which point in the above flow the user initiating the call should wait for the other user(s). Do you go through the entire process resulting in the channel having a voice session associated with it and hope someone else in the channel then joins the voice session, or do you wait for someone else to join the voice session before creating the voice connection?

Discord OAuth2

I haven't done anything with OAuth2 other than the typical user thing of clicking/tapping/scanning something to use it to sign in somewhere.

There are 4 things involved in the OAuth2 flow, although I've added a fifth and sixth one to aid in comprehension:

The Client. This is the thing The Human is technically wanting to give permissions to.
The Resource Owner. This is The Human (other species, such as The Bot, can also own a resource). The Resource is the thing The Human wants to give The Client certain permissions to access/use (such as a Discord account).
The Authorization Server. This is the thing The User-Agent talks to when The Human wants a code. It gives the authorization code to The Client via The Client's Redirection URI(s).
The User-Agent. This is the thing The Human uses to prove they really are the human that owns the resource, and to grant/refuse permissions to The Client.
The Non-Client Client. This is the thing The Human really wants to give permissions to, but because the Client Secret should be kept secret The Non-Client Client needs a middleman to perform the functions of The Client.
The Token Server. This is the thing The Client talks to when The Non-Client Client wants to exchange a code or a refresh_token for an access_token and refresh_token.

I had written a really long comment about how things would/could work, how I was thinking of implementing it, and the like, but my computer crashed and the textarea wasn't restored when I reopened my Web browser. As a result, this comment will likely contain less content than it would have without the crash.

The first stage in the OAuth2 flow is obtaining an authorization code from the Authorization Server. This is done by making an HTTP GET request in the User-Agent to the Authorization Server, with The Human then authenticating (e.g. by username/password/2FA) and granting/denying permissions. The Authorization Server then redirects to The Client's Redirection URI, with a code query parameter.

There are two query parameters that can change the authorization code grant flow. The first is the state query parameter, which is used for CSRF protection: The Client generates an unguessable value, includes it in the authorization request, and checks that the same value comes back on the redirect. The second is the (Discord) prompt query parameter, which can change how The Human gives permissions if The Client has already been given permissions.
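As a rough illustration of the two parameters, here is a sketch of building the authorization URL and checking the state value on the redirect. The endpoint URL, the "identify" scope, and the "consent" prompt value are assumptions from my reading of Discord's OAuth2 documentation, and the function names are hypothetical.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed Discord authorization endpoint; check the current documentation.
AUTHORIZE_URL = "https://discord.com/api/oauth2/authorize"


def new_state() -> str:
    """Generate an unguessable value to tie the redirect back to this request."""
    return secrets.token_urlsafe(32)


def build_authorize_url(client_id, redirect_uri, scopes, state, prompt="consent"):
    """Build the URL The User-Agent is sent to for the authorization code grant."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,
        "prompt": prompt,
    })
    return f"{AUTHORIZE_URL}?{query}"


def state_matches(expected: str, redirect_url: str) -> bool:
    """Compare the state in the redirect with the one we generated."""
    returned = parse_qs(urlparse(redirect_url).query).get("state", [""])[0]
    # Constant-time comparison on bytes, so odd input cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), returned.encode())
```

If state_matches() returns False, the code in the redirect should be discarded rather than exchanged.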

Obtaining an Access Token

There are several steps involved in obtaining an access token. Although an OAuth2 provider can implement different methods, I am going with the most functional/secure one that Discord offers:

1. The Human wants to share permissions with The Client, so they interact with something to start the process (e.g. a Sign In With Discord button).
2. The User-Agent connects to The Client. The Client creates an identifier to remember The User-Agent (such as a session token) and gives The User-Agent something linked to the identifier (such as a cookie).
3. The Client redirects The User-Agent to The Authorization Server.
4. The Authorization Server asks The Human to identify themselves (such as by logging in).
5. The Authorization Server asks The Human if they want to grant/deny account permissions to The Client.
6. The Authorization Server redirects The User-Agent to The Client. It includes an authorization code in the redirect data.
7. The Client connects to The Token Server, and exchanges the code for an access_token and refresh_token.
8. The Client can now do the things The Human has given it permission to do (typically on the condition The Client only do things The Human explicitly tells them to).
9. The Client connects to The Token Server before/after the access_token expires, and exchanges the refresh_token for a new access_token and refresh_token.
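The two Token Server exchanges (code for tokens, and refresh_token for new tokens) differ only in their form fields. A minimal sketch, assuming Discord's token endpoint URL and that the client credentials are sent in the form body; the function names are hypothetical, and nothing is sent over the network until the request is passed to urllib.request.urlopen().

```python
import urllib.request
from urllib.parse import urlencode

# Assumed Discord token endpoint; check the current documentation.
TOKEN_URL = "https://discord.com/api/oauth2/token"


def build_token_request(fields: dict) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST to the token endpoint."""
    body = urlencode(fields).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def code_exchange_request(client_id, client_secret, code, redirect_uri):
    """Exchange an authorization code for an access_token and refresh_token."""
    return build_token_request({
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })


def refresh_request(client_id, client_secret, refresh_token):
    """Exchange a refresh_token for a new access_token and refresh_token."""
    return build_token_request({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })
```

Because the client_secret appears in both requests, this code belongs on The Client (the server side), not in anything shipped to The Human.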

I am going to skip step two for now because I don't (currently) see a need to remember the users that have logged into Discord in Game Chat Lite:

The Human wants to use the Discord functionality of Game Chat Lite, so they click the Login With Discord button.
Game Chat Lite creates a hidden (Visibility.Collapsed) WebView2 control and navigates to the Authorization Server's URL.
If the Authorization Server doesn't redirect to the Redirection URI, Game Chat Lite waits for the page to finish loading and makes the WebView2 control visible so The Human can identify themselves.
The Authorization Server asks The Human if they want to grant/deny account permissions to The Client.
The Authorization Server redirects the WebView2 control to The Client, but Game Chat Lite ignores the redirect. It makes the WebView2 control hidden again.
1. Game Chat Lite loads an HTML page (same origin as the Redirection URI) to talk to The Client, and calls an ECMAScript function: getToken(code).
2. The JS connects to the Redirection URI (with the code query parameter/value), which is a URI for a Server Sent Events stream. The Client updates The User-Agent as it makes progress, and The User-Agent relays that information (and any issues with the SSE connection) to Game Chat Lite via Web messages.
3. The Client connects to The Token Server, and exchanges the code for an access_token and refresh_token.
4. The Client sends the JSON received from The Token Server as a Server Sent Event, which the JS reformats as JSON and sends to Game Chat Lite as a Web message.
5. Game Chat Lite tries to parse the Web message. If successful, it has the values from The Token Server for the access_token, token_type, expires_in, refresh_token, and scope.
Game Chat Lite can now do the things The Human has given it permission to do (typically on the condition Game Chat Lite only do things The Human explicitly tells them to).
Game Chat Lite loads the HTML page before/after the access_token expires. It asks The Client to connect to The Token Server and exchange the refresh_token for a new access_token and refresh_token. The Client relays the JSON to Game Chat Lite via SSE.
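The parsing step and the refresh timing can be sketched as below. This is in Python for illustration only (Game Chat Lite itself would do this in its own language), the TokenSet type and function names are hypothetical, and the 60-second safety margin is an arbitrary assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenSet:
    access_token: str
    token_type: str
    expires_in: int      # lifetime in seconds, as reported by The Token Server
    refresh_token: str
    scope: str
    received_at: float   # local timestamp (seconds) when the message arrived

    def should_refresh(self, now: float, margin: float = 60.0) -> bool:
        """True once we are within `margin` seconds of expiry (or past it)."""
        return now >= self.received_at + self.expires_in - margin


def parse_token_message(message: str, received_at: float) -> Optional[TokenSet]:
    """Parse the relayed JSON; return None if it is malformed or incomplete."""
    try:
        data = json.loads(message)
        return TokenSet(
            access_token=str(data["access_token"]),
            token_type=str(data["token_type"]),
            expires_in=int(data["expires_in"]),
            refresh_token=str(data["refresh_token"]),
            scope=str(data["scope"]),
            received_at=received_at,
        )
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or missing/invalid fields (e.g. an error reply).
        return None
```

Refreshing slightly before expiry avoids a window where the access_token has already lapsed.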

If The User-Agent is only going to be following redirects without any input from The Human, there is no point actually showing The User-Agent. With suitable event handlers it is possible to update The Human on the progress throughout.

Server Sent Events

Server Sent Events (SSE) are a one-way communications method from a Web server to a Web client. The client connects to a URL for an SSE stream, and the server periodically sends events (a simple field: value text-based structure, with a blank line ending each event).
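To show what the text structure looks like on the wire, here is a small sketch that formats and parses events. It is deliberately simplified: it only handles "\n" line endings and the event and data fields, and the function names are hypothetical. Lines starting with a colon are comments, which a server can send as keep-alives.

```python
from typing import Dict, List, Optional


def format_event(data: str, event: Optional[str] = None) -> str:
    """Serialise one event; multi-line data becomes one data: line per line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def parse_events(stream: str) -> List[Dict[str, str]]:
    """Parse a complete stream into a list of field dictionaries."""
    events = []
    for block in stream.split("\n\n"):
        fields: Dict[str, str] = {}
        data_lines = []
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # skip empty lines and comment/keep-alive lines
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if name == "data":
                data_lines.append(value)
            else:
                fields[name] = value
        if data_lines:  # blocks without data are not treated as events
            fields["data"] = "\n".join(data_lines)
            events.append(fields)
    return events
```

In a browser the EventSource API does the parsing, so only the formatting half has a server-side counterpart.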

There are other ways of performing such communication, but I have no "I always do it that way" option. Yesterday, when I was thinking of similar things I'd used over the last few years, I listed Firebase push notifications, jQuery JSON fetching and setTimeout() HTTP 200/304 polling, and Firebase cloud database synchronisation. Server Sent Events means I've done things four different ways, and WebSockets would add a fifth way.

As my Web servers use HTTP/2, and the SSE stream is coded in PHP, there is a potential latency issue due to output buffering. There are three things needed for NGINX + PHP-FPM to treat the document as non-buffered output (something like chunked output, but not actually chunked because HTTP/2 doesn't use chunked transfer encoding):

At the top of the PHP file, while (ob_get_level()) { ob_end_flush(); } will disable all levels of output buffering for the PHP script.
After the PHP page echoes/prints an event, call flush();.
In an NGINX location block for the SSE file, fastcgi_buffering off;.

There can also be an issue with timeouts:

NGINX has a default setting of fastcgi_read_timeout 60s;. If no output is sent from PHP-FPM to NGINX for 60 seconds, NGINX closes the connection.
PHP-FPM 7.2 production php.ini has a default setting of max_execution_time = 30. If a PHP script spends more than 30 seconds executing, the script is killed with an error.
- PHP execution time isn't (unless on Windows) real time. Time the script spends waiting rather than executing, such as in sleep() and file_get_contents(), is not counted in the execution time.
- fopen() separately has a default setting of default_socket_timeout = 60.
The RFC for OAuth2 RECOMMENDS that a code be valid a maximum of 10 minutes.
If a code is reused, all tokens issued for that code SHOULD be revoked and the OAuth2 process needs to start from the beginning.
A refresh_token may be revoked when it is used. A communication issue could mean a new access_token was issued but isn't available, requiring the OAuth2 process be started from the beginning.
