watfordjc / GameChatLite

A potential WIP to create a stripped down Discord client that supports voice chat without high CPU usage.

Create user identification system #2

Open watfordjc opened 2 years ago

watfordjc commented 2 years ago

Feature Branch

Current feature branch for this issue: not created yet.

Progress


Background

Whilst working on #1, I realised that having a simple proxy relay traffic between the application and Discord is undesirable: it could allow Discord's servers to be flooded, leaving Discord with no option other than blocking my server's IP address and/or my app's API key.

I removed logins from my Web sites so long ago I'd have to check my Web site's archive section... site redesign and content consolidation started on 24th February 2014 (new site doesn't have logins). Consolidation of domain names started on 29th December 2014. Site migration (301 Moved Permanently, 410 Gone) was completed by 20th October 2015, with most of the NGINX rewrite configuration lines created on 13th January 2015. The switch to the old site only serving redirects and errors occurred at some point between January and October 2015, which means I haven't had a Web site with a login system for 6 years.

It isn't like I don't have anything using authentication nowadays. One of my internal Android apps uses TLS client certificates for authentication, as do my enterprise Wi-Fi SSIDs. Another internal Android app uses Firebase Cloud Messaging to receive notifications which I think uses Firebase Auth anonymous accounts, as does one of my published Android apps.

This issue is for the entire question of how the application is going to deal with authentication, identification, and the linking of identifiers.

watfordjc commented 2 years ago

GDPR, DPA, Data Controller, Data Protection Officer

Doing authentication of others means processing data that could be used to identify them.

From deciphering the ICO's dumbing down of the law: take the identifiers that are processed/stored/etc., such as a Firebase UID, Discord user ID (snowflake), or Discord username (including the #tag), alone or combined. Add any information that could be obtained by having those identifiers (such as using the Discord API to get a user object from an ID), and then recurse indefinitely, using that data to find new identifiers and data/information and combining all of it into a dataset. If that dataset eventually reaches the point where it can identify an individual (or be used in combination with other not-quite-personally-identifiable data to identify an individual), the original identifiers are considered personal information.

The ICO have a hard to comprehend question in their fee self-assessment:

Do you only process information for one of the following purposes?

  • personal, family or household affairs not connected to commercial or professional activities (including CCTV to monitor your domestic property, even if you are capturing images outside the boundaries of your property)

—Question 5, Registration self-assessment, Data protection fee, ICO.

Their "domestic purposes" exemption definition isn't much clearer:

Some things are not listed here as exemptions, although in practice they work a bit like an exemption. This is simply because they are not covered by the UK GDPR. Here are some examples:

  • Domestic purposes – personal data processed in the course of a purely personal or household activity, with no connection to a professional or commercial activity, is outside the UK GDPR’s scope. This means that if you only use personal data for such things as writing to friends and family or taking pictures for your own enjoyment, you are not subject to the UK GDPR.

Exemptions, Guide to the GDPR, ICO.

Both appear to be in reference to Recital 18 of the (EU) GDPR:

Recital 18, EU GDPR

(18) This Regulation does not apply to the processing of personal data by a natural person in the course of a purely personal or household activity and thus with no connection to a professional or commercial activity.

Personal or household activities could include correspondence and the holding of addresses, or social networking and online activity undertaken within the context of such activities.

However, this Regulation applies to controllers or processors which provide the means for processing personal data for such personal or household activities.

Recital 18, EU GDPR.

It's the last sentence of Recital 18 that means the exemption in Recital 18 doesn't apply. If you are an intermediary between two natural persons and are using their data to allow them to contact each other, you could be a data controller. If you're giving a friend's phone number to another friend without their consent, that might be more of a grey area.

Storing a Discord ID snowflake (and other identifiers) in order for users to conduct activities that are potentially covered by Recital 18 leans more towards the "I'm a Data Controller" side of things. Without the exemption applying I would also need to register as a Data Protection Officer and pay an annual fee. As I'm using my personal Google account for Firebase (and Discord), the Data Controller would (for now at least) be John Cook rather than John Cook Limited.

Data Protection and Privacy

The initial version of this issue is where I got up to before thinking about the DPA and GDPR. It is obvious in that version, however, that I was considering them per the Purpose Limitation data protection principle and the Consent lawful basis for collecting/using personal data. Data protection by design and default, as the ICO calls it.

I think I also covered the Rights of Individuals: right to be informed, right of access, and the rights of rectification, erasure, secure processing, data portability, and object to processing, and the rights related to automated decision making (including profiling).

Special Category Data

There is one thing that I didn't think of: special category data, such as race, political opinions, religious/philosophical beliefs, health, etc. I may not be collecting such data, but a user's data (such as their e-mail address or Discord username - identifiers are personal information, as established above) could allow an inference to be made about such special category data.

There are two things (possibly a third) needed to store special category data:

  1. Have a lawful basis under Article 6,
  2. Meet one of the specific conditions under Article 9, and potentially
  3. Have an "appropriate policy document" in place.

While I may not have actually considered the matter of such data, the first version of this issue started with two items in the task list:

  • [ ] Decide how to create a user identification system and update this task list.
  • [ ] Determine how to acquire explicit opt-in of storing identifiers, cookies, etc. through the application's UI.

Explicit consent is both a lawful basis under Article 6, and a specific condition for processing the data under Article 9, and an explicit opt-in is a synonym of explicit consent. That just leaves the question of whether I need an appropriate policy document, which the ICO says can probably be answered by conducting a Data Protection Impact Assessment.

Friends List

Given applications need whitelisting/approval for the Discord scope relationships.read, I would likely need another way of processing data in order to connect users, such as a friends list.

From what I can make out of the API documentation, the only way to DM someone is with the other user's ID snowflake, the only way to get someone's username and discriminator is with their ID snowflake, and the only way to call someone is with a DM channel ID (and getting a DM Channel ID requires the other user's ID snowflake).

For these reasons, my thinking process was that adding a friend would be as follows:

It looks like it would be feasible for Game Chat Lite to subscribe to the gateway intents DIRECT_MESSAGES, DIRECT_MESSAGE_REACTIONS, and DIRECT_MESSAGE_TYPING to identify existing DM channels and use that to populate a user list as such events should (not guaranteed) contain a user object.
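As a small illustration, the three intents named above are combined into a single bitfield when identifying with the gateway. The bit positions are the ones listed in Discord's gateway documentation; the class and function names are hypothetical and not part of Game Chat Lite.

```python
from enum import IntFlag


class GatewayIntent(IntFlag):
    """Subset of Discord gateway intents relevant to direct messages."""
    DIRECT_MESSAGES = 1 << 12
    DIRECT_MESSAGE_REACTIONS = 1 << 13
    DIRECT_MESSAGE_TYPING = 1 << 14


def dm_intents() -> int:
    """Return the combined bitfield for the three DM-related intents."""
    combined = (
        GatewayIntent.DIRECT_MESSAGES
        | GatewayIntent.DIRECT_MESSAGE_REACTIONS
        | GatewayIntent.DIRECT_MESSAGE_TYPING
    )
    return int(combined)


print(dm_intents())  # 28672
```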

watfordjc commented 2 years ago

Installation and User IDs

The data protection issues can be put aside for a while because, as with my other repositories, I might move onto another idea before I ever get to the point of implementing such a user system.

Overall User System Development Stages

For me to actually spend a bit of time on this software I need to implement at least part of the Discord API, so the development will likely go in several phases:

  1. Recital 18 applies. I am the only person using the software and I am using it in a personal capacity to talk to others on Discord.
    • DPA/GDPR - Out of Scope. Other people who build the software from source need to replicate the backend. They are not communicating with my servers. The data I store is a bit like a cross between a personal address book and a diary.
  2. Recital 18 applies. People I talk to on Discord are using the software in a personal capacity to talk to me. The software acts somewhat like an online support system locked behind an invite-only login form. The only person you can contact with it (if you're not me) is me, and everyone I don't want contacting me cannot use the software to do so. Establishing initial communication with my servers requires an invite that is impossible to obtain if I don't know you.
    • DPA/GDPR - Out of Scope. Other people who build the software from source need to replicate the backend. They are not communicating with my servers. The data I store is a bit like a personal address book.
  3. Recital 18 doesn't apply. There is an onboarding process that isn't by personal invite, there is a friends list or similar system, and users can communicate with whomever they want.
    • DPA/GDPR - In Scope. The application is no longer solely used for personal purposes.
    • DPA/GDPR - Out of Scope. Other people who build the software from source need to replicate the backend if they are not communicating with my servers.

User IDs and Device IDs: Chicken or Egg?

How do you identify a user without being able to identify a user?

OAuth2 can use a state value during authentication with the OAuth2 provider. This is generated by the relying party (the party the user wants to give permissions to), handed over during authentication with the OAuth2 provider, and handed back to the relying party (along with an authorization code that the relying party exchanges for the user's credentials and ID with the OAuth2 provider). The relying party then links whatever data they stored when they created state with the user's ID used with the OAuth2 provider.

state could just be something random that is stored and used to prevent their "login with" backend being used as an open redirect or similar service open to abuse. An example would be a JWT signed by the relying party. Because the relying party is creating, storing, and checking the value, they could say "no, I'm not generating and signing a state for you" and "no, I'm not accepting that data because I don't recognise the state value", both of which could probably be dealt with as a 403 Forbidden or something.
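A minimal sketch of the "something random that is stored" variant, assuming an in-memory store and a pre-existing session identifier to tie the state to. `StateStore`, `issue`, `redeem` and the ten minute lifetime are all hypothetical names and values. A `False` result is where the server would answer with something like 403 Forbidden.

```python
import hmac
import secrets
import time

STATE_TTL_SECONDS = 600  # assumed lifetime, not taken from any specification


class StateStore:
    """Remembers which state values this server issued, and to whom."""

    def __init__(self, ttl=STATE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._pending = {}  # state -> (session_id, expires_at)

    def issue(self, session_id: str) -> str:
        """Create an unguessable state value bound to a session."""
        state = secrets.token_urlsafe(32)
        self._pending[state] = (session_id, self._clock() + self._ttl)
        return state

    def redeem(self, state: str, session_id: str) -> bool:
        """Accept a state once, only for the session it was issued to."""
        entry = self._pending.pop(state, None)  # pop makes it single use
        if entry is None:
            return False  # "I don't recognise the state value"
        stored_session, expires_at = entry
        same_session = hmac.compare_digest(
            stored_session.encode(), session_id.encode()
        )
        return same_session and self._clock() <= expires_at
```

Because the entry is removed on the first `redeem` call, a replayed callback with the same state is rejected even if it arrives within the lifetime.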

The chicken/egg problem is in the logic. If someone wants to sign in and you have no idea who they are, how do you remember who they are for when they return from the OAuth2 provider (if they ever do)? The OAuth2 provider is going to tell you who the user is, but they might not support multiple simultaneous authorizations for the same application - i.e. the OAuth2 provider's list of applications a user has authorized will never list the same application twice because a new access token authorising access for that application supersedes/replaces any previous access token.

That means if you want to ensure the ability to identify a device/installation, you probably need to do it yourself. If an OAuth2 provider supports multiple access tokens for a user-application combination, they might not provide anything other than the tokens and state that would differentiate it from the user's other logins with your application.

I considered using Firebase anonymous authentication to identify a device: get an ID token and exchange it for a session token (HTTP JWT cookie), link the session token/UID to some random-ish state, and redirect to the OAuth2 provider for authorisation. On the redirect back, the server would check that the state is linked to the HTTP cookie and, if so, upgrade the anonymous account to a per-installation account tied to the OAuth2 provider user ID (and credentials). Because Discord isn't one of the "federated identity providers" Firebase supports, that would mean creating a custom auth provider.

That sounded like a lot of server-side crypto steps that are unlikely to be optimisable.

Snowflake IDs

Discord uses a derivation of Twitter Snowflake IDs for its user identifiers. A snowflake tends to consist of a "timestamp" component, a static component "instance" that typically identifies the unique thread that created the snowflake (e.g. a machine ID), and a counter "sequence" that increases from 0 during that timespan's period and resets to 0 when the timestamp changes (typically every millisecond).

Snowflakes are designed to be unique. They are also designed to be coarsely sortable by time - the timestamp component can be sorted, but snowflakes sharing a timestamp (or multiple timestamps if the clocks for all instances aren't very consistent) cannot be sorted into the order they were generated. Because the time component is a custom number of bits, each implementer uses different bit lengths and epochs to suit their usage. As the timestamp is typically in milliseconds since the epoch, the fewer bits given to the timestamp the sooner an overflow will be reached a la Y2K and the Year 2038 Problem.

To take dwayn/snowflaked as an example, the highest bit (big endian) is fixed at 0 to ensure it is always a positive signed number. The next 41 bits are milliseconds since the epoch: 2^41 / 1,000 ~= 2199023255 seconds ~= 25,451 days. Timestamp overflow therefore occurs around 69 years 8 months 3 weeks after the chosen epoch. There are then 4 bits for a region ID, 10 bits for a worker ID, and 8 bits for the sequence number (allowing a maximum 256 snowflake IDs to be generated per instance per millisecond).
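The bit layout described above can be sketched as follows, assuming the fields are packed from most significant to least significant in the order listed (timestamp, region, worker, sequence). The function names are hypothetical and this is not snowflaked's own code.

```python
TIMESTAMP_BITS, REGION_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 4, 10, 8

WORKER_SHIFT = SEQUENCE_BITS                   # 8
REGION_SHIFT = WORKER_SHIFT + WORKER_BITS      # 18
TIMESTAMP_SHIFT = REGION_SHIFT + REGION_BITS   # 22


def pack_snowflake(ms_since_epoch, region_id, worker_id, sequence):
    """Combine the four fields into one 63-bit positive integer."""
    fields = (
        (ms_since_epoch, TIMESTAMP_BITS),
        (region_id, REGION_BITS),
        (worker_id, WORKER_BITS),
        (sequence, SEQUENCE_BITS),
    )
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return (
        (ms_since_epoch << TIMESTAMP_SHIFT)
        | (region_id << REGION_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )


def unpack_snowflake(snowflake):
    """Split a snowflake back into (ms, region, worker, sequence)."""
    return (
        snowflake >> TIMESTAMP_SHIFT,
        (snowflake >> REGION_SHIFT) & ((1 << REGION_BITS) - 1),
        (snowflake >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        snowflake & ((1 << SEQUENCE_BITS) - 1),
    )


# Days until the 41-bit millisecond timestamp overflows (about 25,451).
OVERFLOW_DAYS = (1 << TIMESTAMP_BITS) / 1000 / 86400
```

With every field at its maximum the result is 2^63 - 1, which is why the top bit of a signed 64-bit integer stays 0.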

If snowflaked runs out of sequence numbers for a given millisecond and another snowflake ID is requested, it will wait until the next millisecond to create it. To avoid the possibility of generating a duplicate snowflake ID, it doesn't issue snowflakes with a future timestamp, nor does it issue snowflakes with a timestamp earlier than the most recently used timestamp - i.e. if NTP changes the time to a point in the past, and a snowflake ID was generated in the millisecond before NTP changed the time, it'll wait until the system clock has caught up to the previous present.
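A sketch of that waiting behaviour in a single-threaded generator, assuming the 41/4/10/8 bit layout and an arbitrary custom epoch. It is an illustration of the rules described, not a port of snowflaked. `SnowflakeGenerator`, `EPOCH_MS` and the injectable `clock_ms` parameter are hypothetical.

```python
import time

EPOCH_MS = 1_609_459_200_000  # 2021-01-01T00:00:00Z, an arbitrary assumption
SEQUENCE_MASK = 0xFF          # 8-bit sequence: 256 IDs per millisecond


def _system_ms() -> int:
    return time.time_ns() // 1_000_000


class SnowflakeGenerator:
    """Not thread-safe; a real service would guard next_id with a lock."""

    def __init__(self, region_id, worker_id, clock_ms=_system_ms):
        self._fixed = (region_id << 18) | (worker_id << 8)
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0

    def _wait_for_tick(self) -> int:
        time.sleep(0.0002)
        return self._clock_ms()

    def next_id(self) -> int:
        now = self._clock_ms()
        # Clock stepped backwards: wait until it catches up again.
        while now < self._last_ms:
            now = self._wait_for_tick()
        if now == self._last_ms:
            self._sequence = (self._sequence + 1) & SEQUENCE_MASK
            if self._sequence == 0:
                # Sequence exhausted: wait for the next millisecond.
                while now <= self._last_ms:
                    now = self._wait_for_tick()
        else:
            self._sequence = 0
        self._last_ms = now
        return ((now - EPOCH_MS) << 22) | self._fixed | self._sequence
```

Injecting the clock makes it possible to simulate an NTP step backwards without touching the system time.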

Based on snowflaked is the redis module erans/redissnowflake, which adds the command snowflake.getid to redis. Generating a device/user ID as a snowflake, using the current time, two constants, a counter, some variables and if() checks, and some bit shifts and boolean maths, should be faster and more unique than my contemplation of using V1/V4 UUIDs.

state

If there isn't a user ID for the user/device, a state needs tying to a new snowflake ID. Upon a redirect returning from the OAuth2 provider, the state needs verifying against the user ID. The question that needs pondering is what state should be, how to efficiently generate it, how to store and retrieve it, how to persist it if whatever is storing it crashes or is restarted, and how to limit its lifespan to both single use and age limit.

A SHA-256 hash of something with a suitable amount of entropy, or an HMAC/JWT, sounds suitable here. It needs replay protection.
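One possible shape for an HMAC-based state, sketched under assumptions: the payload format (state ID, device ID and expiry joined with colons), the base64url encoding and the function names are all hypothetical. The signature and expiry checks only limit the age of a state. Single use still needs a server-side record of states that have already been redeemed.

```python
import base64
import hashlib
import hmac
import time


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_state(key: bytes, state_id: int, device_id: int, expires_at: int) -> str:
    """Return 'payload.tag', where tag is HMAC-SHA256 over the payload."""
    payload = f"{state_id}:{device_id}:{expires_at}".encode("ascii")
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(tag)}"


def verify_state(key: bytes, state: str, device_id: int, now=None):
    """Return the state ID if the state is authentic, bound to this
    device and unexpired; otherwise return None."""
    try:
        payload_b64, tag_b64 = state.split(".")
        payload = _unb64(payload_b64)
        tag = _unb64(tag_b64)
        state_id, signed_device, expires_at = (
            int(part) for part in payload.decode("ascii").split(":")
        )
    except (ValueError, UnicodeDecodeError):
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    if signed_device != device_id:
        return None
    if (time.time() if now is None else now) > expires_at:
        return None
    return state_id
```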

The basic flow is: the user makes an interaction to log in, the user connects to the server, the server does something to remember the user and redirects to the OAuth2 provider, the user authenticates with the OAuth2 provider and grants/denies permissions, and the OAuth2 provider redirects back to the server. The server then connects to the OAuth2 server itself to acquire credentials, which it then gives to the user.
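The two redirects in that flow can be sketched with the standard OAuth2 authorization code parameters (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`, `code`). The provider URL below is a placeholder rather than a real endpoint, and the function names are hypothetical. The server-to-server exchange of the code for credentials is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: substitute the provider's documented URL.
AUTHORIZE_URL = "https://provider.example/oauth2/authorize"


def build_authorize_url(client_id, redirect_uri, scope, state):
    """URL the server redirects the user to in order to grant access."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_URL}?{query}"


def parse_callback(callback_url, expected_state):
    """Return the authorization code from the provider's redirect back,
    or None if the state does not match or no code is present."""
    params = parse_qs(urlsplit(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        return None
    codes = params.get("code")
    return codes[0] if codes else None
```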

Identifiers and Cookies

Device ID

The device snowflake ID will probably be transient (session based cookie?) unless the user grants permissions for Game Chat Lite to access user data, where it then becomes an identifier linking a Discord user ID with a user/device.

The device snowflake ID is used for Game Chat Lite servers to identify the unique device/device-user combination and is signed by the server, either in a new_session cookie or converted to Firebase credentials and stored in the application's local storage.

State ID

The state snowflake ID is transient (can only be used once and it also expires), and it contains no useful data. It is linked to the device snowflake ID, but such linking is self-contained in the state variable and either the new_session cookie or the application's local storage.

The OAuth2 provider could link it to their dataset on the user, but since OAuth2 is being used for the user to give data access permissions to Game Chat Lite the OAuth2 provider will already know (via the authenticating user) who the user attempting to authenticate is and (via the application's client_id) what application the user is obtaining an authorization code for.

Put another way, this identifier's only purpose is to ensure a user-agent returning credentials from the OAuth2 provider is the user-agent used by the user and that the request was initiated via the application server, offering some minimal protection against replay attacks and some MITM attacks.

The state snowflake ID will also be stored in a blacklist on the server once it has been used until its validity window closes and it gets deleted. This means that once the HMAC contained in the state variable has been validated by the application server the symmetric key used to sign it becomes invalid for both signing and verifying.
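A minimal sketch of such a blacklist, assuming an in-memory dictionary and that each state carries its own expiry time. `UsedStateList` and `mark_used` are hypothetical names. Persisting the list across restarts, which the state section raises as an open question, is not addressed here.

```python
import time


class UsedStateList:
    """Tracks redeemed state IDs until their validity window closes."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._used = {}  # state_id -> expires_at

    def _purge(self) -> None:
        """Drop entries whose validity window has already closed."""
        now = self._clock()
        expired = [sid for sid, exp in self._used.items() if exp < now]
        for sid in expired:
            del self._used[sid]

    def mark_used(self, state_id: int, expires_at: float) -> bool:
        """Return True on first use inside the validity window.
        Return False for a replay or for an expired state."""
        self._purge()
        if self._clock() > expires_at:
            return False
        if state_id in self._used:
            return False
        self._used[state_id] = expires_at
        return True
```

Entries only need to live as long as the state itself could still pass its expiry check, so the list stays small.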

User ID

To be determined.

watfordjc commented 2 years ago

Deterministic Symmetric Encryption Keys for HMAC

The idea to use SLIP-0021 derivation paths for deterministic symmetric keys for the state query parameter is borrowed from my LTO Encryption Keys wiki page of my (still not fully implemented) watfordjc/backup policy.

Although CPU usage of Discord and Windows sleep issues currently have my occasional attention, in order to use SLIP-0021 derivation paths and deterministic keys I need a way to create the derivation key for the master node, which means I need some form of master secret.

Backup Policy and LTO Tape Keys

My backup policy proposes to use BIP-0039 (Bitcoin wallet mnemonic code) and SLIP-0021's usage of BIP-0032 (Hierarchical & Deterministic Bitcoin wallet) to create a deterministic SLIP-0021 master secret from a BIP-0039 mnemonic. The code for doing so has mostly been implemented in watfordjc/LTO-Encryption-Manager, but I haven't got around to storing user preferences, securely storing the derivation key of the node below the master node, or connecting it to watfordjc/LTO-Encryption-SPTI (which should probably partly become a library) and setting the encryption key on an LTO tape drive.

At present, LTO-Encryption-Manager doesn't do much. It has two input boxes with buttons (hexadecimal entropy, and BIP-0039 mnemonic seed), and pressing the button for one fills in the text box for the other. The entropy/seed is then turned into a hexadecimal representation and populates the hexadecimal seed text box, and then the other text boxes are populated with the values for the master node derivation key, master node symmetric key, and the symmetric keys from the derivation paths m/"SLIP-0021", m/"SLIP-0021"/"Master encryption key", and m/"SLIP-0021"/"Authentication key".

For LTO tapes, the plan is to transfer the symmetric key for the inserted tape to the drive by using RSA-2048 wrapped AES keys (RFC 3447). This ensures that only the drive can read the key, although the drive cannot determine who sent it the key: the drive's public key is used to wrap the key, and my drive doesn't appear to support the adding/importing of public keys and certificates, although it might be possible using TLS or ADT (would need to make a custom Ethernet or serial/ADT cable, respectively).

Master Node Derivation Key

The derivation key for the master node should probably be kept offline, or not kept at all. This key can create all derivation and symmetric keys for all nodes below it. With SLIP-0021 saying a first-level label MUST identify the purpose and layout of the nodes below it, it would perhaps make more sense to require the BIP-0039 mnemonic code whenever new first-level labels are created.

The application/utility could temporarily create the master node, derive and store the derivation keys for the new first-level labels, and then clear everything used to derive those keys (master node derivation key, master node secret, PBKDF2 inputs, user input).
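A sketch of that temporary derivation using only the Python standard library. The BIP-0039 seed step (PBKDF2-HMAC-SHA512, 2048 iterations, salt "mnemonic" plus passphrase) and the SLIP-0021 node formulas follow the published specifications. The function names are hypothetical, and the two labels are the reverse DNS examples used elsewhere in this issue. The mnemonic checksum is not validated, and Python cannot guarantee that intermediate secrets are wiped from memory, so this only shows the data flow.

```python
import hashlib
import hmac
import unicodedata

# Public BIP-0039 test mnemonic; never use it for real keys.
TEST_MNEMONIC = " ".join(["abandon"] * 11 + ["about"])
LABELS = [
    "uk.johncook.slip-0021.lto-aes256-gcm",
    "uk.johncook.slip-0021.snowflake-hmac-sha256",
]


def bip39_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """BIP-0039: stretch the mnemonic into a 64-byte seed."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


def slip21_master(seed: bytes) -> bytes:
    """SLIP-0021 master node: 32-byte derivation key + 32-byte symmetric key."""
    return hmac.new(b"Symmetric key seed", seed, hashlib.sha512).digest()


def slip21_child(node: bytes, label: str) -> bytes:
    """SLIP-0021 child node, keyed by the parent's derivation key."""
    message = b"\x00" + label.encode("ascii")
    return hmac.new(node[:32], message, hashlib.sha512).digest()


def first_level_derivation_keys(mnemonic, labels, passphrase=""):
    """Return only the first-level derivation keys. The seed and master
    node are local variables that go out of scope on return."""
    master = slip21_master(bip39_seed(mnemonic, passphrase))
    return {label: slip21_child(master, label)[:32] for label in labels}
```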

Master Node's Children

SLIP-0021 says the first level below the master node should be descriptive and define the tree of all nodes below it. In other words: it should be unique.

The child nodes of a parent node N are identified by a variable-length byte string called a label. The labels of all nodes which are derived from the master node, i.e., the first-level labels, MUST identify the purpose of the subordinate nodes. The purpose determines the further structure beneath the node. This label must be sufficiently unique to avoid collisions between applications. Examples include the ASCII encoding of the strings "BIP-9999", "SLIP-9999" or "FIDO2 Trezor Credential ID".

Child Node Derivation, SLIP-0021

Without a BIP, SLIP, RFC/STD, IANA database, or trademark, a unique purpose string could be problematic. Usually the preferred option to avoid namespace collisions would be to use a UUID, or use reverse DNS notation with a domain you control. I would probably opt for reverse DNS notation, so m/"SLIP-0021"/"LTO AES256-GCM" would become something like m/"uk.johncook.slip-0021.lto-aes256-gcm".

Derivation Paths and SLIP-0021 First-Level Label

m/"uk.johncook.slip-0021.snowflake-hmac-sha256"/"0"/"1"/"1"/"2021-09-01"/"0"/"70955254678034981", where:

There is no node defined below the snowflake ID node for key rollovers because this scheme is defined for server-verifiable signing of data related to a transient server-generated snowflake ID.

The second-level rollover counter is so derivation key rollovers can occur at the software that generates the keys for all servers.

The sixth-level rollover counter is so that derivation key rollovers can occur at the software that generates the keys for snowflake IDs. For example, if worker IDs 1 through 5 in region ID 2 and worker ID 7 in region 5 need rolling over for some reason, the software that generates server keys does not need to go through a key rollover itself (which would potentially require mnemonic seed input).
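To illustrate how such a path resolves to an HMAC key, the sketch below walks the example labels one level at a time with the SLIP-0021 child formula and signs some data with the final node's symmetric key. The all-zero seed is a placeholder, the split point where a server takes over derivation is an assumption, and the function names are hypothetical. The labels are treated as opaque strings.

```python
import hashlib
import hmac

PATH = [
    "uk.johncook.slip-0021.snowflake-hmac-sha256",
    "0", "1", "1", "2021-09-01", "0", "70955254678034981",
]


def slip21_master(seed: bytes) -> bytes:
    return hmac.new(b"Symmetric key seed", seed, hashlib.sha512).digest()


def slip21_child(derivation_key: bytes, label: str) -> bytes:
    """Child node = HMAC-SHA512(parent derivation key, 0x00 || label)."""
    message = b"\x00" + label.encode("ascii")
    return hmac.new(derivation_key[:32], message, hashlib.sha512).digest()


def derive(node: bytes, labels) -> bytes:
    """Walk down the tree; only node[:32] is needed at each step."""
    for label in labels:
        node = slip21_child(node[:32], label)
    return node


def sign(node: bytes, data: bytes) -> bytes:
    """HMAC-SHA256 keyed with the node's symmetric key (node[32:])."""
    return hmac.new(node[32:], data, hashlib.sha256).digest()


PLACEHOLDER_SEED = bytes(64)  # all zeros, for illustration only
master = slip21_master(PLACEHOLDER_SEED)

# Assumed split: key-generating software derives the first six levels
# offline, and the server derives the per-snowflake node itself.
server_node = derive(master, PATH[:6])
snowflake_node = derive(server_node, PATH[6:])
tag = sign(snowflake_node, b"example state payload")
```

Because each step depends only on the parent's derivation key, a server holding an intermediate derivation key can produce every key beneath it but nothing above or beside it.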

Enhancing LTO-Encryption-Manager

I think the derivation keys used by the server(s) should be derived on a local machine, somehow signed and encrypted so that only the server the key is for can decrypt it, and then uploaded to the server.

Storing the derivation key for the first-level label (or perhaps the second-level label if the derivation path scheme assumes the need for key rollover support) will need to be done for all SLIP-0021 purposes I have, whether it be LTO tape encryption keys or snowflake HMAC keys. It therefore makes sense to add the needed support to LTO-Encryption-Manager and to work out how I'm going to store the derivation keys securely.

This has been added as issue watfordjc/LTO-Encryption-Manager#1.