Extending Data Logging - Githubissues

I have looked into both how Google recommends storing data in Firebase and what kinds of events and data the program currently generates. Google's main recommendation is to keep the structure as flat as possible and to denormalize it when needed.

This is roughly the structure I came up with. It has four main data types:

Users: The anonymous users
Interfaces: The different test interfaces
Cycles: A set of actions from randomly generating a target location to the user correctly moving the end effector there
Actions: A single action or state change that takes place as part of a cycle

{
  // Users contains only meta info about each user such as session start time
  //  and any other information we need to save about them
  // This is stored under each user's unique ID
  "users": {
    "u1": {
      "date": "Wed Jun 10 2020",
      "time": "11:45:13 GMT-0700 (Pacific Daylight Time)",
      "timestamp": 1591814713656
    },
    "u2": { ... },
    "u3": { ... }
  },

  // Each interface type is split up into users by user ID
  // which each contain a list of cycle IDs
  "interfaces": {
    "arrow-press/release": {
      "u1": {
        "c1": {
          "totalCycleTime": 2, // Seconds
          "numberOfClicks": 4
        },
        "c2": { ... },
        "c3": { ... }
      },
      "u2": { ... },
      "u3": { ... }
    },
    "drag-click": { ... },
    "panel-press/release": { ... }
  },

  // Cycles and actions are separated from users
  // so we can still query individual events easily
  "cycles": {
    "c1": {
      "a1": {
        "date": "Thu Jun 27 2019",
        "eventName": "ring-release",
        "newState": "cursor-free",
        "prevState": "rotating",
        "time": "16:40:27 GMT-0700 (Pacific Daylight Time)",
        "timeStamp": 1561678827981
      },
      "a2": { ... },
      "a3": { ... }
    },
    "c2": { ... },
    "c3": { ... }
  }
}

User, action, and cycle IDs would probably be randomly generated by Firebase; I just used easier-to-read strings here.

This structure should let us query cycles fairly easily and efficiently, either by interface or by user. We might want to change it if we end up needing to query the database differently. I would love any feedback you have on this. If it sounds good to you, I can start implementing it!
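To make the cross-referencing in this flat layout concrete, here is a minimal sketch of how one finished cycle could be turned into a single path-to-value map. As far as I know, this is the shape that the Realtime Database `update()` call accepts for multi-location writes. The names `sanitizeKey`, `buildCycleUpdates`, `FlatAction`, and `CycleSummary` are hypothetical and not part of the existing code. One assumption worth flagging: Realtime Database keys cannot contain `/` (or `.`, `$`, `#`, `[`, `]`), so an interface name like "arrow-press/release" would need to be escaped before it is used as a key.

```typescript
// Sketch: build a multi-path update for the flat layout described above.
// All names here are hypothetical and only illustrate the idea.

interface FlatAction {
  eventName: string;
  prevState: string;
  newState: string;
  timeStamp: number; // milliseconds since the Unix epoch
}

interface CycleSummary {
  totalCycleTime: number; // seconds
  numberOfClicks: number;
}

// Replace characters that are not allowed in Realtime Database keys.
function sanitizeKey(name: string): string {
  return name.replace(/[.$#\[\]\/]/g, "_");
}

// Returns a path -> value map: the cycle summary goes under "interfaces",
// and each action goes under "cycles", so both are written together.
function buildCycleUpdates(
  interfaceName: string,
  userId: string,
  cycleId: string,
  summary: CycleSummary,
  actions: Record<string, FlatAction>
): Record<string, CycleSummary | FlatAction> {
  const updates: Record<string, CycleSummary | FlatAction> = {};
  updates[`interfaces/${sanitizeKey(interfaceName)}/${userId}/${cycleId}`] = summary;
  for (const [actionId, action] of Object.entries(actions)) {
    updates[`cycles/${cycleId}/${actionId}`] = action;
  }
  return updates;
}

const updates = buildCycleUpdates(
  "arrow-press/release",
  "u1",
  "c1",
  { totalCycleTime: 2, numberOfClicks: 4 },
  {
    a1: {
      eventName: "ring-release",
      prevState: "rotating",
      newState: "cursor-free",
      timeStamp: 1561678827981,
    },
  }
);
console.log(Object.keys(updates));
```

Writing both locations from one map is what keeps the two copies of the cycle ID from drifting apart, which is the main bookkeeping cost of the flat layout.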

Nice! I think this makes sense but I do have some notes/questions.

Anonymous authentication through Firebase returns a unique user ID, so we can use that as the key for each user under "users".
From an experimental perspective, I think it makes sense to test each user on only one interface (i.e., run a "between-participants" experiment), because it would be hard to unlearn one interface and switch to another. That might simplify things a bit, since there would be only one interface per user.
My first instinct would have been to log everything in one place, as opposed to this flatter structure with cross-references (for instance, starting a new cycle logs it under "interfaces" and also creates a new entry for it under "cycles"). For example:

{
  "users": {
    "u1": {
      "date": "Wed Jun 10 2020",
      "time": "11:45:13 GMT-0700 (Pacific Daylight Time)",
      "timestamp": 1591814713656,
      "cycles": {
        "c1": {
          "a1": { ... },
          "a2": { ... },
          ...
        },
        "c2": { ... },
        ...
      }
      }
    },
    "u2": { ... },
    "u3": { ... }
  }
}

In other words, nest more deeply. It sounds like that goes against the advice you have read, and I'm curious what the reasoning is. My reasoning would be to keep it simple and avoid the risk of overwriting things: you'd only ever interact with that user's subtree during that user's session (in fact, you can set up database permissions so that the anonymous user has write access only to their own subtree). It might also be easier to reconstruct the data. But I'm totally open to being convinced otherwise! Also, I really appreciate you putting thought into this up front, though it is totally fine to start with something and revise later (that has happened a lot in other projects).
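As a rough sketch of the "only touch your own subtree" idea, the snippet below builds the write path for an action under the nested layout and expresses a matching permission rule as a plain object in the Realtime Database rules format. The names `actionPath` and `userScopedRules` are hypothetical, and leaving reads closed is an assumption made for this example rather than a decision from this thread.

```typescript
// Sketch: keep every write inside the signed-in user's own subtree.
// actionPath and userScopedRules are hypothetical names.

// Path of a single action under the nested layout shown above.
function actionPath(uid: string, cycleId: string, actionId: string): string {
  return `users/${uid}/cycles/${cycleId}/${actionId}`;
}

// Rules object: a signed-in user may write only below users/<their own uid>.
// No ".read" rule is given, so reads stay denied (an assumption for this sketch).
const userScopedRules = {
  rules: {
    users: {
      $uid: {
        ".write": "auth != null && auth.uid === $uid",
      },
    },
  },
};

console.log(actionPath("u1", "c1", "a1"));
console.log(JSON.stringify(userScopedRules, null, 2));
```

With a rule like this, a bug in one session could at worst corrupt that session's own data, which is the overwrite protection described above.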

Other than that, I think the other question to discuss is the list of events to be logged, given what we might want to measure and what actions the different interfaces afford. Perhaps a separate issue, or we can continue in this thread ;) Another related todo would be to actually write/test the scripts that will post-process the database to compute our measurements--that process always reveals something we forgot to log ;)
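To give a feel for what such a post-processing script might look like, here is a minimal sketch that walks an exported nested tree and derives the number of actions and the elapsed time for each cycle. It assumes the nested `users/<uid>/cycles/<cycleId>/<actionId>` layout and the `timeStamp` field from the examples above; `summarizeCycles`, the type names, and the sample event name "ring-press" are made up for illustration.

```typescript
// Sketch of a post-processing pass over an exported nested tree.
// summarizeCycles and the types below are hypothetical.

interface LoggedAction {
  eventName: string;
  timeStamp: number; // milliseconds since the Unix epoch
}

type CycleLog = Record<string, LoggedAction>; // actionId -> action

interface UserLog {
  cycles?: Record<string, CycleLog>; // cycleId -> actions
}

interface CycleStats {
  userId: string;
  cycleId: string;
  numberOfActions: number;
  totalCycleTime: number; // seconds between first and last action
}

function summarizeCycles(users: Record<string, UserLog>): CycleStats[] {
  const stats: CycleStats[] = [];
  for (const [userId, user] of Object.entries(users)) {
    for (const [cycleId, cycle] of Object.entries(user.cycles ?? {})) {
      const times = Object.values(cycle).map((a) => a.timeStamp);
      if (times.length === 0) continue; // skip cycles with no logged actions
      const elapsedMs = Math.max(...times) - Math.min(...times);
      stats.push({
        userId,
        cycleId,
        numberOfActions: times.length,
        totalCycleTime: elapsedMs / 1000,
      });
    }
  }
  return stats;
}

// Made-up sample data in the nested layout.
const exported: Record<string, UserLog> = {
  u1: {
    cycles: {
      c1: {
        a1: { eventName: "ring-press", timeStamp: 1561678825981 },
        a2: { eventName: "ring-release", timeStamp: 1561678827981 },
      },
    },
  },
};
console.log(summarizeCycles(exported));
```

Even a toy pass like this shows a gap: without an explicit "cycle started" event, the elapsed time only covers the span between the first and last user action.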

I think I know how to move forward -- what would be helpful for me is to know what level of autonomy you prefer me to have. Would you like me to go ahead and implement my ideas informed by your input, or should we discuss the changes that I have in mind? I want to make sure that I'm not bothering you unnecessarily, but that I still get your input.

In case you’d like to discuss, I’ve detailed the ideas below:

I think the changes you suggested make sense. Google's main reason for keeping data flat is to make queries faster. For example, if you wanted to know how many users were in the database, you could just query "users" without also downloading all of their cycles and actions. There is more reasoning and some examples here: https://firebase.google.com/docs/database/web/structure-data#avoid_nesting_data. Basically, keeping the structure flat can reduce query sizes when you have a lot of data.
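A small sketch of that trade-off, using made-up data: counting users gives the same answer in both layouts, but in the nested layout the "users" node carries every cycle and action along with it. The helper `payloadChars` and the sample values are hypothetical, and the serialized JSON length is only a rough stand-in for download size.

```typescript
// Sketch: same user count, very different payload for the "users" node.
// payloadChars and the sample trees are hypothetical.

const sampleAction = {
  eventName: "ring-release",
  prevState: "rotating",
  newState: "cursor-free",
  timeStamp: 1561678827981,
};

// Flat layout: "users" holds only per-user metadata.
const flatUsers = {
  u1: { timestamp: 1591814713656 },
  u2: { timestamp: 1591814713656 },
};

// Nested layout: "users" also holds every cycle and action.
const nestedUsers = {
  u1: { timestamp: 1591814713656, cycles: { c1: { a1: sampleAction, a2: sampleAction } } },
  u2: { timestamp: 1591814713656, cycles: { c1: { a1: sampleAction, a2: sampleAction } } },
};

// Rough proxy for how much data a read of the node would transfer.
function payloadChars(node: unknown): number {
  return JSON.stringify(node).length;
}

const flatCount = Object.keys(flatUsers).length;
const nestedCount = Object.keys(nestedUsers).length;
console.log(flatCount, payloadChars(flatUsers));
console.log(nestedCount, payloadChars(nestedUsers));
```

The gap grows with the number of cycles per user, which only matters if the "users" node is read often.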

However, if I understand correctly, we will probably run only one or two big queries after the study. It doesn't seem like the website will be regularly querying data from Firebase, so it might not be worth implementing a more complicated structure than we need.

For now, I might go ahead with the structure that you suggested. If it turns out that we do have to make a lot of smaller queries, we can always switch to a slightly different structure later, as you pointed out.

In terms of what would be logged, I noticed that each control type has a set of transitions and actions that it emits using the handleEvent function. I figured that all of those could be stored in the database as actions. From what I can tell, that seems like the right amount of data to store. I don't think we want the coordinate path of every single mouse movement, but I assume that we want more information than just the time a cycle took, for instance. From my testing, somewhere between 2 and 20+ actions are generated per cycle (the exact number depends on the interface). For each action, I think that it makes sense to store:

Previous State
Action Name
New State
Timestamp
XY Location (I'm not sure if this is necessary)

I would also count the placement of the end effector and target, and the moment the user successfully aligns them, as events. Some of that data could also be included as cycle metadata.
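The per-action fields listed above could be captured in a record like the following. This is only a sketch: `ActionRecord`, `Point`, and `makeActionRecord` are hypothetical names, the way handleEvent would supply these values is an assumption, and the XY location is kept optional because it is not yet clear whether it is needed. As far as I know, the Realtime Database rejects `undefined` values, so the location key is left out entirely when no location is given.

```typescript
// Sketch of a per-action record with the fields proposed above.
// ActionRecord, Point, and makeActionRecord are hypothetical names.

interface Point {
  x: number;
  y: number;
}

interface ActionRecord {
  prevState: string;
  eventName: string;
  newState: string;
  timeStamp: number; // milliseconds since the Unix epoch
  location?: Point; // optional XY location
}

function makeActionRecord(
  prevState: string,
  eventName: string,
  newState: string,
  timeStamp: number,
  location?: Point
): ActionRecord {
  const record: ActionRecord = { prevState, eventName, newState, timeStamp };
  // Only add the key when a location exists, so no undefined value is stored.
  if (location !== undefined) {
    record.location = { x: location.x, y: location.y };
  }
  return record;
}

const withoutLocation = makeActionRecord("rotating", "ring-release", "cursor-free", 1561678827981);
const withLocation = makeActionRecord("rotating", "ring-release", "cursor-free", 1561678827981, { x: 120, y: 45 });
console.log(withoutLocation, withLocation);
```

In a real session the timestamp would come from something like `Date.now()` at the moment the transition fires; it is passed in here so the example stays deterministic.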

mayacakmak / se2

Extending Data Logging #3