Proof of concept: Ingesting data to produce a time-information-record with multiple views

alexbfree commented 1 year ago

[Technology push idea: yet to be validated by bizdev/Paul]

Based on what I have understood about our need to support time-series data, i.e. displaying events or periods of time (such as mapping out the recorded periods within an Uber driver's day), potentially from multiple data sources, I propose the following:

That we should build a simple proof of concept, in Jupyter or as an experience, which can do the following:

Import 2-3 different sources (maybe an Uber file, a Google location history file and a Twitter file?)
From those imported data files, populate a common time-based information record, where everything is a time-indexed event (1-event) or period (2-event/n-event)
Produce at least two views (tabular list + lines on a geo map?) of the combined information.

This would allow us to hit a lot of the major points of exploration and to explore ideas around semanticisation, separating data parsing from information management/display, etc. In particular:

parsing data to extract information
identifying and codifying information into a common schema despite it coming from different sources/files
supporting different views (not just timeline) of time-based data
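
To make the idea concrete, here is a minimal sketch of what such a common record and two views could look like. Everything in it is an assumption for illustration: the `TimeRecord` class, the parser functions and the input field names (`begin`, `end`, `created_at`, `lat`, `lon`) are hypothetical and do not reflect the real Uber, Google or Twitter export formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimeRecord:
    """One entry in the common record: an instant (1-event) or a period (2-event)."""
    source: str                      # e.g. "uber", "google", "twitter"
    label: str                       # short human-readable description
    start: datetime
    end: Optional[datetime] = None   # None means an instantaneous event
    lat: Optional[float] = None
    lon: Optional[float] = None

    @property
    def is_period(self) -> bool:
        return self.end is not None


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and normalise to UTC."""
    return datetime.fromisoformat(value).astimezone(timezone.utc)


# Source-specific parsers: the only place that knows about file formats.
# Field names are hypothetical; real exports will differ.
def from_uber_trip(row: dict) -> TimeRecord:
    return TimeRecord("uber", "trip", parse_ts(row["begin"]), parse_ts(row["end"]),
                      row.get("lat"), row.get("lon"))


def from_tweet(row: dict) -> TimeRecord:
    return TimeRecord("twitter", "tweet", parse_ts(row["created_at"]))


# Views: they only see TimeRecord, never the original files.
def as_table(records):
    """Tabular view: one (start, end, source, label) row per record, by start time."""
    rows = sorted(records, key=lambda r: r.start)
    return [(r.start.isoformat(), r.end.isoformat() if r.end else "", r.source, r.label)
            for r in rows]


def as_map_points(records):
    """Geo view input: time-ordered (lat, lon) points for records with a location."""
    ordered = sorted(records, key=lambda r: r.start)
    return [(r.lat, r.lon) for r in ordered if r.lat is not None and r.lon is not None]


records = [
    from_uber_trip({"begin": "2021-05-01T09:00:00+00:00",
                    "end": "2021-05-01T09:20:00+00:00",
                    "lat": 46.2, "lon": 6.14}),
    from_tweet({"created_at": "2021-05-01T08:55:00+00:00"}),
]
for row in as_table(records):
    print(row)
```

The point of the split is that parsers and views only meet at `TimeRecord`, so adding a third source or a third view should not require touching the other side.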

What does Paul/bizdev think of this?

pdehaye commented 1 year ago

How is this different from what @emmanuel-hestia is already doing? https://github.com/hestiaAI/clients/issues/35

alexbfree commented 1 year ago

I hadn't seen https://github.com/hestiaAI/clients/issues/43 - but yes, you are right, it has elements of both https://github.com/hestiaAI/clients/issues/35 and https://github.com/hestiaAI/clients/issues/43.

I think it's maybe a slightly higher level than https://github.com/hestiaAI/clients/issues/35 but less broad than https://github.com/hestiaAI/clients/issues/43, and as such is more actionable, more end-user focused, and more deliverable.

I think this proposal is:

well constrained/simplified
a more tightly defined deliverable, focused on proving concepts and exploring approaches

pdehaye commented 1 year ago

https://github.com/hestiaAI/clients/issues/43 has the advantage of not being target-fixated on time series data, but I can see the advantage of a tightly defined deliverable.

Hence one option is to do this issue four times over, for four distinct datasets (picked from the gems, or what is about to become a gem):

Uber data and trip-by-trip accounting https://github.com/hestiaAI/clients/issues/35
Gmaps reconciliation of the different datasets (advantage is that we already have the data structure pretty much for free, but we want to enrich the visualization capabilities - some work was required by Thomas and is thus already in JS and even in Jupyter)
Twitter file reconciliation between ad impressions and engagements (same ad within a short timespan that appears in both, this is what helped @fquellec understand the abstraction I am after across Twitter and Uber data)
Google Mobile Services audit through App Audit for different apps. https://github.com/hestiaAI/Mobile-App-Auditing/issues/7

Each has the advantage of being focused, but there should definitely be patterns emerging. In particular the first two have a very present geographic component (cityscape) while the other two have a very present infoscape component (in different ways: I navigate the space of tweets versus my data navigates an annotated TrackerControl map upstream of Google, essentially).
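
As an illustration of the Twitter reconciliation item above (the same ad appearing in both impressions and engagements within a short timespan), a rough sketch could look like the following. The `(ad_id, timestamp)` tuple shape, the `reconcile` function and the 5-minute default window are all assumptions made for the example, not the actual Twitter file layout or an agreed threshold.

```python
from datetime import datetime, timedelta


def reconcile(impressions, engagements, window=timedelta(minutes=5)):
    """Pair each engagement with the nearest impression of the same ad within `window`.

    Both inputs are lists of (ad_id, timestamp) tuples (a hypothetical shape).
    Returns a list of (ad_id, impression_time, engagement_time) tuples.
    """
    pairs = []
    for eng_ad, eng_time in engagements:
        # Candidate impressions: same ad, close enough in time (before or after).
        candidates = [
            (abs(eng_time - imp_time), imp_time)
            for imp_ad, imp_time in impressions
            if imp_ad == eng_ad and abs(eng_time - imp_time) <= window
        ]
        if candidates:
            _, imp_time = min(candidates)  # smallest time gap wins
            pairs.append((eng_ad, imp_time, eng_time))
    return pairs


impressions = [
    ("ad-1", datetime(2021, 5, 1, 10, 0, 0)),
    ("ad-2", datetime(2021, 5, 1, 10, 1, 0)),
    ("ad-1", datetime(2021, 5, 1, 12, 0, 0)),
]
engagements = [
    ("ad-1", datetime(2021, 5, 1, 10, 0, 30)),
    ("ad-2", datetime(2021, 5, 1, 11, 0, 0)),   # too far from its impression
]
matched = reconcile(impressions, engagements)
print(matched)
```

The same "match by key within a time window" pattern is presumably what would carry over to the Uber trip-by-trip case, with a trip or driver identifier in place of the ad id.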

alexbfree commented 1 year ago

I can see the merit of splitting it up, allowing us to find the richest data from each specific provider. However, something is lost by splitting it up: if each one is handled separately, we risk each instance becoming too focused on the specifics of an individual case.

I think as an organisation we need to begin to make a shift from

"let's show you everything we can from a set of data files from one company"

to

"let's construct a view of information about an individual day/session/trip/work shift etc, drawn from multiple data sources from different companies"

In other words, if we don't handle more than one company's source in the same ticket, then we bypass the opportunity to look at modelling information from multiple sources into a common format. And I think that is a really important step.
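
A rough sketch of what "a view of an individual day/shift drawn from multiple sources" could mean in practice, assuming records have already been brought into a common shape. The plain-dict shape with `source`, `label`, `start` and optional `end` keys, and the `view_for_window` helper, are hypothetical and only here to illustrate the selection step.

```python
from datetime import datetime


def overlaps(start, end, window_start, window_end):
    """True if the record intersects [window_start, window_end).

    Instantaneous events have no end, so they are treated as zero-length periods.
    """
    end = end or start
    return start < window_end and end >= window_start


def view_for_window(records, window_start, window_end):
    """All records from any source that touch the window, ordered by start time."""
    selected = [
        r for r in records
        if overlaps(r["start"], r.get("end"), window_start, window_end)
    ]
    return sorted(selected, key=lambda r: r["start"])


records = [
    {"source": "twitter", "label": "tweet", "start": datetime(2021, 5, 3, 9, 0)},
    {"source": "google", "label": "location fix", "start": datetime(2021, 5, 2, 8, 0)},
    {"source": "uber", "label": "trip",
     "start": datetime(2021, 5, 1, 23, 50), "end": datetime(2021, 5, 2, 0, 10)},
]

# Everything that happened on 2 May 2021, whichever company it came from.
day = view_for_window(records, datetime(2021, 5, 2), datetime(2021, 5, 3))
for r in day:
    print(r["source"], r["label"], r["start"])
```

Note that the trip starting late on 1 May is included because it runs past midnight, which is the kind of case a per-company, per-file view would be likely to miss.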

This also has UX implications. For example, how does our Upload page change once it has to support uploads from multiple different companies? (Presumably each has to have a separate upload box, or everything goes in the same upload box but with company types/labels, or something.)

So I think if we split it into 4 we would also need to have a fifth ticket to unify data from different sources. And that might require some rework of the work done for the four subtickets described above.

Perhaps another way to do this is that we have one ticket to come up with a unified time-based data model for different companies' data sources, then the 4 tickets you describe (which would then all use that common model), then a sixth ticket which does the unification.

I suppose the other approach is that we just do the 4 tickets described above, without any overlap or commonality, making each one the best form it can be, but accept they are proofs of concept that inform a secondary piece of unification modelling/UX work.

alexbfree commented 1 year ago

(and yeah, I take the point that this ticket focuses on time and there are other focuses - we could imagine parallel/equivalent efforts to this one that have other focuses, e.g. a region, a person, etc.)

alexbfree commented 1 year ago

"(picked from the gems, or what is about to become a gem)"

Do we have a list somewhere?

alexbfree commented 1 year ago

One important strategic question that should precede this ticket is now documented in #65.

alexbfree commented 1 year ago

@fquellec mentions this is similar to another recent issue about unifying schemas, which would enable things like this one: hestiaAI/clients/issues/43

alexbfree commented 1 year ago

Put on hold until 43 is finished

hestiaAI / hestialabs-experiences

Proof of concept: Ingesting data to produce a time-information-record with multiple views #1036