hestiaAI / hestialabs-experiences

HestiaLabs Data Experiences & Digipower Academy
https://digipower.academy

Proof of concept: Ingesting data to produce a time-information-record with multiple views #1036

Open alexbfree opened 1 year ago

alexbfree commented 1 year ago

[Technology push idea: yet to be validated by bizdev/Paul]

Based on what I have understood about our need to support time-series data, i.e. displaying events or periods of time (such as mapping out the recorded periods within an Uber driver's day), potentially from multiple data sources, I propose the following:

That we should build a simple proof of concept, in Jupyter or as an experience, which can do the following:

  1. Import 2-3 different sources (maybe an Uber file, a Google location history file and a Twitter file?)

  2. From those imported data files, populate a common time-based information record, where everything is a time-indexed event (1-event) or period (2-event/n-event)

  3. Produce at least two views (tabular list + lines on geo map?) of the combined information data.
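
As a rough illustration of steps 1-3, here is a minimal Python sketch of what a common time-based record could look like. Everything in it is an assumption made for illustration: the `TimeRecord` class, its field names, the `build_timeline` and `as_table` helpers, and the sample values are hypothetical and are not taken from any existing HestiaLabs code or real export file. A record with no `end` stands for a single-timestamp event, and a record with an `end` stands for a period.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimeRecord:
    """Hypothetical common record: an instant (end is None) or a period."""
    source: str                      # assumed label, e.g. "uber"
    kind: str                        # assumed label, e.g. "trip"
    start: datetime
    end: Optional[datetime] = None   # None means a single-timestamp event
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject periods that end before they start.
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def is_period(self) -> bool:
        return self.end is not None


def build_timeline(*batches):
    """Merge per-source record lists into one list ordered by start time."""
    merged = [rec for batch in batches for rec in batch]
    return sorted(merged, key=lambda rec: rec.start)


def as_table(timeline):
    """Tabular view: one plain dict per record."""
    return [
        {
            "start": rec.start.isoformat(),
            "end": rec.end.isoformat() if rec.end else "",
            "source": rec.source,
            "kind": rec.kind,
        }
        for rec in timeline
    ]


def utc(hour, minute=0):
    """Helper for made-up sample timestamps on one arbitrary day."""
    return datetime(2023, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up sample data standing in for the output of three source parsers.
uber = [TimeRecord("uber", "trip", utc(9), utc(9, 30))]
google = [TimeRecord("google", "location", utc(9, 10),
                     payload={"lat": 46.5, "lon": 6.6})]
twitter = [TimeRecord("twitter", "tweet", utc(8, 45))]

timeline = build_timeline(uber, google, twitter)
for row in as_table(timeline):
    print(row)
```

A second view (for example lines on a map) could read the same `timeline` list and use only the records whose `payload` carries coordinates, which is one way to keep data parsing separate from display.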

This would allow us to hit a lot of the major points of exploration and explore ideas around semanticisation, separating data parsing from information management/display, etc.

What does Paul/bizdev think of this?

pdehaye commented 1 year ago

How is this different from what @emmanuel-hestia is already doing? https://github.com/hestiaAI/clients/issues/35

See also https://github.com/hestiaAI/clients/issues/43

alexbfree commented 1 year ago

I hadn't seen https://github.com/hestiaAI/clients/issues/43, but yes, you are right: it has elements of both https://github.com/hestiaAI/clients/issues/35 and https://github.com/hestiaAI/clients/issues/43.

I think it's maybe a slightly higher level than https://github.com/hestiaAI/clients/issues/35 but less broad than https://github.com/hestiaAI/clients/issues/43, and as such is more actionable, more end-user focused, and more deliverable.

I think this proposal is

pdehaye commented 1 year ago

https://github.com/hestiaAI/clients/issues/43 has the advantage of not being target-fixated on time series data, but I can see the advantage of a tightly defined deliverable.

Hence one option is to do this issue four times over, for four distinct datasets (picked from the gems, or what is about to become a gem):

Each has the advantage of being focused, but there should definitely be patterns emerging. In particular the first two have a very present geographic component (cityscape) while the other two have a very present infoscape component (in different ways: I navigate the space of tweets versus my data navigates an annotated TrackerControl map upstream of Google, essentially).

alexbfree commented 1 year ago

I can see the merit of splitting it up: it lets us find the richest data from each specific provider. However, something is lost too: if each one is handled separately, we risk each instance becoming too focused on the specifics of an individual case.

I think as an organisation we need to begin to make a shift from

"let's show you everything we can from a set of data files from one company"

to

"let's construct a view of information about an individual day/session/trip/work shift etc, drawn from multiple data sources from different companies"

In other words, if we don't handle more than one company's source in the same ticket, then we bypass the opportunity to look at modelling information from multiple sources into a common format. And I think that is a really important step.
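
To make the second framing concrete, here is a small Python sketch of selecting everything that falls within one work shift, across several companies' data. It is only an assumption about how such a view might be built: the `overlaps` and `view_for_window` functions, the tuple layout `(start, end, label)`, and all sample values are hypothetical.

```python
from datetime import datetime, timezone


def overlaps(start, end, window_start, window_end):
    """True if a record touches the window [window_start, window_end).

    A record with end=None is treated as an instant at `start`.
    """
    effective_end = end if end is not None else start
    return start < window_end and effective_end >= window_start


def view_for_window(records_by_source, window_start, window_end):
    """Collect records from every source that overlap the window, by start time."""
    selected = []
    for source, records in records_by_source.items():
        for start, end, label in records:
            if overlaps(start, end, window_start, window_end):
                selected.append((start, end, source, label))
    return sorted(selected, key=lambda row: row[0])


def utc(hour, minute=0):
    """Helper for made-up sample timestamps on one arbitrary day."""
    return datetime(2023, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up records: (start, end or None, label) per hypothetical source.
records_by_source = {
    "uber": [(utc(9), utc(9, 30), "trip"), (utc(13), utc(13, 20), "trip")],
    "google": [(utc(11, 59), None, "location")],
    "twitter": [(utc(7), None, "tweet")],
}

# A hypothetical work shift from 08:00 to 12:00.
shift = view_for_window(records_by_source, utc(8), utc(12))
for start, end, source, label in shift:
    print(start.isoformat(), source, label)
```

The point of the sketch is that the query is phrased in terms of a time window, not in terms of any one company's file layout.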

This has UX implications too. For example, how does our Upload page change once it has to support uploads from multiple different companies? Presumably each company gets a separate upload box, or everything goes in the same upload box with company types/labels, or something similar.

So I think if we split it into four we would also need a fifth ticket to unify data from different sources. That might require some rework of the work done for the four subtickets described above.

Perhaps another way to do this is that we have one ticket to come up with a unified time-based data model for different companies' data sources, then the 4 tickets you describe (which would then all use that common model), then a sixth ticket which does the unification.

I suppose the other approach is that we just do the 4 tickets described above, without any overlap or commonality, making each one the best form it can be, but accept they are proofs of concept that inform a secondary piece of unification modelling/UX work.

alexbfree commented 1 year ago

(And yeah, I take the point that this ticket focuses on time and there are other focuses - we could imagine parallel/equivalent efforts to this one that have other focuses, e.g. a region, a person, etc.)

alexbfree commented 1 year ago

"(picked from the gems, or what is about to become a gem)"

Do we have a list somewhere?

alexbfree commented 1 year ago

One important strategic question that should precede this ticket is now documented in #65

alexbfree commented 1 year ago

@fquellec mentions this is similar to another recent issue about unifying the schema, which will enable things like this one => hestiaAI/clients/issues/43

alexbfree commented 1 year ago

Put on hold until 43 is finished