ravendb / ravendb

ACID Document Database
https://ravendb.net
Other
3.55k stars 820 forks source link

Please document How to Query on and roll up to the most recent entry in a time series #11921

Closed jhancock4d closed 1 year ago

jhancock4d commented 3 years ago

I have an array of time series within a contact:

contentView {
   "contentId: "1234",
   "contactId": "987",
   [Time series of double]
}

content {
    "id": "343",
    "name": "some content name",
    "minimumCorrectAnswers": 2,
    "questions": [array of QuestionIds],
    "dependsOn": [array of contentIds]
}

contentQuestionAnswer {
   "contentId": "1234",
   "contactId": "987",
   "questionId": "342eafa",
   [Time series of {"answerId", percentCredit}]
}

contactLogin {
   "contactId": "987",
   "success": true,
   "errorMessage": null
}

I have setup the contentQuestionAnswer so that the id convention is contentId/questionId/contactId which should allow me to load documents using LoadDocument in a projection.... I think. (or use an include)

I would like to store the most recent X number of entries in the time series

I would like to query on content Views of some contactid that has a most recent view of less than < 100% (i.e. get a list of content that the user has started watching but hasn't finished)

I would like to get all content for which there is minimumCorrectAnswers (i.e. time series where percentCredit = 100%) of unique Questions for each of the dependencies .

The same would be for "most recent login attempt" which would just be "last".

But I can't find any documentation on how to:

  1. Cull records in the time series by count (not date range)
  2. Get just the most recent entry of the entire time series (date doesn't matter, I just want the newest one) and use that in the query.
  3. Query for all contentViews where the newest time series entry is < 100% for a given contentId, and contactId
  4. Query for all content where all depends on have at least as many right answers in contentQuestionAnswer as required by the content. (select => LoadDocument contentId/contactId/questionId ).where most recent answer has percentComplete = 100.0

I think all of these are relevant to a ton of use cases and would be useful in the documentation. Even No. 4 above is load a document and then make sure that the returned array has at least a count of 2 where the last time series entry meets some filter criteria which I think would be very common for Time Series.

aviv86 commented 3 years ago

Hi

Querying on X most recent timeseries entries is currently not supported, but we are planning to add support for this in the future : https://issues.hibernatingrhinos.com/issue/RavenDB-15030

For now, except for by range date - we only support timeseries querying on Last (or First) X "time-units" (e.g. last 10 minutes, last 5 seconds, first 2 hours, etc.)

for example:

from contentView
where contactId = '987'
select timeseries(from SeriesName last 1 second where Value < 100)

this will return all the entries from the series' last one second, where entries' value < 100

As an alternative, you can use GetTimeSeriesStatisticsOperation to figure out the Start and End timestamps of your series, and then query the series using [Ènd, End] as the requested range, which will assure that you'll only get a single entry - the most recent one. e.g.

var op = await store.Operations.SendAsync(new GetTimeSeriesStatisticsOperation(documentId));
var end = op.TimeSeries[0].EndDate;

...

select timeseries($"from SeriesName between {end} and {end}")

In order to query across multiple timeseries and do filtering according to them, you should use time series indexes: https://ravendb.net/docs/article-page/5.1/csharp/document-extensions/timeseries/indexing

usage example :

private class TimeSeriesIndex : AbstractTimeSeriesIndexCreationTask<contentView>
{
    public class Result
    {
        public TimeSeriesEntry MostRecent { get; set; }

        public int Count { get; set; }

        public contentView Doc { get; set; }
    }

    public TimeSeriesIndex()
    {
        AddMap("contentView-timeSeries",
            timeSeries => from ts in timeSeries
            let doc = LoadDocument<contentView>(ts.DocumentId)
            select new 
            {
                MostRecent = ts.Entries.Last(),
                Count = ts.Count,
                Doc = doc
            });
    }
}
var query1 = session.Query<TimeSeriesIndex.Result, TimeSeriesIndex>()
    .Where(x => x.Count > 2)
    .ToList();

var query2 = session.Query<TimeSeriesIndex.Result, TimeSeriesIndex>()
    .Where(x => x.MostRecent.Value < 0.01)
    .ToList();

I think this might help you with #1 - #3 above.

4 is a bit more complex, and I'm not sure that I fully understand what

you are trying to achieve there..

but perhaps using javascript custom functions can give you the behaviour you desire : https://ravendb.net/docs/article-page/5.1/csharp/document-extensions/timeseries/querying/overview-and-syntax#syntax-declaring-a-time-series-function

declare timeseries tsFunc(doc){
    from doc.answerId 
    last 1 sec
    where Value = 100
}

declare function f(doc){
    var ids = [];
    for (var i = 0; i < doc.dependsOn.length; i++)
    {
        var c = doc.dependsOn[i];
        for (var j = 0; j < doc.questions.length; j++)
        {
            var q = doc.questions[j];
            ids.push(id(doc) + '/' + c + '/' + q);
        }
    }

    var docs = load(ids); // LoadDocument contentId/contactId/questionId

    var results = [];
    for (var i = 0; i < docs.length; i++)
    {
        var tsResult = tsFunc(docs[i]);
        if (tsResult.Count > 0)
            results.push(docs[i])   
    }

    return {
        Results : results
    }
}

from content as c
select f(c)

Hope this helps

jhancock4d commented 3 years ago

Thanks for this. Still digesting, but one of the core use cases I see for time series is Activity data. Login attempts, content viewing, test question answering etc.

In these cases, it is a time series because it's sorted by time and incremental, but it's cumulative, not a time window (i.e. I care about most recent logins, I don't care about most recent logins in the last 10 days because there may be no logins in that period for the given user) It appears per your first comment that this isn't possible right now because it's always limited by time.

And it also appears that it isn't possible to just roll up to most recent or at least get the most recent value (or object) in the time series, which is how you'd handle content viewing (i.e. as a person watches a video on youtube or Plex their system keeps track of percent complete for a given video Id and a given userId. Once they watch somewhere around 95% of it, they mark the completion as 100%. What you care about for statistics and recommendation systems is the most recent number, not something in a time window. You would also, want to look at statistics for a video for a creator and tell them the habits of how people watch videos and how long it takes to complete them (do they watch it end to end, pause and come back etc) and the same holds true for analysising data for advertisers. (they want to know the same thing)

The 4th use case is similar. If you've ever used Plex, you'll know they have an On Deck tool. It is effectively having to solve the same problem: IF I have a season of a TV show, my on deck will show me the next episode of said TV show that I haven't watched, but for which I have watched every previous episode. This is a recursive problem for which I'm attempting to use Time Series with Graph API and the recursive tooling. The difference between what we're doing and plex, is that it isn't just watched data, its "did you pass the test for what you watched? data". Again this is perfect for time series because a person answers questions about the content over time. Those correct answers give credit. You then reenforce the content by periodically asking questions again to see if the person still remembers and revoke credit if they can't get it so that they have to go back. The On Deck for the next content will revert if they fail the second time, but we still want to know their entire history of answering that specific question, even though they got it wrong, then got it right, then got it wrong again.

I'll go through what you have above, but I wanted to communicate these use cases because I think they're ideal for time series and adding the functionality make this happen would be hugely powerful. (it would also be nice if time series content was available strongly typed in your data objects, and was queryable by link.

ayende commented 3 years ago

Time Series in RavenDB are about recording a value over time. In the context of what you are talking about, you seem to want to record percentage of knowledge over time. Is that correct? I assume that this is over various topics for each user? And you want to show the most recent result per topic across them all?

If that is the case, the proper way to do that would be to utilize an index, which can present you with all these results at once.

jhancock4d commented 3 years ago

What I've done is like this:

contact {
   id: "contact/"
}

content {
   id: "content/",
   DependsOn: [
      (content with questionids)
   ],
   Questions: [question records that have a Guid id],
   MinimumCorrectAnswers: 2
}

contentActivity {
    id: "/contact/content"
    contentId: '',
    contactId: '',
    log: [List of view records with date],
    current: {On, PercentComplete}
}

contentQuestionAnswer {
    id: "/contact/content/question",
    contentId: ""
    questionId: ""
    contactId: ''
    log: [List of Answers given with date],
    current: {On, Answer}
}

Contacts will be in the hundreds of millions Content in the hundreds of millions contentActivity is % of content % of users and every log for every user will be probably a dozen or more rows. contentQuestionAnswer is % of content % of users and every log will be probably 2-3 rows.

On Deck = Content that you haven't gotten a the minimum number of questions correct with your answers but have that content's dependancies (or content that has no dependancies but you still want to watch) with passed by the same criteria.

Continue Watching = Content that you have a record in the contentActivity but the current percent complete is < 95.

Every single log is a natural time series. It's an entry by time with a new value (can be a percent complete, can be an object value)

Because there can be no dependancies for a piece of content, the query is conditional.

Here's my hack at it thus far (without using time series):

                var raw = await DbSession.Query<Content>()
                    .Select(c => new
                    {
                        Content = c,
                        DependsOn = c.DependsOn.Select(q => new
                        {
                            q,
                            Activity = RavenQuery.Load<QuestionAnswerActivity>(
                                string.Format(QuestionAnswerActivity.ID_TEMPLATE, q.Id, Identity.Id)
                            )
                        }),
                        Activity = c.Questions.Select(q =>
                            RavenQuery.Load<QuestionAnswerActivity>(
                                string.Format(QuestionAnswerActivity.ID_TEMPLATE, q.Id, Identity.Id)
                            )
                        )
                    }).Where(c => !c.DependsOn.Any() || c.DependsOn.Any(a => a.Activity != null && a.Activity.MostRecent.Answer.PercentCredit == 100)).ToArrayAsync();

                var results = raw.Where(c => !c.Activity.Any() &&
                        (!c.DependsOn.Any() ||
                            c.DependsOn.Count(d => d.Activity != null && d.Activity.MostRecent.Answer.PercentCredit == 100) >= (c.Content.CorrectAnswersRequired ?? 1)
                        )
                    ).Select(c => Mapper.Map<ContentRefDto>(c.Content))
                    .Take(10)
                    .ToArray();

Obviously this is non-optimal because I'm having to do things in memory, but when I put the logic into the first query that should be there, it won't work because it won't compute Count > CorrectAnswersRequired, so I get anything that you might be close on and then filter down in memory to minimize the impact.

And of course, this is going to generate a HUGE index as a result of even this query before putting it in as an index. (And the query requires the current user which isn't known just from the content table so you'd end up creating an index that is many to many to many.

Any other way I've tried I can't reference the associated documents, and putting billions of sub records inside a single content document or thousands of sub records inside of a content document causing concurrency issues seems like a really bad idea because as your docs say kilobytes I tried it that way, and it was very quickly over a meg and climbing in a hurry.

Graph API works with this stuff nicely in theory but I can't create the conditional edge on the depends on properly, and I can't create the compound keys with it as far as I can tell. And linq I run into all kinds of issues with Time series trying to get the rolled up most current so I'm using a List for now and a current item because again, I can't use .Last().answer in the query because it doesn't like that (and it would probably be slower anyhow).

These are the easy cases. I then have to use Graph Api to create every single possible permutation and combination of paths in the dependency tree and analyze those results to do assessments which will also be based on that data. (and is patented tech proven out with in memory linq, that I need to adapt to RavenDb)

Ideally what I was hoping is that I could create a count window on the time series, have raven manage that, get the most recent for all of the above, and use Graph Api to pull it all conditionally in a single computation which of course we'll cache in Redis and invalidate or update as they answer questions etc.

ayende commented 3 years ago

Is contact and user the same thing, in this case? Can you do something like:

// map
from c in docs.ContentQuestionAnswer 
select new 
{
   c.ContactId,
   c.ContentId,
   CorrectAnswers = c.Current = "On",
  MinimumCorrectAnswers = LoadDocument(c.ContentId, "Contents"),
  Complete = false
}

// reduce
from r in results group r by new { r.ContactId, r.ContentId } into g
let MinimumCorrectAnswers = g.First().MinimumCorrectAnswers
let CorrectAnswers = g.Sum(x=>x.CorrectAnswers)
select new {
  g.Key.ContentId,
  g.Key.ContactId,
  CorrectAnswers,
  MinimumCorrectAnswers,
  Complete = CorrectAnswers > MinimumCorrectAnswers
}

This gives you the output for all the completed contents, you can use an Output Collection, which you'll then be able to process further. If you can reverse the DependsOn structure, that would make things a lot easier, too. Since then you can do another index:

// map
from d in docs.CompletedCourses // prev output collection
where d.Complete  // not all are completed there
from next in LoadDocument(d.ContentId, "Contents").NextContents // reverse of DependsOn
select new 
{
   Next = new[]{ next },
   Completed = new[] { d.ContentId }, // for filtering
   d.ContactId
}

// reduce
from r in results group r by r.ContactId
into g
let completed = g.SelectMany(x=>x.Completed).Distinct().ToArray()
select new {
  ContactId = g.Key,
  Completed = completed,
  Next = g.SelectMany(x=>x.Next).Distinct().Except(completed)
}

That gives you an easy way to query the next content that you have per user. No costly queries.

All of that said, I would actually avoid contentActivity and contentActivity in general. I would recommend that you have:

The idea is that this represent all the work they did on that content. By not spreading this around, you can apply a lot more logic at that point. It also gives you the chance to handle this issue in a better manner. Instead of walking up and down the graph, you can mark the ContactContent as completed and then mark the next content that is opened up as a result.

This is most relevant since you are putting a lot of logic into those queries, and I don't think that is advisable. Better to move that to your own actions and then respond to that instead. For example, you'll have scenarios where "open that course for this user" or something like that based on special conditions. And re-indexing the whole thing because of that is really expensive.

jhancock4d commented 3 years ago

So if I have ContactContent the issue is that it would then look like this:

{
   contentId: "",
   contactId: "",
   percentComplete: 12,
   passed: true/false,
   questions: [
       questionref: {
          current: answer,
          log: [
             answer,
             answer,
             answer
          ]
       }
   ]
}

So the time series for answers is then on a question inside of the questions.

And the Document load becomes much more complex, because the questions are stored on the content, as well.

Is this what you would expect to see? The issue to me still seems to be that I need at least MinimumCorrectAnswers number of questions in the ContactContent document to be correct, and every time I try and do a count(a => a.Credit = 100) it tells me I can't compute in a where.

Using the above I wrote this which does it all in a single query nicely I thought...

                var results = await DbSession.Query<Content>()
                    .Select(c => new
                    {
                        Content = c,
                        DependsOn = c.DependsOn.Select(q => new
                        {
                            Content = q,
                            Activity = RavenQuery.Load<ContactContent>(
                                string.Format(ContactContent.ID_TEMPLATE, Identity.Id, q.Id)
                            )
                        }),
                        Activity = RavenQuery.Load<ContactContent>(
                                string.Format(ContactContent.ID_TEMPLATE, Identity.Id, c.Id)
                            )
                    })
                    .Where(c => c.Activity.CorrectAnswers < c.Content.CorrectAnswersRequired &&
                        (!c.DependsOn.Any() || c.DependsOn.All(a => a.Activity != null && a.Activity.CorrectAnswers >= a.Content.CorrectAnswersRequired)))
                    .Select(c => c.Content)
                    .ProjectTo<ContentRefDto>(Mapper.ConfigurationProvider)
                    .Take(10)
                    .ToArrayAsync();

But then I get the following:

Where clauses containing a Binary Expression between two fields are not supported. All Binary Expressions inside a Where clause should be between a field and a constant value. Convert(c.Activity.CorrectAnswers, Nullable1)andc.Content.CorrectAnswersRequiredare both fields, so cannot convert (Convert(c.Activity.CorrectAnswers, Nullable1) < c.Content.CorrectAnswersRequired) to a proper query.

Isn't this type of comparison pretty common? (Also apparently .All() isn't supported either which is required for this to work)

ayende commented 3 years ago

Yes, something like this. You can then define an index that would determine whatever there are sufficient correct answers, or just have a SufficentCorrectAnswers in the document directly.

ayende commented 3 years ago

In general, RavenDB favors doing the work in indexing time, rather than at query time. That also means that many business decisions should be outside the database entirely.