twoixter / trackoid

Trackoid is an easy scalable analytics tracker using MongoDB and Mongoid
MIT License
88 stars 16 forks source link

Mongoid scope? #10

Open nickw opened 13 years ago

nickw commented 13 years ago

Is it possible to create a Mongoid scope, or even a manual MongoDB query to sort documents by their number of tracked items? For example Page.desc(:visits)?

I've tried various techniques but since the data is nested in a Tracker instance I don't seem to have access to it.

twoixter commented 13 years ago

I'm afraid the short answer is, no. :-(

And now for the long answer, because this is a very, very, very interesting question...

In previous versions you were able to do this using manual MongoDB queries, since the nested hashes had a simple structure like "year.month.day", so in practice you were able to query something like:

db.analytics.find( {"views_data.2011.09.06" : { $exists : true }} )

That is, the above query returns documents in the "analytics" collection which had data in the ":views" tracking field for September 6th, 2011. (The query is in the mongo shell syntax, but you can do this using the raw mongodb Ruby driver as well). We take advantage of the "dotted" syntax for nested objects. Also, Trackoid appends "_data" to the field name, so a field named ":views" gets saved in MongoDB as "views_data".

Because the "views_data.2011.09.06" is a leaf node, it is an integer, and you could potentially add the "sort" command to retrieve a sorted list of documents which had views in any given day, but NOT months, since "views_data.2011.09" is not a leaf node (is another hash).

Now, with the Time Zone support, I had to drop this simple syntax for the nested hashes, and opted for a more compact number for dates, so instead of a three level hash, I made a two level hash adding support for "hours". The new structure uses the date as the index of the first level and the hour of the hit as the second level:

@date = Time.new.utc.to_i / 24.hours
@hour = Time.new.utc.hour

You have an in-deep discussion of the new structure in the README, but it boils down to now you can't know the hits for a given day unless you sum up all hours.

SQL Databases CAN return an ordered query based on the sum of several columns, for example:

SELECT hour00 + hour01 + hour02 + ... + hour23 AS total ORDER BY total DESC

Which is similar to the problem of the hours array for Trackoid, but as far as I know, this is not possible with MongoDB. And to be honest, I don't like it because this implies the SORT must be done in-memory, by using an operator defeats the index completely.

Hopefully there is a solution using map/reduce operations with MongoDB, but I didn't have the time to test it.

twoixter commented 13 years ago

Sorry, I clicked on "Comment & close" by mitake! :-D I'm reopening it so that we can discuss it further...

nickw commented 13 years ago

Thanks for detailed answer.

For now, I'm thinking of implementing a simple counter cache attribute as a quick and dirty workaround. Although I wonder if that would be something worth adding to the gem itself? You could perhaps create a dynamic attribute like #{trackable_attr}_total_count and then auto update it in the Tracker#add method?

twoixter commented 13 years ago

Yes, absolutely, I think this might work. In fact, when I was thinking about the internal structure for the hash, I was thinking of adding a "total" hash member alongside the other year / month / day levels as a cache. I mean, imagine the following structure:

"trackable_attr_data" : {
    "2011" : {
        "08" : {
            "30" : <Hits for 08/30/2011>,
            "total" : <Total hits for August, 2011>
        },
        "09" : {
            "01" : <Hits for 09/01/2011>,
            "total" : <Total hits for September, 2011>
        }
        "total" : <Total hits for 2011>
    }
    "total" : <Total hits overall>
}

Why I didn't include this "total" cache? To easy internal Ruby handling of this. Although a great idea, it was no critical for our project to include it, so it didn't get into the final code.

@data["2011"]["08"].sum

It's easier and faster than doing a:

@data["2011"]["08"].inject { ...}

Well, you get the picture. :-)

With the inclusion of the Time Zone code after v0.3 this gets worse, because we lose the ability to have a "total" count for a year or a month. Which leads me to really think of going back to using the old structure and only activating the new TZ structure using an option:

track :visits, :use_tz => true   # Or something like that

This way you can selectively use the old or the new structure, because I think not all projects need to be 100% TZ aware as our project was (@webpop).