openelections / openelections-core

Core repo for election results data acquisition, transformation and output.
MIT License

Add bakery to invoke framework #39

Closed zstumgoren closed 10 years ago

zstumgoren commented 10 years ago
ghing commented 10 years ago

Add to_csv and to_json methods to Election, Candidate, Result models. Do not serialize references by default (include_refs=False).

Another (not mutually exclusive) approach would be to make these classes implement a (subset?) of the dictionary interface so the model instances can just be passed directly to json.dumps and csv.DictWriter.writerow(). I haven't looked deeply at the model implementations or the full bakery requirements to know if this is a good idea, but wanted to capture this idea somewhere.
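To make the idea concrete, here is a minimal sketch of that dict-interface approach. The model, fields, and values are all hypothetical, not the actual openelex models; the point is just that implementing `keys()` and `__getitem__` is enough for `dict()` to work, so instances can feed `json.dumps` and `csv.DictWriter` directly:

```python
import csv
import io
import json


class Result:
    """Hypothetical model exposing a dict-style interface so instances
    can be serialized without dedicated to_csv/to_json methods."""

    FIELDS = ("candidate", "votes")

    def __init__(self, candidate, votes):
        self.candidate = candidate
        self.votes = votes

    def keys(self):
        return self.FIELDS

    def __getitem__(self, key):
        if key not in self.FIELDS:
            raise KeyError(key)
        return getattr(self, key)


result = Result("Smith", 1234)

# dict() consumes the keys()/__getitem__ mapping protocol.
as_json = json.dumps(dict(result))

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=Result.FIELDS)
writer.writeheader()
writer.writerow(dict(result))
```

Note that `csv.DictWriter.writerow` expects a real mapping with `.get()`, so the instance is passed through `dict()` first rather than directly.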

zstumgoren commented 10 years ago

Interesting. That could be a very clean approach and it's worth checking out.

Let's also explore MongoEngine's to_json queryset method. My main concern with that method is avoiding a major performance hit on serialization of referenced/related objects, such as Candidate and Office references. If memory serves, MongoEngine now supports "select_related", similar to Django. Would be great if we could spare ourselves some work by leaning on those MongoEngine features, but definitely needs further exploration.

We'll need to better spec this out, but one thing we'll need is the ability to bake out a customizable subset of fields. We should discuss this with Derek, but at minimum I think we'll need the ability to only bake out clean data (i.e. exclude fields prefixed with "raw_").
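The "clean data only" rule could be a simple filter over a flattened record; a sketch, with hypothetical field names:

```python
def clean_fields(record):
    """Drop raw_-prefixed fields so only clean data gets baked out."""
    return {k: v for k, v in record.items() if not k.startswith("raw_")}


# Hypothetical flattened result record.
record = {
    "votes": 1500,
    "raw_votes": "1,500",
    "candidate": "Smith",
    "raw_candidate": "SMITH, JOHN",
}
baked = clean_fields(record)
# baked contains only "votes" and "candidate"
```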

dwillis commented 10 years ago

I think passing a model instance to serializers is a good way to go, if it's not too difficult. As Serdar says, we'll have some number of standardized attributes for a given Result, but it might also make sense to include the raw fields just to be transparent about what we've done.

ghing commented 10 years ago

@zstumgoren point noted about leaning on Mongo/MongoEngine for functionality, especially if it's really performant.

zstumgoren commented 10 years ago

Derek -- for the first iteration, then, should we simply bake out everything? And then defer support for customizable data sets to some later point? That works for me if it works for you all.

dwillis commented 10 years ago

Works for me.

ghing commented 10 years ago

@zstumgoren, in the spec for this feature, you said:

outputdir (default: openelections/us/bakery)

Is this what you mean? Or did you mean openelex/us/bakery? I ask because, in my checkout, files seem to be getting put under openelections/openelex/us/, where openelections is the directory where I checked out this repo. For instance, Maryland's cached fetched CSV files are stored in path/to/openelections/openelex/us/md/cache.

zstumgoren commented 10 years ago

Yep, I meant openelex/us/bakery. Wanted to put this dir one level up, outside of state dirs (unlike cache), because we'll eventually have nationwide baked files for federal offices. Thanks for catching that!

zstumgoren commented 10 years ago

Ps @ghing - don't feel constrained by the bullet points on this ticket. It just reflects a few minutes of strategizing and probably isn't the best implementation (for example, I don't think it mentions the mongoengine to_json method). Feel free to come up with an alternate implementation as long as it's relatively performant and flexible enough to handle custom mongo queries.

ghing commented 10 years ago

@zstumgoren, I won't be too constrained. In general the bullet points give me a good idea of the interface and the parts of things that need to be parameterized.

Is what needs to be baked out essentially openelex.models.Result document instances, or do we need to "join" across a couple of different document types? Maybe a few rows of example output would be helpful, or we can schedule some time to chat about this in the next couple of days?

zstumgoren commented 10 years ago

@ghing We'll need bits of data from Result, Candidate and Contest, as well as Office eventually. This is why I was concerned about dereferencing those related objects. We don't need web-scale performance when baking out these data sets, since the baking will be done async. But they should be able to run in a "reasonable" amount of time (ideally on the order of 10-30 seconds). Pre-loading related models into dictionaries is one strategy I use for this kind of thing, but it isn't ideal. The queryset to_json method, combined with select_related, would be much cleaner. If that proves unworkable, we should discuss your alternative of using the dict interface combined with passing model instances to a serializer. We can discuss this Friday at scrum, or if you hit a wall before then, ping me and we'll talk offline.
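For reference, the pre-loading strategy mentioned above looks roughly like this. The documents and field names here are stand-ins, not the real openelex schema; the idea is to fetch each related collection once, index it by id, and do in-memory lookups instead of per-result dereferencing queries:

```python
def preload(documents, key="_id"):
    """Index a sequence of raw pymongo-style dicts by their id."""
    return {doc[key]: doc for doc in documents}


# Hypothetical raw documents standing in for pymongo query results.
contests = preload([{"_id": "c1", "office": "U.S. House"}])
candidates = preload([{"_id": "k1", "full_name": "Smith"}])
results = [{"contest_id": "c1", "candidate_id": "k1", "votes": 1234}]

rows = []
for result in results:
    # Dictionary lookups replace one-query-per-reference dereferencing.
    contest = contests[result["contest_id"]]
    candidate = candidates[result["candidate_id"]]
    rows.append({
        "office": contest["office"],
        "candidate": candidate["full_name"],
        "votes": result["votes"],
    })
```

The trade-off is holding all related documents in memory, which is acceptable for an async bake job of this size.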

ghing commented 10 years ago

@zstumgoren My first impression on playing around with select_related() is that it's still slow. I feel like I can make a better call on this when I understand the output data that we want. Let's chat!

zstumgoren commented 10 years ago

Let's discuss more tomorrow, but a few initial thoughts:

ghing commented 10 years ago

Chatted with @zstumgoren yesterday about the fields in the baker output and here's a few notes:

Things Geoff will look into/we should talk about in scrum:

dwillis commented 10 years ago

Some thoughts:

ghing commented 10 years ago

@dwillis:

Not sure why Election dates need to be DateTimes

It looks like the reason is that MongoEngine only has DateTimeField and not separate field types for Date and DateTime. No big deal.
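A small illustration of why this is harmless: since BSON's Date type is a UTC datetime, a plain election date just gets stored with a zeroed time component (the date value here is only an example):

```python
import datetime

# A plain election date stored the way a Mongo Date would hold it:
# as a datetime with the time component zeroed out.
election_day = datetime.date(2012, 11, 6)
stored = datetime.datetime.combine(election_day, datetime.time.min)
# stored is datetime.datetime(2012, 11, 6, 0, 0)
```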

dwillis commented 10 years ago

@ghing Ah, ok then.

ghing commented 10 years ago

@zstumgoren:

It looks like max_depth=1 vs. max_depth=2 for select_related() doesn't make any difference.

Also select_related() returns a list rather than a QuerySet, so we can't do something like qs.select_related().to_json() or qs.select_related().as_pymongo().

I still need to look at the number of queries that actually happen with select_related() and maybe the source to get a sense of what's going on.

zstumgoren commented 10 years ago

State-specific bakers are the common case that we're starting with. The mention of nationwide files is more of a future consideration to keep in mind as he develops the bakery API.

Re DateTimeField, unlike Django, MongoEngine does not have a plain DateField as far as I know. I haven't verified this by looking into the ME source, but I believe this is in keeping with Mongo itself, whose Date bson type is a UTC datetime.

And another +1 on spitting out help_text.

zstumgoren commented 10 years ago

@ghing Ok, so it's starting to sound like select_related may not work for us. Let's make a final call once we get a handle on the query counts. Thanks!

ghing commented 10 years ago

@zstumgoren, select_related() doesn't seem to generate that many more queries:

w/o select_related:

>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012')
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
... 
0
md-2012-11-06-general-us-congress-06
2
md-2012-11-06-general-us-congress-07
4

w/ select_related():

>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012').select_related()
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
... 
24
md-2012-11-06-general-us-congress-06
24
md-2012-11-06-general-us-congress-07
24

BTW, that query returns almost 73,000 documents for me. I wonder if the slowdown is less with the queries and more with how things are dereferenced.

I may do some more in-depth profiling to get more insight into this. I might also whip together a quick version that preloads the candidate and contest records to get a quick comparison. I'll have more to report at standup.

zstumgoren commented 10 years ago

@ghing Nice digging on this. Have to say I'm surprised. I assumed we'd see our query count go down, not up, with select_related. The numbers are likely even worse than you see, however, because serialization should trigger additional queries for the Result.candidate and possibly for Result.candidate.contest.

But yeah, i think we're headed towards a homegrown solution here. Let's discuss more tomorrow.

ghing commented 10 years ago

@zstumgoren, the query count does go down with select_related(); the counter being output is the aggregate counter. So, with select_related(), it's making 24 queries. Boom. Done. Without select_related(), it's making 2 queries for each item accessed, which adds up to a lot.

So, for our use case, where we need to dereference a lot of records, select_related() will save us a lot of queries. It just doesn't seem fast enough to me. I'll have more info on this by standup.

ghing commented 10 years ago

After a little more digging, I'm pretty convinced that select_related() isn't the way to go. The biggest reason isn't the number of queries, but the expense of building MongoEngine models from pymongo dictionaries only to convert them back to something flat and serializable. I'll explain more during standup.

ghing commented 10 years ago

Still todo:

ghing commented 10 years ago

I put some sample baker output in the Sample Baker Output folder in the gdocs.

Right now the field names come from mongo, with a prefix added to indicate their collection. Should the field names be transformed to conform to https://github.com/openelections/specs/wiki?

dwillis commented 10 years ago

That would be my preference, but the spec isn't totally frozen, either. Serdar?

zstumgoren commented 10 years ago

Yep, what Derek said. We can also eventually produce a slimmed down output that only includes fields specified in the official spec.

ghing commented 10 years ago

@dwillis, @zstumgoren, as of 1c9e8e760449c2af1dadf540e7d7f7766cfe324d, the field names that we can translate from the mongo names to the spec names are translated in the output and there's an interface to change these translations in a declarative way as the spec changes.
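A rough sketch of what that declarative translation interface looks like in spirit (the mappings below are illustrative, not the actual table in the baker code):

```python
# Declarative mongo-name -> spec-name translation table. Fields without
# an entry keep their mongo-derived, collection-prefixed name.
FIELD_NAME_MAP = {
    "candidate.full_name": "candidate",
    "result.votes": "votes",
}


def translate_field_names(record, name_map=FIELD_NAME_MAP):
    """Rename fields per the spec; pass unknown fields through as-is."""
    return {name_map.get(k, k): v for k, v in record.items()}


row = translate_field_names({
    "candidate.full_name": "Smith",
    "result.votes": 1234,
    "result.raw_total_votes": 2000,
})
```

Updating the output as the spec changes then means editing one dictionary rather than touching serialization code.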

As I was implementing this, I found that there were some fields in the spec that aren't in the model definition, e.g. pct for Result.

zstumgoren commented 10 years ago

Vote pct won't always be calculated for us, though it's a key value-add that we'd like to make part of the transform step in our data pipeline. If it is provided for us, the transform step should also perform the calculation, which can then be validated in the next phase of the pipeline.

vote_pct is important enough that it probably merits its own raw_vote_pct and vote_pct fields on the Result model. The former would store pct if provided in raw results, otherwise would be empty; the latter would store our calculated pct.
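The transform-step calculation itself is straightforward; a sketch using the two fields described above (the helper name and record shape are hypothetical):

```python
def calculate_vote_pct(candidate_votes, total_votes):
    """Compute a candidate's share of the vote; None if total unknown."""
    if not total_votes:
        return None
    return round(100.0 * candidate_votes / total_votes, 2)


result = {
    "votes": 1234,
    "raw_vote_pct": "61.7",  # as provided in raw results, may be empty
}
# Always compute our own pct, even when a raw one was provided, so the
# validation phase can compare the two.
result["vote_pct"] = calculate_vote_pct(result["votes"], 2000)
```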

We'd need to review the other fields in the spec that are not reflected in the models to determine whether they deserve formal model fields or make more sense as dynamic attributes on the model, or perhaps stuffed into an embedded object similar to vote_breakdowns.

@ghing Can we create a new ticket to centralize discussion of the missing spec fields?

ghing commented 10 years ago

Note to self: I've fallen out of like with the dotted notation of nested fields. I wanted to check that just merging everything without prefixing the field names wouldn't cause any overwriting. I ran:

grep Field openelex/models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/'  | uniq -d

and it didn't show any output, so I think I'm good to just merge stuff in. Now that there's the framework for rewriting field names, this isn't a big problem.

Update: I'm bad at shell. You have to sort the output before running uniq. The command should be:

grep Field models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/' | sort | uniq -d

That means the following fields appear on more than one model:

ghing commented 10 years ago

Merged a first pass into the dev branch from my feature branch.

Closing this as I believe I've implemented the core functionality. To avoid confusion, I'd like to address any further issues/refinements to the baker in separate issues.