openelections / openelections-core

Core repo for election results data acquisition, transformation and output.
MIT License

Add bakery to invoke framework #39

Closed zstumgoren closed 10 years ago

zstumgoren commented 10 years ago
ghing commented 10 years ago

Add to_csv and to_json methods to Election, Candidate, Result models. Do not serialize references by default (include_refs=False).

Another (not mutually exclusive) approach would be to make these classes implement a (subset?) of the dictionary interface so the model instances can just be passed directly to json.dumps and csv.DictWriter.writerow(). I haven't looked deeply at the model implementations or the full bakery requirements to know if this is a good idea, but wanted to capture this idea somewhere.
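To make the idea concrete, here is a minimal sketch of that dict-interface approach. The model, fields, and values are all hypothetical, not the actual openelex models; the point is just that implementing `keys()` and `__getitem__` is enough for `dict()` to work, so instances can feed `json.dumps` and `csv.DictWriter` directly:

```python
import csv
import io
import json


class Result:
    """Hypothetical model exposing a dict-style interface so instances
    can be serialized without dedicated to_csv/to_json methods."""

    FIELDS = ("candidate", "votes")

    def __init__(self, candidate, votes):
        self.candidate = candidate
        self.votes = votes

    def keys(self):
        return self.FIELDS

    def __getitem__(self, key):
        if key not in self.FIELDS:
            raise KeyError(key)
        return getattr(self, key)


result = Result("Smith", 1234)

# dict() consumes the keys()/__getitem__ mapping protocol.
as_json = json.dumps(dict(result))

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=Result.FIELDS)
writer.writeheader()
writer.writerow(dict(result))
```

Note that `csv.DictWriter.writerow` expects a real mapping with `.get()`, so the instance is passed through `dict()` first rather than directly.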

zstumgoren commented 10 years ago

Interesting. That could be a very clean approach and it's worth checking out.

Let's also explore MongoEngine's to_json queryset method. My main concern with that method is avoiding a major performance hit on serialization of referenced/related objects, such as Candidate and Office references. If memory serves, MongoEngine now supports "select_related", similar to Django. Would be great if we could spare ourselves some work by leaning on those MongoEngine features, but definitely needs further exploration.

We'll need to better spec this out, but one thing we'll need is the ability to bake out a customizable subset of fields. We should discuss this with Derek, but at minimum I think we'll need the ability to only bake out clean data (i.e. exclude fields prefixed with "raw_").
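The "clean data only" rule could be a simple filter over a flattened record; a sketch, with hypothetical field names:

```python
def clean_fields(record):
    """Drop raw_-prefixed fields so only clean data gets baked out."""
    return {k: v for k, v in record.items() if not k.startswith("raw_")}


# Hypothetical flattened result record.
record = {
    "votes": 1500,
    "raw_votes": "1,500",
    "candidate": "Smith",
    "raw_candidate": "SMITH, JOHN",
}
baked = clean_fields(record)
# baked contains only "votes" and "candidate"
```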

dwillis commented 10 years ago

I think passing a model instance to serializers is a good way to go, if it's not too difficult. As Serdar says, we'll have some number of standardized attributes for a given Result, but it might also make sense to include the raw fields just to be transparent about what we've done.

ghing commented 10 years ago

@zstumgoren point noted about leaning on Mongo/MongoEngine for functionality, especially if it's really performant.

zstumgoren commented 10 years ago

Derek -- for the first iteration, then, should we simply bake out everything? And then defer support for customizable data sets to some later point? That works for me if it works for you all.

dwillis commented 10 years ago

Works for me.

ghing commented 10 years ago

@zstumgoren, in the spec for this feature, you said:

outputdir (default: openelections/us/bakery)

Is this what you mean? Or did you mean openelex/us/bakery? I ask because, in my checkout, files seem to be getting put under openelections/openelex/us/, where openelections is the directory where I checked out this repo. For instance, Maryland's cached fetched CSV files are stored in path/to/openelections/openelex/us/md/cache.

zstumgoren commented 10 years ago

Yep, I meant openelex/us/bakery. Wanted to put this dir one level up, outside of state dirs (unlike cache), because we'll eventually have nationwide baked files for federal offices. Thanks for catching that!

zstumgoren commented 10 years ago

Ps @ghing - don't feel constrained by the bullet points on this ticket. It just reflects a few minutes of strategizing and probably isn't the best implementation (for example, I don't think it mentions the mongoengine to_json method). Feel free to come up with an alternate implementation as long as it's relatively performant and flexible enough to handle custom mongo queries.

ghing commented 10 years ago

@zstumgoren, I won't be too constrained. In general the bullet points give me a good idea of the interface and the parts of things that need to be parameterized.

Is what needs to be baked out essentially openelex.models.Result document instances, or do we need to "join" across a couple of different document types? Maybe a few rows of example output would be helpful, or we can schedule some time to chat about this in the next couple of days?

zstumgoren commented 10 years ago

@ghing We'll need bits of data from Result, Candidate and Contest, as well as Office eventually. This is why I was concerned about dereferencing those related objects. We don't need web-scale performance when baking out these data sets, since the baking will be done async. But they should be able to run in a "reasonable" amount of time (ideally on the order of 10-30 seconds). Pre-loading related models into dictionaries is one strategy I use for this kind of thing, but it isn't ideal. The queryset to_json method, combined with select_related, would be much cleaner. If that proves unworkable, we should discuss your alternative of using the dict interface combined with passing model instances to a serializer. We can discuss this Friday at scrum, or if you hit a wall before then, ping me and we'll talk offline.
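For reference, the pre-loading strategy mentioned above looks roughly like this. The documents and field names here are stand-ins, not the real openelex schema; the idea is to fetch each related collection once, index it by id, and do in-memory lookups instead of per-result dereferencing queries:

```python
def preload(documents, key="_id"):
    """Index a sequence of raw pymongo-style dicts by their id."""
    return {doc[key]: doc for doc in documents}


# Hypothetical raw documents standing in for pymongo query results.
contests = preload([{"_id": "c1", "office": "U.S. House"}])
candidates = preload([{"_id": "k1", "full_name": "Smith"}])
results = [{"contest_id": "c1", "candidate_id": "k1", "votes": 1234}]

rows = []
for result in results:
    # Dictionary lookups replace one-query-per-reference dereferencing.
    contest = contests[result["contest_id"]]
    candidate = candidates[result["candidate_id"]]
    rows.append({
        "office": contest["office"],
        "candidate": candidate["full_name"],
        "votes": result["votes"],
    })
```

The trade-off is holding all related documents in memory, which is acceptable for an async bake job of this size.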

ghing commented 10 years ago

@zstumgoren My first impression on playing around with select_related() is that it's still slow. I feel like I can make a better call on this when I understand the output data that we want. Let's chat!

zstumgoren commented 10 years ago

Let's discuss more tomorrow, but a few initial thoughts:

ghing commented 10 years ago

Chatted with @zstumgoren yesterday about the fields in the baker output and here's a few notes:

Things Geoff will look into/we should talk about in scrum:

dwillis commented 10 years ago

Some thoughts:

ghing commented 10 years ago

@dwillis:

Not sure why Election dates need to be DateTimes

It looks like the reason is that MongoEngine only has DateTimeField and not separate field types for Date and DateTime. No big deal.
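A small illustration of why this is harmless: since BSON's Date type is a UTC datetime, a plain election date just gets stored with a zeroed time component (the date value here is only an example):

```python
import datetime

# A plain election date stored the way a Mongo Date would hold it:
# as a datetime with the time component zeroed out.
election_day = datetime.date(2012, 11, 6)
stored = datetime.datetime.combine(election_day, datetime.time.min)
# stored is datetime.datetime(2012, 11, 6, 0, 0)
```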

dwillis commented 10 years ago

@ghing Ah, ok then.

ghing commented 10 years ago

@zstumgoren:

It looks like max_depth=1 vs. max_depth=2 for select_related() doesn't make any difference.

Also select_related() returns a list rather than a QuerySet, so we can't do something like qs.select_related().to_json() or qs.select_related().as_pymongo().

I still need to look at the number of queries that actually happen with select_related() and maybe the source to get a sense of what's going on.

zstumgoren commented 10 years ago

State-specific bakers are the common case that we're starting with. The mention of nationwide files is more of a future consideration to keep in mind as he develops the bakery API.

Re DateTimeField, unlike Django, MongoEngine does not have a plain DateField as far as I know. I haven't verified this by looking into the ME source, but I believe this is in keeping with Mongo itself, whose Date bson type is a UTC datetime.

And another +1 on spitting out help_text.

zstumgoren commented 10 years ago

@ghing Ok, so it's starting to sound like select_related may not work for us. Let's make a final call once we get a handle on the query counts. Thanks!

ghing commented 10 years ago

@zstumgoren, select_related() doesn't seem to generate that many more queries:

w/o select_related:

>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012')
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
... 
0
md-2012-11-06-general-us-congress-06
2
md-2012-11-06-general-us-congress-07
4

w/ select_related():

>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012').select_related()
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
... 
24
md-2012-11-06-general-us-congress-06
24
md-2012-11-06-general-us-congress-07
24

BTW, that query returns almost 73,000 documents for me. I wonder if the slowdown is less with the queries and more with how things are dereferenced.

I may do some more in-depth profiling to get more insight into this. I might also whip together a quick version that preloads the candidate and contest records to get a quick comparison. I'll have more to report at standup.

zstumgoren commented 10 years ago

@ghing Nice digging on this. Have to say I'm surprised. I assumed we'd see our query count go down, not up, with select_related. The numbers are likely even worse than you see, however, because serialization should trigger additional queries for the Result.candidate and possibly for Result.candidate.contest.

But yeah, i think we're headed towards a homegrown solution here. Let's discuss more tomorrow.

ghing commented 10 years ago

@zstumgoren, the query count does go down with select_related(); the counter being output is the aggregate counter. So, with select_related(), it's making 24 queries. Boom. Done. Without select_related(), it's making 2 queries for each item accessed, which adds up to a lot.

So, for our use case, where we need to dereference a lot of records, select_related() will save us a lot of queries. It just doesn't seem fast enough to me. I'll have more info on this by standup.

ghing commented 10 years ago

After a little more digging, I'm pretty convinced that select_related() isn't the way to go. The biggest reason isn't the number of queries, but the expense of building MongoEngine models from pymongo dictionaries only to convert them back to something flat and serializable. I'll explain more during standup.

ghing commented 10 years ago

Still todo:

ghing commented 10 years ago

I put some sample baker output in the Sample Baker Output folder in the gdocs.

Right now the field names come from mongo, with a prefix added to indicate their collection. Should the field names be transformed to conform to https://github.com/openelections/specs/wiki?

dwillis commented 10 years ago

That would be my preference, but the spec isn't totally frozen, either. Serdar?

zstumgoren commented 10 years ago

Yep, what Derek said. We can also eventually produce a slimmed down output that only includes fields specified in the official spec.

ghing commented 10 years ago

@dwillis, @zstumgoren, as of 1c9e8e760449c2af1dadf540e7d7f7766cfe324d, the field names that we can translate from the mongo names to the spec names are translated in the output and there's an interface to change these translations in a declarative way as the spec changes.
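A rough sketch of what that declarative translation interface looks like in spirit (the mappings below are illustrative, not the actual table in the baker code):

```python
# Declarative mongo-name -> spec-name translation table. Fields without
# an entry keep their mongo-derived, collection-prefixed name.
FIELD_NAME_MAP = {
    "candidate.full_name": "candidate",
    "result.votes": "votes",
}


def translate_field_names(record, name_map=FIELD_NAME_MAP):
    """Rename fields per the spec; pass unknown fields through as-is."""
    return {name_map.get(k, k): v for k, v in record.items()}


row = translate_field_names({
    "candidate.full_name": "Smith",
    "result.votes": 1234,
    "result.raw_total_votes": 2000,
})
```

Updating the output as the spec changes then means editing one dictionary rather than touching serialization code.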

As I was implementing this, I found that there were some fields in the spec that aren't in the model definition, e.g. pct for Result.

zstumgoren commented 10 years ago

Vote pct won't always be calculated for us, though it's a key value-add that we'd like to make part of the transform step in our data pipeline. If it is provided for us, the transform step should also perform the calculation, which can then be validated in the next phase of the pipeline.

vote_pct is important enough that it probably merits its own raw_vote_pct and vote_pct fields on the Result model. The former would store pct if provided in raw results, otherwise would be empty; the latter would store our calculated pct.
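The transform-step calculation itself is straightforward; a sketch using the two fields described above (the helper name and record shape are hypothetical):

```python
def calculate_vote_pct(candidate_votes, total_votes):
    """Compute a candidate's share of the vote; None if total unknown."""
    if not total_votes:
        return None
    return round(100.0 * candidate_votes / total_votes, 2)


result = {
    "votes": 1234,
    "raw_vote_pct": "61.7",  # as provided in raw results, may be empty
}
# Always compute our own pct, even when a raw one was provided, so the
# validation phase can compare the two.
result["vote_pct"] = calculate_vote_pct(result["votes"], 2000)
```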

We'd need to review the other fields in the spec that are not reflected in the models to determine whether they deserve formal model fields or make more sense as dynamic attributes on the model, or perhaps stuffed into an embedded object similar to vote_breakdowns.

@ghing Can we create a new ticket to centralize discussion of the missing spec fields?

ghing commented 10 years ago

Note to self: I've fallen out of like with the dotted notation of nested fields. I wanted to check that just merging everything without prefixing the field names wouldn't cause any overwriting. I ran:

grep Field openelex/models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/'  | uniq -d

and it didn't show any output, so I think I'm good to just merge stuff in. Now that there's the framework for rewriting field names, this isn't a big problem.

Update: I'm bad at shell. You have to sort the output before running uniq. The command should be:

grep Field models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/' | sort | uniq -d

That means the following fields appear on more than one model:

ghing commented 10 years ago

Merged a first pass into the dev branch from my feature branch.

Closing this as I believe I've implemented the core functionality. To avoid confusion, I'd like to address any further issues/refinements to the baker in separate issues.