Closed zstumgoren closed 10 years ago
Add to_csv and to_json methods to Election, Candidate, Result models. Do not serialize references by default (include_refs=False).
Another (not mutually exclusive) approach would be to make these classes implement a (subset of the?) dictionary interface so the model instances can just be passed directly to `json.dumps` and `csv.DictWriter.writerow()`. I haven't looked deeply at the model implementations or the full bakery requirements to know if this is a good idea, but wanted to capture it somewhere.
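As a minimal sketch of that idea (the mixin and `Result` fields here are invented for illustration, not the actual openelex models): implementing `collections.abc.Mapping` is enough for `csv.DictWriter.writerow()` to accept an instance directly, and makes `json.dumps(dict(instance))` a one-liner.

```python
import csv
import io
import json
from collections.abc import Mapping

class MappingModel(Mapping):
    """Hypothetical mixin exposing a model's fields via the mapping protocol."""
    _fields = ()  # subclasses list their serializable field names here

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)

    def __getitem__(self, key):
        if key not in self._fields:
            raise KeyError(key)
        return getattr(self, key)

class Result(MappingModel):
    _fields = ('state', 'office', 'votes')

    def __init__(self, state, office, votes):
        self.state, self.office, self.votes = state, office, votes

result = Result('MD', 'us-senate', 1234)

# csv.DictWriter accepts any mapping, so the instance passes straight through:
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(result))
writer.writeheader()
writer.writerow(result)

# json.dumps wants a dict, which the Mapping interface provides via dict():
payload = json.dumps(dict(result), sort_keys=True)
```

Real models would list their MongoEngine field names in `_fields`, so this stays a sketch of the interface rather than the implementation.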
Interesting. That could be a very clean approach and it's worth checking out.
Let's also explore MongoEngine's `to_json` queryset method. My main concern with that method is avoiding a major performance hit on serialization of referenced/related objects, such as Candidate and Office references. If memory serves, MongoEngine now supports `select_related`, similar to Django. It would be great if we could spare ourselves some work by leaning on those MongoEngine features, but this definitely needs further exploration.
We'll need to better spec this out, but one thing we'll need is the ability to bake out a customizable subset of fields. We should discuss this with Derek, but at minimum I think we'll need the ability to only bake out clean data (i.e. exclude fields prefixed with "raw_").
I think passing a model instance to serializers is a good way to go, if it's not too difficult. As Serdar says, we'll have some number of standardized attributes for a given `Result`, but it might also make sense to include the `raw` fields just to be transparent about what we've done.
@zstumgoren point noted about leaning on Mongo/MongoEngine for functionality, especially if it's really performant.
Derek -- for the first iteration, then, should we simply bake out everything? And then defer support for customizable data sets to some later point? That works for me if it works for you all.
Works for me.
@zstumgoren, in the spec for this feature, you said:

> outputdir (default: openelections/us/bakery)

Is this what you mean? Or did you mean `openelex/us/bakery`? I ask because, in my checkout, files seem to be getting put under `openelections/openelex/us/`, where `openelections` is the directory where I checked out this repo. For instance, Maryland's cached fetched CSV files are stored in `path/to/openelections/openelex/us/md/cache`.
Yep, I meant openelex/us/bakery. Wanted to put this dir one level up, outside of state dirs (unlike cache), because we'll eventually have nationwide baked files for federal offices. Thanks for catching that!
Ps @ghing - don't feel constrained by the bullet points on this ticket. It just reflects a few minutes of strategizing and probably isn't the best implementation (for example, I don't think it mentions the mongoengine to_json method). Feel free to come up with an alternate implementation as long as it's relatively performant and flexible enough to handle custom mongo queries.
@zstumgoren, I won't be too constrained. In general the bullet points give me a good idea of the interface and the parts of things that need to be parameterized.
Is what needs to be baked out essentially `openelex.models.Result` document instances, or do we need to "join" across a couple of different document types? Maybe a few rows of example output would be helpful, or we can schedule some time to chat about this in the next couple of days?
@ghing We'll need bits of data from Result, Candidate and Contest, as well as Office eventually. This is why I was concerned about dereferencing those related objects. We don't need web-scale performance when baking out these data sets, since the baking will be done async. But they should be able to run in a "reasonable" amount of time (ideally on the order of 10-30 seconds). The pre-loading of related models into dictionaries is one strategy I use for this kind of thing, but it isn't ideal. The queryset `to_json` method, combined with `select_related`, would be much cleaner. If that proves unworkable, we should discuss your alternative of using the dict interface combined with passing model instances to a serializer. We can discuss this Friday at scrum, or if you hit a wall before then, ping me and we'll talk offline.
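The pre-loading strategy mentioned above boils down to a bulk fetch per collection plus in-memory lookups; a rough sketch with stand-in data (namedtuples instead of real MongoEngine documents, and invented field names):

```python
from collections import namedtuple

# Stand-ins for the real documents. With MongoEngine you'd bulk-fetch each
# collection once, instead of dereferencing references per Result.
Candidate = namedtuple('Candidate', 'id name')
Contest = namedtuple('Contest', 'id slug')
Result = namedtuple('Result', 'candidate_id contest_id votes')

candidates = {c.id: c for c in (Candidate(1, 'Smith'), Candidate(2, 'Jones'))}
contests = {c.id: c for c in (Contest(10, 'md-2012-11-06-general'),)}
results = [Result(1, 10, 500), Result(2, 10, 700)]

# Flatten each result by resolving its references against the preloaded dicts:
rows = [
    {
        'candidate': candidates[r.candidate_id].name,
        'contest': contests[r.contest_id].slug,
        'votes': r.votes,
    }
    for r in results
]
```

The trade-off is memory for the lookup dicts in exchange for two queries total rather than two queries per result row.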
@zstumgoren My first impression on playing around with select_related() is that it's still slow. I feel like I can make a better call on this when I understand the output data that we want. Let's chat!
Let's discuss more tomorrow, but a few initial thoughts:
Chatted with @zstumgoren yesterday about the fields in the baker output and here's a few notes:
Things Geoff will look into/we should talk about in scrum:
- Add `help_text` to model fields. Output the field list and help text along with the data and manifest file in the baked output.

Some thoughts:
@dwillis:

> Not sure why Election dates need to be DateTimes

It looks like the reason is that MongoEngine only has `DateTimeField` and not separate field types for Date and DateTime. No big deal.
@ghing Ah, ok then.
@zstumgoren:

It looks like `max_depth=1` vs. `max_depth=2` for `select_related()` doesn't make any difference.

Also, `select_related()` returns a list rather than a QuerySet, so we can't do something like `qs.select_related().to_json()` or `qs.select_related().as_pymongo()`.

I still need to look at the number of queries that actually happen with `select_related()` and maybe the source to get a sense of what's going on.
State-specific bakers are the common case that we're starting with. The mention of nationwide files is more of a future consideration to keep in mind as he develops the bakery API.
Re `DateTimeField`: unlike Django, MongoEngine does not have a plain `DateField` as far as I know. I haven't verified this by looking into the ME source, but I believe it's in keeping with Mongo itself, whose Date BSON type is a UTC datetime.
And another +1 on spitting out `help_text`.
@ghing Ok, so it's starting to sound like `select_related` may not work for us. Let's make a final call once we get a handle on the query counts. Thanks!
@zstumgoren, `select_related()` doesn't seem to generate that many more queries:
Without `select_related()`:

```
>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012')
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
...
0
md-2012-11-06-general-us-congress-06
2
md-2012-11-06-general-us-congress-07
4
```
With `select_related()`:

```
>>> with query_counter() as q:
...     data = Result.objects.filter(state='MD', election_id__contains='2012').select_related()
...     print q
...     print data[0].contest
...     print q
...     print data[100].contest
...     print q
...
24
md-2012-11-06-general-us-congress-06
24
md-2012-11-06-general-us-congress-07
24
```
BTW, that query returns almost 73,000 documents for me. I wonder if the slowdown is less with the queries and more with how things are dereferenced.
I may do some more in-depth profiling to get more insight into this. I might also whip together a quick version that preloads the candidate and contest records to get a quick comparison. I'll have more to report at standup.
@ghing Nice digging on this. Have to say I'm surprised. I assumed we'd see our query count go down, not up, with `select_related`. The numbers are likely even worse than you see, however, because serialization should trigger additional queries for Result.candidate and possibly for Result.candidate.contest.

But yeah, I think we're headed toward a homegrown solution here. Let's discuss more tomorrow.
@zstumgoren, the query count does go down with `select_related()`; the counter being output is the aggregate counter. So, with `select_related()`, it's making 24 queries total. Boom. Done. Without `select_related()` it's making 2 queries for each item accessed, which adds up to a lot.

So, for our use case, where we need to dereference a lot of records, `select_related()` will save us a lot of queries. It just doesn't seem fast enough to me. I'll have more info on this by standup.
After a little more digging, I'm pretty convinced that `select_related()` isn't the way to go. The biggest reason isn't the number of queries, but the expense of building MongoEngine models from pymongo dictionaries only to convert them back to something flat and serializable. I'll explain more during standup.
Still todo:
I put some sample baker output in the "Sample Baker Output" folder in the gdocs.

Right now the field names come from mongo, with a prefix added to indicate their collection. Should the field names be transformed to conform to https://github.com/openelections/specs/wiki?
That would be my preference, but the spec isn't totally frozen, either. Serdar?
Yep, what Derek said. We can also eventually produce a slimmed down output that only includes fields specified in the official spec.
@dwillis, @zstumgoren, as of 1c9e8e760449c2af1dadf540e7d7f7766cfe324d, the field names that we can translate from the mongo names to the spec names are translated in the output and there's an interface to change these translations in a declarative way as the spec changes.
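Not having looked at that commit, a declarative translation table along those lines might look roughly like this (the mapping entries and helper name here are hypothetical, not the actual implementation):

```python
# Hypothetical mapping from mongo field names (prefixed with their
# collection) to the field names in the OpenElections spec.
FIELD_NAME_MAP = {
    'result.vote_count': 'votes',
    'candidate.full_name': 'name',
}

def translate_fields(row, name_map=FIELD_NAME_MAP):
    """Rename keys per the map; keys without a translation pass through."""
    return {name_map.get(key, key): value for key, value in row.items()}
```

Keeping the map as plain data means spec changes only touch the dictionary, not the serialization code.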
As I was implementing this, I found that there were some fields in the spec that aren't in the model definition, e.g. `pct` for Result.
Vote pct won't always be calculated for us, though it's a key value-add that we'd like to make part of the transform step in our data pipeline. If it is provided for us, the transform step should also perform the calculation, which can then be validated in the next phase of the pipeline.
`vote_pct` is important enough that it probably merits its own `raw_vote_pct` and `vote_pct` fields on the Result model. The former would store pct if provided in raw results, otherwise would be empty; the latter would store our calculated pct.

We'd need to review the other fields in the spec that are not reflected in the models to determine whether they deserve formal model fields or make more sense as dynamic attributes on the model, or perhaps stuffed into an embedded object similar to `vote_breakdowns`.
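The calculation itself is simple; a hedged sketch of what the transform step might do (the function name and None-for-unknown convention are assumptions, not the actual model API):

```python
def calc_vote_pct(candidate_votes, total_votes):
    """Return the candidate's vote share as a float in [0, 1].

    Returns None when the total is zero or unknown, so validation can
    distinguish "not calculable" from a genuine 0.0.
    """
    if not total_votes:
        return None
    return candidate_votes / float(total_votes)
```

The validation phase could then compare this calculated value against any pct supplied in the raw results.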
@ghing Can we create a new ticket to centralize discussion of the missing spec fields?
Note to self: I've fallen out of like with the dotted notation of nested fields. I wanted to check that just merging everything without prefixing the field names wouldn't cause any overwriting. I ran:

```
grep Field openelex/models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/' | uniq -d
```

and it didn't show any output, so I think I'm good to just merge stuff in. Now that there's the framework for rewriting field names, this isn't a big problem.
Update: I'm bad at shell. You have to sort the output before running `uniq`. The command should be:

```
grep Field models.py | grep '=' | sed 's/^[ ]*//' | sed 's/\(^.*\) = .*/\1/' | sort | uniq -d
```
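For anyone else who hits this: `uniq -d` only reports duplicates on adjacent lines, which is why the `sort` is required:

```shell
printf 'a\nb\na\n' | uniq -d          # no output; the two a's aren't adjacent
printf 'a\nb\na\n' | sort | uniq -d   # prints: a
```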
That means the following fields appear on more than one model:
Merged a first pass into the dev branch from my feature branch.
Closing this as I believe I've implemented the core functionality. To avoid confusion, I'd like to address any further issues/refinements to the baker in separate issues.