opencivicdata / docs.opencivicdata.org

Open Civic Data project documentation
https://open-civic-data.readthedocs.io

Adding first stab at a first draft of a proposal for campaign finance… #61

Closed aepton closed 7 years ago

aepton commented 7 years ago

… filing models

Just wanted to make sure I was on something like the right track before I filled in more details.

fgregg commented 7 years ago

Thanks for starting on this! paging @boblannon @evz @palewire, @gordonje

When I look at a filing document, I want to know two things:

  1. the information contained within this filing
  2. the business logic that determines what information this filing should include. For example, contributions over $250 should be itemized and include name, address, date of contribution.

Knowing the business rules is critical to interpreting the information in a filing, but I think those business rules should be represented separately.

I've been thinking about a FilingType model that a Filing object has a relation to. These business rules change over time, while the name of the type of filing sometimes does not, so it would be important to be able to represent the business rules as being connected to certain time spans.
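As a rough illustration of that idea, here is a minimal Python sketch in which a FilingType keeps several rule sets, each tied to a time span, and a Filing only points at its type. The class names `RuleSet` and `Filing`, the field names, the dates, and the $100 threshold are all hypothetical; nothing here comes from the proposal itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RuleSet:
    """Business rules in force for a filing type during one time span (hypothetical)."""
    description: str
    start_date: date
    end_date: Optional[date] = None  # None means the rules are still in force

    def in_force_on(self, day: date) -> bool:
        return self.start_date <= day and (self.end_date is None or day <= self.end_date)


@dataclass
class FilingType:
    """A named kind of filing whose rules may change while the name stays the same."""
    name: str
    rule_sets: List[RuleSet] = field(default_factory=list)

    def rules_on(self, day: date) -> Optional[RuleSet]:
        # Return the rule set that applied on the given day, if any.
        for rules in self.rule_sets:
            if rules.in_force_on(day):
                return rules
        return None


@dataclass
class Filing:
    """The filing carries only its own data plus a pointer to its type."""
    filing_type: FilingType
    filed_on: date


# Made-up example data: the thresholds and dates are illustrative only.
quarterly = FilingType(
    name="quarterly",
    rule_sets=[
        RuleSet("itemize contributions over $250", date(2010, 1, 1), date(2014, 12, 31)),
        RuleSet("itemize contributions over $100", date(2015, 1, 1)),
    ],
)
filing = Filing(filing_type=quarterly, filed_on=date(2016, 4, 15))
applicable = filing.filing_type.rules_on(filing.filed_on)
```

Keeping the rules on the type rather than on the filing means a Filing can exist without any FilingType rules being recorded at all.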

aepton commented 7 years ago

Do we even want to represent the business rules at all? That seems like a lot of extraneous work, in particular since, as you said, the rules change all the time.

I was thinking that we want to have a really loose/barebones model for a filing itself, and then some FilingType that describes that filing, but not encode the rules at all. The rules would be implicit based on the data contained in the filing - this Filing, of type ILContributionReport, contains Contributions. That's what we're getting from the Regulator, so that's kind of all that matters - if the rules say "this has to include all contribs over $250", we can't enforce those rules or even (in many cases) know if they're being violated, and keeping up to date with all the legislative changes would be quite a pain.

fgregg commented 7 years ago

@aepton I think that you are right that a Filing object need not be dependent on the existence of FilingType object.

That said, it still might be useful to think about these as distinct models (even if we don't get around to implementing FilingTypes), because it might help us avoid putting business rules into Filing objects.

palewire commented 7 years ago

I'm thrilled to see this ball rolling. @aepton, when you're ready for comments on your early submission please let me know.

aepton commented 7 years ago

Ok, I think this captures where I'd like to start the conversation. Please have at it with any and all types of suggestions/tweaks/fixes/jaw-droppingly-obvious omissions/subtle whatevers/I should probably just end this sentence, you get it.

fgregg commented 7 years ago

@aepton thanks for this great start. I have three general comments at this point.

  1. I think that many of the models that you describe could already be covered by existing OCD models; I've noted the relevant models as in-line comments.

  2. I think the Election and Candidate models should be pulled out into a separate PR. It would be great to get some of the openelex folks to look at that, separately.

  3. There seem to be two philosophies of how to represent contributions and expenditures. Personally, I think that we should treat the contributions that are represented in a filing as claims, not facts. I think that we should represent the data as entered, with minimal inferences about the identity or veracity of those claims.

Basically, this comes down to representing a filing as a denormalized row, versus as the relation between modelled entities. I would strongly prefer the denormalized representation.

I think that we can attach an optional ocd-person-id and ocd-organization-id to the denormalized representations to make downstream processing much easier.
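A minimal Python sketch of what such a denormalized row might look like, assuming hypothetical field names and a placeholder identifier. The reported values are stored as entered, and the two optional ID fields stay empty until someone downstream asserts an identity.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ReportedContribution:
    """One contribution exactly as the filer reported it (a claim, not a fact)."""
    contributor_name: str
    contributor_address: str
    amount: str          # kept as reported text; no normalization applied
    date_reported: str
    # Optional links added downstream; None means no identity has been asserted.
    ocd_person_id: Optional[str] = None
    ocd_organization_id: Optional[str] = None


# Made-up example row.
row = ReportedContribution(
    contributor_name="J. Smith",
    contributor_address="123 Main St",
    amount="300.00",
    date_reported="2016-03-01",
)

# A downstream consumer can attach an identity without altering the claim itself.
# The ID below is a placeholder, not a real record.
linked = replace(row, ocd_person_id="ocd-person/00000000-0000-0000-0000-000000000000")
```

Because the dataclass is frozen, attaching an ID produces a new row and leaves the original claim untouched.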

aepton commented 7 years ago

Thanks, Forrest, this is really awesome and helpful. Updating this PR now and I'll pull out Election and Candidate into a separate proposal. I'm with you on modeling contributions and expenditures - as a data utility, we want to do nothing beyond providing what other folks are claiming. Then as journalists or whomever, we can use this data to model things and make assertions and inferences - and I want to make the latter as easy as possible without compromising the design of the former.

gordonje commented 7 years ago

I've added a few of my own comments (apologies for taking my sweet time!). Overall, really like the direction in which we are headed.

Also wanted to touch on FilingTypes: Are we imagining this as a means of modeling what we at CCDC call the Filing Forms (http://calaccess.californiacivicdata.org/documentation/calaccess-forms/)? These include:

If so, I wonder if "FilingForm" or "FilingFormat" might be a name that more specifically describes this object but is still general enough to cover all cases. Don't mean to quibble too much about names, but the business rules we are talking about (and their changes) are often most clearly described in reference to these forms. To that point: the instructions for completing and submitting the filings often are communicated directly on the forms. At least that has been my experience in CA.

Maybe there are other examples of FilingTypes I'm not thinking about. The raw CAL-ACCESS data also has a concept of "Statement Type", which is meant to represent a real mish-mash of categorizations, including

But maybe a lot of this stuff is accounted for elsewhere, though maybe not as directly as we would like. For example, one can infer the length of the filing period from the filing_coverage_begin and end_date attributes.

aepton commented 7 years ago

@gordonje yeah, I think the Filing Type object is meant to represent what you describe. I think Type is a better name than Form because many of these "forms" aren't really paper forms anymore; a lot is disclosed electronically. I don't want to tie us too closely to the notion of specific forms in particular jurisdictions; it's definitely meaningful to talk about what's in a specific campaign's report, or a last-minute filing report, or a quarterly report, or something like that, so it's worth modeling those in the DB. But they're essentially just bundles of claims responding to a given rule/requirement, so I prefer Type to Form.

aepton commented 7 years ago

Curious what the next step should be - I can't merge this PR, but beyond that, should I start trying to implement a version of this spec, or move on to the campaign entities thing in PR 62? I'm new both to meaningful contributions to open source projects in general, and certainly to how y'all want to move forward on this particular project.

gordonje commented 7 years ago

@aepton yeah, come to think of it, the specific filing forms are probably way further into the weeds than most folks care to be (maybe I just want someone to come find me!). Especially, if you're doing analysis across states/jurisdictions. Categories like "quarterly filing" and "semi-annual filing" are plenty meaningful, and the forms are more like a means of satisfying legal requirements that say like "you have to submit this specific information every quarter" or whatever.

aepton commented 7 years ago

Hm, I guess that kinda implies there's a couple levels here:

Campaign files a form 21A in order to satisfy State X's quarterly requirement.

In response, our model has:

  a Filing from Campaign to a from_organization
  a Filing Type of "stateX-21a"

When maybe it would be better to have:

  a Filing from Campaign to a from_organization
  a Filing Type of "quarterly"
  a Filing Form of "stateX-21a"

with some kind of (implicit or explicit) linkage modeled in the db between FT<->FF.

This seems more expressive, but at the cost of a) probably making a larger surface area of things we have to translate/adapt/model/whatever and b) forcing a lot of things into strange categories, like, if there's no quarterly filing but there is a semi-annual for a given state, what then? tri-annual? What's the purpose of knowing whether a given form is a "quarterly filing"?

palewire commented 7 years ago

@aepton I don't know if recording if a filing is "quarterly" is necessary in all cases, but it is important for a very particular use case: Reconciling "late" filings.

In California and many other states there are greater reporting requirements in the final days before an election where contributions or expenditures of a certain size must be reported within 24 hours. Those are called "late" filings.

During that period, totaling up campaign activity requires combining those records, which rush in at a great rate, with the previously disclosed activity from quarterly filings.

After the election is over, the late filings are typically superseded, amended and expanded by a post-election periodic filing that covers the same period, but at a later date.

The reason I'm unwinding all this is that figuring out which late filings to count and which to discard is a key part of analyzing any campaign finance data set, which requires knowing each filing's type as well as the date range it covers.

fgregg commented 7 years ago

@aepton to your question on moving forward: let me do another iteration with you, and then we will pull in folks like James McKinney, James Turk, and Bob Lannon to give it a pass for compliance with OCD norms.

fgregg commented 7 years ago

@LindsayYoung it would be great to get your feedback on this.

palewire commented 7 years ago

Before we finish, I'd also like to see instructions for handling amendments more clearly spelled out.

I am totally supportive of the broad approach so far that would separate the "real" most recent versions of filings from their previous "versions," but I'm not sure the rules of how that ought to be done have been laid out fully enough for an outsider to read the document and get what we mean.

fgregg commented 7 years ago

@palewire, @aepton I would like to see if the 'versions' approach that OCD takes for bills is practicable.

Something like

versions

    All versions of the filing.

    note
        Note describing the version (e.g. 'Original', 'Amended', etc.)
    date
        The date the version was published in YYYY-MM-DD format (partial dates are acceptable).
    links

        Links to 'available forms' of the version. Each version can be available in multiple forms such as PDF and HTML. (For those familiar with DCAT this is the same as the Distribution class.) Has the following properties:

        url
            URL of the link.
        media_type
            The media type of the link.

    transactions
         ... (all the transaction data associated with a particular version)

    statements
         ... (all the financial statements associated with a particular version)

In this approach, a filing is a unique combination of 1. committee, 2. filing type, 3. filing period.

In a paper system, each concrete paper form is a version of a filing. If there is only one paper form for that filing, then there is only one version.
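The outline above could be sketched in Python roughly as follows. The property names (`note`, `date`, `links`, `url`, `media_type`, `transactions`, `statements`) come from the outline; the class names, the `latest_version` helper, and the example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    url: str
    media_type: str


@dataclass
class Version:
    note: str                      # e.g. "Original", "Amended"
    date: str                      # YYYY-MM-DD; partial dates are acceptable
    links: List[Link] = field(default_factory=list)
    transactions: List[dict] = field(default_factory=list)
    statements: List[dict] = field(default_factory=list)


@dataclass
class Filing:
    versions: List[Version] = field(default_factory=list)

    def latest_version(self) -> Version:
        # Hypothetical helper: ISO-style date strings sort chronologically as text.
        return max(self.versions, key=lambda v: v.date)


# Made-up example: an original filing followed by one amendment.
filing = Filing(versions=[
    Version(note="Original", date="2014-10-01",
            links=[Link("https://example.org/filing-v1.pdf", "application/pdf")],
            transactions=[{"amount": "500.00"}]),
    Version(note="Amended", date="2014-10-20",
            links=[Link("https://example.org/filing-v2.pdf", "application/pdf")],
            transactions=[{"amount": "550.00"}]),
])
```

Each version carries its own complete set of transactions, so earlier versions remain readable after an amendment arrives.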

jsfenfen commented 7 years ago

Hey @fgregg how would that work for 24- and 48-hour filings? Some states have these, but I'm more familiar with the federal rules so I'll cite those. Independent expenditures have to be reported within 24 or 48 hours of being made; in the last 20 days of a campaign, contributions of $1,000 or more to a candidate must be reported within 48 hours. There's not a period associated with this. Also, committees that play in many races and file quarterly have different calendars to consider, and must file pre-election reports (from the last report to 20 days before the election) in potentially dozens of states, so you'll sometimes see filings for totally odd periods. It's bad enough that the FEC has to issue special notices letting everyone know about the odd filing timeline for special elections (like this).

To me this suggests putting greater emphasis on the financial summary information from a filing rather than the filing itself, but not sure if that's the direction y'all are going. And don't get me started on 90-day post inaugural reports.

fgregg commented 7 years ago

@jsfenfen

Great, this makes it clear that a "Filing" is not uniquely identified by

  1. committee,
  2. filing type,
  3. filing period.

Because filing period is an ill-defined thing for some filings.

Just to make sure we are on the same page, the goal of uniquely identifying a filing is so that we can sanely deal with corrections and amendments to a filing.

The thing that uniquely identifies these independent expenditure filings is what?

  1. committee
  2. filing type
  3. period covered by this filing

Is that right? If a correction to an independent expenditure came in, how would you recognize the original filing it is amending?

palewire commented 7 years ago

@fgregg What you sketched out is correct in California. An amendment to a "late" independent expenditure filing would have an identical filing ID and filing type as its predecessor.

A simple example can be found in two filings -- Version 1, Version 2 -- reporting independent expenditures made in our 2014 state controller race. As you can see in those URLs, the filing ID is consistent, and the amendment id increments up one.

I hesitate to guarantee that the filing's date period will be identical in 100% of cases because I can imagine an instance where the content being amended is the date range itself.

Here's where the date ranges are essential: Reconciling these late filings with quarterly filings to generate a correct list of each committee's up-to-the-moment activity.

Most late filings will later be superseded by future quarterly filings and should only be added to a committee's global totals during the period where their date ranges are greater than (i.e. more recent than) the most recent quarterly filing. Once a more recent quarterly filing is submitted, a common practice is to discard that committee's prior late filings.

For that reason, more than handling amendments, I believe it is essential to have datetime values integrated into filing models.
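The reconciliation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: filings carry a `kind` label of "quarterly" or "late", coverage dates are reliable, and amounts are whole dollars. The names `FilingSummary` and `countable_filings` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class FilingSummary:
    kind: str            # "quarterly" or "late" (hypothetical labels)
    coverage_start: date
    coverage_end: date
    total: int           # reported amount in whole dollars


def countable_filings(filings: List[FilingSummary]) -> List[FilingSummary]:
    """Keep periodic filings plus late filings not yet covered by one."""
    periodic = [f for f in filings if f.kind == "quarterly"]
    late = [f for f in filings if f.kind == "late"]
    if not periodic:
        return late
    latest_end = max(f.coverage_end for f in periodic)
    # A late filing still counts only if it covers activity after the
    # most recent periodic filing's coverage period.
    fresh_late = [f for f in late if f.coverage_start > latest_end]
    return periodic + fresh_late


# Made-up example data for one committee.
filings = [
    FilingSummary("quarterly", date(2014, 7, 1), date(2014, 9, 30), 10000),
    FilingSummary("late", date(2014, 9, 15), date(2014, 9, 15), 2000),    # superseded
    FilingSummary("late", date(2014, 10, 20), date(2014, 10, 20), 5000),  # still counts
]
running_total = sum(f.total for f in countable_filings(filings))
```

Without the coverage dates on each filing there would be no way to decide which late filings to drop, which is the point being made about datetime values.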

jsfenfen commented 7 years ago

@fgregg "The thing that uniquely identifies these independent expenditure filings is": an agency-generated filing id, for electronic filings. An amended electronic filing in fedland has the id of the filing it amends in its header. Also the file spec doesn't even include a start range; folks who build sites that display that need to compute it themselves. Perversely, the min/max date could be the thing they amended (if they amend the disbursement date, which is not as rare as you might think). At one point they added a date field (maybe disbursement date vs communication date; I forget the details, I think it was v6.0->v6.1).

fgregg commented 7 years ago

@jsfenfen, okay. this is going to be a pretty sticky wicket, because some states, like Illinois do not provide a filing id that can be used in this way that you describe.

Let me attempt to summarize.

Filings with filing period

There are filings that really do have a filing period. The filing is supposed to represent all* financial activity between a regulator-defined start date and end date (*where "all" means all types of activity that the regulator requires).

The exemplar here is quarterly reports.

These types of filing should be uniquely identified by committee, filing type, and filing period.

In fedland, there would be a one-to-one correspondence between filing id on one hand and a unique combination of committee, filing type, and filing period on the other.

Episodic Filings

There are filings that are, shall we say, episodic. If a particular type of financial activity happens, it triggers a filing within some period of time.

Absent a regulator-provided filing id there is no reliable way to identify a unique filing, for purposes of grouping original forms and amended forms.
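To make the summary concrete, here is a small Python sketch of how a grouping key might be chosen under these two cases. It assumes records are plain dictionaries with hypothetical keys (`regulator_filing_id`, `committee`, `filing_type`, `period_start`, `period_end`); records that have neither a regulator ID nor a filing period are set aside as unmatched.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def filing_key(record: dict) -> Optional[Tuple]:
    """Return a key that groups an original with its amendments, or None."""
    if record.get("regulator_filing_id"):
        # A regulator-provided ID is the most reliable identifier.
        return ("id", record["regulator_filing_id"])
    if record.get("period_start") and record.get("period_end"):
        # Periodic filings: committee + filing type + filing period.
        return ("period", record["committee"], record["filing_type"],
                record["period_start"], record["period_end"])
    # Episodic filing with no regulator ID: no reliable key exists.
    return None


def group_versions(records: List[dict]) -> Tuple[Dict[Tuple, List[dict]], List[dict]]:
    grouped = defaultdict(list)
    unmatched = []
    for record in records:
        key = filing_key(record)
        if key is None:
            unmatched.append(record)
        else:
            grouped[key].append(record)
    return dict(grouped), unmatched


# Made-up example records.
records = [
    {"committee": "A", "filing_type": "quarterly",
     "period_start": "2016-01-01", "period_end": "2016-03-31", "note": "Original"},
    {"committee": "A", "filing_type": "quarterly",
     "period_start": "2016-01-01", "period_end": "2016-03-31", "note": "Amended"},
    {"committee": "A", "filing_type": "independent-expenditure",
     "regulator_filing_id": "12345", "note": "Original"},
    {"committee": "A", "filing_type": "independent-expenditure", "note": "Original"},
]
grouped, unmatched = group_versions(records)
```

The unmatched list is where the hard cases land: episodic filings from jurisdictions that provide no filing ID.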

jsfenfen commented 7 years ago

Hmm, @fgregg, you're right this is pretty different...

The thing that I'm usually trying to figure out on this stuff can be boiled down to one question: is this filing live? (Which I think @palewire got at in another comment.) Or, relatedly, should I count it when creating totals? (I'm not sure how much either of those is a consideration for this process, I should note.)

What if there was just a binary flag for whether the filing is the most recent copy or has been outdated (or superseded by a periodic filing). And so, ahem, it would be up to users (I guess that would be dorks submitting lightly sauteed filing data) to (optionally) come up with the order of amendments, or just a simple statement of which one is live, in a jurisdiction that doesn't explicitly state that.

fgregg commented 7 years ago

In order to have that kind of flag you need to figure out which filings are redundant. And figuring that out is exactly what we are talking about right now.

aepton commented 7 years ago

I think we can capture this with a combination of @fgregg's suggestion of actions on filings and @jsfenfen's is_current flag.

When a filing is submitted it comes with an action, first_submission. A subsequent amendment gets attached to that filing. Both actions come with lists of all the transactions in that version of the filing, and each action also gets an is_current flag - set to true when the action first comes across the transom in most cases, though the jurisdictional parsers are responsible for that.

So now, given the filing and its amendment, you just look for the action marked is_current and the transactions there are all current.

Now let's say that the same committee files a quarterly report, which overwrites all the disclosures like our example. Upon receipt of report, the jurisdictional parser goes into the system and marks each action on each of the superseded filings is_current=False.
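A minimal Python sketch of that flow, with all class and function names hypothetical. It assumes that a new action on a filing makes earlier actions on the same filing non-current, and that a separate step (standing in for the jurisdictional parser) marks superseded filings when a quarterly report arrives.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    description: str               # e.g. "first_submission", "amendment"
    transactions: List[str]
    is_current: bool = True


@dataclass
class Filing:
    filing_type: str
    actions: List[Action] = field(default_factory=list)

    def add_action(self, action: Action) -> None:
        # Assumption: a newer action replaces whatever was current before it.
        for earlier in self.actions:
            earlier.is_current = False
        self.actions.append(action)

    def current_transactions(self) -> List[str]:
        return [t for a in self.actions if a.is_current for t in a.transactions]


def supersede(filings: List[Filing]) -> None:
    """Mark every action on the given filings as no longer current."""
    for filing in filings:
        for action in filing.actions:
            action.is_current = False


# A late filing and its amendment.
late = Filing("late")
late.add_action(Action("first_submission", ["t1"]))
late.add_action(Action("amendment", ["t1-corrected"]))
before = late.current_transactions()

# A quarterly report arrives covering the same activity.
quarterly = Filing("quarterly")
quarterly.add_action(Action("first_submission", ["t1-corrected", "t2"]))
supersede([late])
after = late.current_transactions()
```

Deciding which filings to pass to `supersede` is the jurisdiction-specific part, and it is the question the thread leaves open.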

fgregg commented 7 years ago

@aepton I think there are a number of questions to be resolved, and I think we are at a point where progress will be furthered by an attempted implementation.

palewire commented 7 years ago

@fgregg We are currently working through the process of refining our raw data into humanized models at the django-calaccess-processed-data repository.

Is there a particular piece we should try to implement as a first pass?

We're currently coming in at the problem around the edges and are closest to an "Election" model like has been discussed in #62.

fgregg commented 7 years ago

@palewire

The current reference implementations for OCD models live at https://github.com/opencivicdata/python-opencivicdata-django/tree/master/opencivicdata/models

It would be great to do a couple of things as you work on the calaccess data.

  1. Attempt to use the existing models in that repo for Organizations (including committees), Posts (which are what we typically call offices), and People.
  2. If you want to work on Election stuff next, take over the work that I just barely started on some models here: https://github.com/datamade/docs.opencivicdata.org/blob/elections/proposals/drafts/elections.rst

jungshadow commented 7 years ago

If you want to work on Election stuff next, take over the work that I just barely started on some models here: https://github.com/datamade/docs.opencivicdata.org/blob/elections/proposals/drafts/elections.rst

👍 Really like that you stuck close to the @votinginfoproject specification on that proposal. You may already know this, but, in turn, @votinginfoproject is collaborating with NIST and their public working groups. Hopefully, all these different-but-related lines of work stay in sync.

aepton commented 7 years ago

I'm happy to start work on a reference implementation of this proposal for Washington state. I have some work to do on my platform before I'm ready to start, but I should be able to get on it soon. Does anything else need to happen for this PR to be merged?

fgregg commented 7 years ago

I think it's ready to be merged as a proposal, but I don't have the permission bits for that.

attn @jamesturk @jpmckinney

jpmckinney commented 7 years ago

Note: I haven't read the full thread. Just reading the document and searching through the comments:

Filing

  1. cf-filing: Why not campaign-finance-filing to avoid opaqueness and/or ambiguity?
  2. committee and regulator: Any objection to generalizing to sender and recipient? "committee" is not a universal way of referring to the organizations submitting filings. This would also make sense if we later introduce other types of filings.
  3. coverage_begin_date and coverage_end_date: Why not simply valid_from and valid_until, or start_date and end_date? Anyway, start would be more consistent with other classes than begin.
  4. inciter: This seems like an unusual choice of term. Why not agent?
  5. invalidates_prior_versions: supersedes is more common / appropriate than invalidates.
  6. is_current: Boolean flags tend to be a bad pattern in data schemas. Data should strive to be 'add-only'. From what I can read of the discussion, the logic for this field is somewhat complicated. What are some alternatives to achieving the desired outcomes?
  7. relevant_election: Just make it election - the schema doesn't care about irrelevant elections. Anticipating future filing classes with which this class should have common properties, we can consider making this more generic, like context or legislative_context.
  8. responsible_person: Can someone expand on the semantics of this property? Is it different from 'contact person'?

Committee

This should be a subclass of Popolo's Organization. From what I can tell, only status is a new property (or should it be statuses since it is an array?). begin_date should be start_date to be consistent with all other classes. For sub-objects like this, note is more common for description.

I'm not sure why committee type is its own object. Perhaps in terms of the code implementation it makes sense to have a code list as an object, but in terms of the schema, a controlled vocabulary can be used for a committee's classification property.

With respect to a committee type's jurisdiction, that actually has to do with a registration that the committee has with a registrar in a particular jurisdiction. So, I would model that as a registration, not as some de-normalized property on a committee type.

Candidate Designation

I don't see any property on other classes that has designations as its range (possible value). How do other classes connect to this class?

Person

Person in Popolo is a real person, so you can't use it for corporations...

Filing Type

See comments about committee types.

Transaction

jpmckinney commented 7 years ago

@LindsayYoung Where can I see the FEC's schemas?

jpmckinney commented 7 years ago

Re: new elections models, see my comment https://github.com/popolo-project/popolo-spec/issues/104#issuecomment-268688640 Anyway, let's not have an Elections discussion in this already-long issue! Please create a new issue.

LindsayYoung commented 7 years ago

Great question @jpmckinney

Here are the API schemas: https://api.open.fec.gov/swagger/

Click through to the metadata for the other FEC schemas http://www.fec.gov/data/DataCatalog.do

aepton commented 7 years ago

@jpmckinney

Filing

  1. Went with campaignfinance there and elsewhere.
  2. I think "committee" is a better abstraction than "sender", but "filer" is better still, imho. No objection to "recipient", though I'm not sure who receives filings besides regulators. Changed.
  3. Switching to coverage_start_date. I don't think "valid" is quite the right word here - the notion here isn't one of validity, but simply the period of time a filing describes. I think "coverage" captures that, and "valid" introduces some ambiguity - if the "valid_end_date" is before the current date, is that filing now invalid?
  4. It was initially "responsible_person"; "inciter" is more general and "agent" is more general still. Changed.
  5. Changed.
  6. This is one area where the thing being captured is inherently hard to pin down. Almost every filing will have an is_current=True set when first filed, and the system is then responsible for keeping that flag up to date. I think that system compartmentalizes the responsibility of determining which set of filings is "current" without adding unnecessary complexity elsewhere, or introducing further ambiguity. We could remove it and leave all such decisions up to each user/dependent system, but I think that would cause more trouble than it would resolve. Alternatively, we could model a set of filings and their currentness apart from the Filing model, but I think that adds more complexity as well without being a better solution.
  7. Changed to "election". I'd like to avoid overgeneralizing from the get-go; this is easy to make more general, should we eventually go down that path, but I think "election" is clear and concise in this context in a way "context" isn't, and certainly "legislative context" isn't.
  8. It's really the same as "agent" and I'm not sure why we kept it separate, now. I'll remove it; the "agent" field in the "actions" should be able to capture this comprehensively.

Committee

Changed to start_date, note and statuses. Added note about making this a subclass of Organization; should we just provide the fields that are different here then?

I think committee_type should be its own object because any given jurisdiction will have several different types that don't necessarily translate cleanly across jurisdictions. And in cases where they do, the rules will nevertheless be different - candidate committees in WA have different rules applied to them than do candidate committees in IL, for instance.

Registration filings should be captured by the Filing object; the jurisdiction filed here is meant to reflect which locality(ies) a committee belongs to, and hence, which laws apply to it (among other things).

Candidate Designation

That was an oversight; added a field for that to Committee.

Person

What should we use here, then? Subclass of Popolo Person for "campaign finance persons" who, thanks to our Supreme Court, may in fact be corporations? This is an ambiguity not easily resolved; most of the time from what I've seen, looking at a given transaction it's impossible to tell if it's a person or a corporation unless you're a human using human heuristics that I'm uncomfortable emulating in this system.

Filing Type

These are useful to model the actual filings committees submit, which have meaning in various contexts, and may help us construct the is_current_filing chain (certain types get superseded by other types, in certain states, at certain times of day, with Venus in the appropriate phase, etc.) And these filings vary titanically from state to state, so I think they're worth modeling as first-class objects.

Transaction

  1. Well, I think this is more specifically saying, "to which action on a Filing does this transaction belong" but the description didn't make that clear, so I updated it.
  2. Done.
  3. No reason; fixed.
  4. Nice. Fixed.
  5. Fixed.
  6. Fixed.
  7. Fixed.

jsfenfen commented 7 years ago

@jpmckinney The spec for the actual forms that filers submitted are detailed here http://www.fec.gov/elecfil/vendors.shtml, though it helps to know a bit about the rules for submitting them.

jsfenfen commented 7 years ago

@aepton @jpmckinney +1 for filer rather than committee, because in some jurisdictions folks who have to file campaign finance reports are explicitly not committees, and do not have to register as such (and there's a number of ongoing lawsuits arguing that some filers really should be committees subject to committee rules, etc.)

jpmckinney commented 7 years ago

Is this spec targeting only the FEC? My understanding was the goal was broader.

Otherwise I can do one more look over and merge.

palewire commented 7 years ago

@jpmckinney This pull request was started by @aepton after we discussed common challenges dealing with Washington state and California campaign finance data. Our goal is for this schema to work with statehouses as well as the federal data as much as possible.

aepton commented 7 years ago

@jpmckinney Yeah, +1 to what @palewire said. I'd love it to work with any campaign finance situation, ideally - the Toronto civic data folks seemed interested, for instance.

aepton commented 7 years ago

Anything else need to be done for this, or can it be accepted?

jpmckinney commented 7 years ago

@aepton I was going to do one more read-through - ideally this weekend.

aepton commented 7 years ago

Just pinging this :)

jpmckinney commented 7 years ago

Merging the draft 🎉

Going to follow-up in new issues/PRs.

jpmckinney commented 7 years ago

Who are the primary contacts among the contributors to this thread for future modification of this OCDEP?

fgregg commented 7 years ago

@jpmckinney I'm not sure what you are asking?

jpmckinney commented 7 years ago

I just want to know whom to keep in the loop. I don't want to @ everyone in every issue/PR I open unless everyone wants me to.