palewire / django-calaccess-raw-data

A Django app to download, extract and load campaign finance and lobbying activity data from the California Secretary of State's CAL-ACCESS database
http://django-calaccess.californiacivicdata.org/
MIT License
64 stars 143 forks source link

Add choices and documentcloud_pages to FilerStatusTypesCd.status_type (in other.py) #1440

Closed gordonje closed 8 years ago

gordonje commented 8 years ago

Overview

Add choices and documentcloud_pages to the status_type field on FilerStatusTypesCd the database model.

Basically, we think status_type should be defined as a "choice field" (in Django parlance), which just means it's a field with a defined list of valid values and an intelligible definition for each value.

Our goal is for every potential choice field to include references to locations in the official documentation where the field's valid values and their meanings are described AND to have those choices included in the field's definition.

And we need your help!

What to do

Step 1: Claim this ticket by adding yourself as an Assignee (to the right)

Step 2: Find where in the official documentation status_type's list of valid values is defined.

The best place to look first is in the .CAL layout documents: .CAL Format version 2.01 and .CAL Format version 1.05.02. These are documents that describe the layout of the .CAL file format, which is the required electronic format of any filings ingested by the CAL-ACCESS system (more on that here if you are interested).

You can start by opening the document and just searching for "status_type". For example, if you were assigned to tackle the stmt_type field on the CvrCampaignDisclosureCd model, you would be looking for something like this::

Docs

But note that the same column might appear on multiple tables, and the list of valid values may vary depending on the table. We want to be sure to find the list of valid values for the status_type on the FilerStatusTypesCd model.

Another good place to look is in the MapCalFormat2Fields document, which describes how .CAL format fields are mapped to CAL-ACCESS database table columns. Note the _ValidV column on the far right.

Step 3: Copy the DocumentCloud ID and the page number(s) where status_type's list of valid values is defined.

The DocumentCloud ID can be found the in the URL that points to the document. For example, the url to the .CAL Format version 2.01 doc is http://www.documentcloud.org/documents/2712034-Cal-Format-201.html#document/p1, and the id is 2712034-Cal-Format-201.

And the start page we want to grab is the DocumentCloud page number, not the page number printed on the .PDF. If the list of valid values spans multiple pages, then grab the page where the list ends as well.

Step 4: Find where status_type is defined in other.py in the /calaccess_raw/models/other.py file underneath the FilerStatusTypesCd class.

Since the model files are rather long, take care to be sure you find the where status_type is defined for the FilerStatusTypesCd model. It should look something like this:

    class FilerStatusTypesCd(CalAccessBaseModel):
        # ...
        # several lines of code later...
        # ...
        status_type = fields.CharField(
            max_length=11,
            db_column='STATUS_TYPE',
            blank=False,
            verbose_name='status type',
            help_text='This field is undocumented',
        )

Step 5: Add the document references

Say you found the list of valid values listed in three places. You would list them in the documentcloud_pages attribute like this:

    class FilerStatusTypesCd(CalAccessBaseModel):
        # ...
        # several lines of code later...
        # ...
        status_type = fields.CharField(
            max_length=11,
            db_column='STATUS_TYPE',
            blank=False,
            verbose_name='status type',
            help_text='This field is undocumented',
            documentcloud_pages=[
                DocumentCloud(id='2711616-MapCalFormat2Fields', start_page=10),
                DocumentCloud(id='2712033-Cal-Format-1-05-02', start_page=19),
                DocumentCloud(id='2712034-Cal-Format-201', start_page=24, end_page=25),
            ]
        )

Note: You only need to specify the end_page if the list spans multiple pages. Step 6: Update the status_type's help_text (if necessary)

It might be missing or it might be an empty string or it might not be consistent with what you see in the documentation. Note: The entire help_text string needs to be enclosed in single- or double-quotes like this:

    help_text='This is the help text',

If the help text is really long (80+ characters), then it needs to span multiple lines like this:

    help_text='This is the help text...blah...blah...blah...blah...blah...blah...\ 
...blah...blah...blah',

Step 7: Define the choices

The normal Django pattern is to define choices within the model class as a tuple which contains a bunch of two item tuples. The first item of each inner tuple is the valid database column value, and the second item is the definition. It should look something like this:

    STATUS_TYPE_CHOICES = (
        ('AAA', 'Some definition of AAA'),
        ('BBB', 'Some definition of BBB'),
        ('CCC', 'Some definition of CCC'),
        # ...and more..
    )

Then we assign this tuple of tuples to 's choice argument with a single line: choices=STATUS_TYPE_CHOICES.

When finished, it should all look something like this:

    class FilerStatusTypesCd(CalAccessBaseModel):
        # ...
        # several lines of code later...
        # ...
        STATUS_TYPE_CHOICES = (
            ('AAA', 'Some definition of AAA'),
            ('BBB', 'Some definition of BBB'),
            ('CCC', 'Some definition of CCC'),
            # ...and more..
        )
        status_type = fields.CharField(
            choices=STATUS_TYPE_CHOICES,
            max_length=11,
            db_column='STATUS_TYPE',
            blank=False,
            verbose_name='status type',
            help_text='This field is undocumented',
            documentcloud_pages=[
                DocumentCloud(id='2711616-MapCalFormat2Fields', start_page=10),
                DocumentCloud(id='2712033-Cal-Format-1-05-02', start_page=19),
                DocumentCloud(id='2712034-Cal-Format-201', start_page=24, end_page=25),
            ]
        )

So some of these choice fields show up on multiple tables (the various office_cd fields are a good example), and of course it's rather redundant to repeat these values and their definitions over and over. In some these cases, we've consolidated these valid values into one place: calaccess_raw/annotations/choices.py.

So then instead of listing out all the choices in the tuple of tuples, we can do this:

    OFFICE_CD_CHOICES = get_sorted_choices(choices.OFFICE_CODES)
    office_cd = fields.CharField(
        choices=OFFICE_CD_CHOICES,
        # ...other stuff...
    )

If you think the field your looking at is a candidate for this sort of consolidation, let's have a conversation about it. Just post a comment below.

If you can, check the status_type's defined choices against the actual values in status_type's database column.

This will require you to install the app and run the updatecalaccessrawdata command. Then you can get the distinct column values with a GROUP BY query like this:

    SELECT STATUS_TYPE, COUNT(*)
    FROM FILER_STATUS_TYPES_CD
    GROUP BY 1
    ORDER BY 1;

It's likely that you'll find undocumented values in database column, some of which are variants of the valid values. You can note these by repeating the same definition for each variant like this:

    OFFICE_CD_CHOICES = (
        ('GOV', 'Governor'),
        ('gov', 'Governor'),
        ('GUV', 'Governor'),
        # ...
    ),

If you're dealing with a set of choices that's been consolidated into calaccess_raw/annotations/choices.py, you can map the different variants to the same definition using the key of the choices dict like this:

    OFFICE_CD_CHOICES = (
        ('gov', look_ups.OFFICE_CODES['GOV']),
        ('GUV', look_ups.OFFICE_CODES['GOV']),
        # ...
    ),

If you're not able to check the database column values for whatever reason, don't sweat it. We'll deal with it in a later issue.

Wrap up: Review your changes and create a pull request

Pull request

That's it!

PS: If any of this feels confusing, or if it doesn't seem to match up with what you are seeing, or if you have any suggestions about another approach, don't hesitate to speak up. You can post a comment below, or hit up the #california-civic-data channel on the News Nerdery Slack.

gordonje commented 8 years ago

FilerStatusTypesCd is actually a look-up table.