palewire / django-calaccess-campaign-browser

A Django app to refine, review and republish campaign finance data drawn from the California Secretary of State’s CAL-ACCESS database
http://django-calaccess-campaign-browser.californiacivicdata.org
MIT License
17 stars 12 forks source link

different approach to removing data from filings' early amendments #203

Open rkiddy opened 9 years ago

rkiddy commented 9 years ago

I am trying to understand what you guys are doing and I think I can make this suggestion, reasonably, based on what I am seeing now.

It appears that you all build up the filings table by pulling part of the data out of the FILER_FILINGS_CD table. You only use filings for 3 forms, the 450, the 460 and the 497. And then you mark the early amendment filings with the is_duplicate flag. But then you also set the is_duplicate flag for the rows where the PERIOD_ID does not match to your table for the periods.

You do not delete the rows marked with is_duplicate but I do not see in other code where you filter these rows out of results. But I may just be not seeing it. Regardless, it appears you are keeping the early amendment rows in the same table as the others.

I wanted to suggest an approach I was working on, which is to partition all tables that contain a FILING_ID by several criteria. The first criteria was the early amendment. So, for example, I created a "expn_lows" table next to my "expn" table and moved all early amendment data to that table so that those rows would not appear in the "expn" table.

First, I add a amend_id_high column to the filer_filings table and run SQL like this on all of the tables that contain a FILING_ID column and as AMEND_ID column:

--------------
update filer_filings f1,
            (select filing_id, max(amend_id) as max 
                  from cvr_campaign_disclosure group by filing_id) as a1
            set f1.amend_id_high = a1.max 
            where f1.filing_id = a1.filing_id and 
                  (f1.amend_id_high is NULL or f1.amend_id_high < a1.max)
--------------

Query OK, 190135 rows affected (53.69 sec)
Rows matched: 190135  Changed: 190135  Warnings: 0

--------------
update filer_filings f1, 
            (select filing_id, max(amend_id) as max 
                   from cvr_e530 group by filing_id) as a1 
            set f1.amend_id_high = a1.max 
            where f1.filing_id = a1.filing_id and 
                   (f1.amend_id_high is NULL or f1.amend_id_high < a1.max)
--------------

Query OK, 187 rows affected (0.68 sec)
Rows matched: 187  Changed: 187  Warnings: 0

I think having the high amend_id value associated clearly with the filer_filings row is useful and will be more useful when other forms get analysed.

Just FYI, I then partition each table into years, so I have "expn_2000", "expn_2001", ..., "expn_2015" and "expn_else" (for everything < 2000 and > 2015). And then I was going to start partitioning the tables by form_id. This would end up creating a lot of tables, but on the other hand, it would be pretty obvious what the "F460_EXPN_2014" table was supposed to represent, where it would be used and it could make it easier for apps to access the data more quickly, assuming they would search within a year. This might be good, yes?

I am not asking for someone else to do work based on this suggestion. But I would like to know whether you all think that this might be a reasonable thing to consider.