palewire / django-calaccess-processed-data

A Django app to transform and refine campaign-finance data from the California Secretary of State’s CAL-ACCESS database
http://django-calaccess.californiacivicdata.org/
MIT License
9 stars 43 forks source link

Scraped candidates without scraped IDs #28

Closed gordonje closed 6 years ago

gordonje commented 7 years ago

On the Candidates & Elected Officials page in CAL-ACCESS, some names are not hyperlinked and, thus, are collected without a scraped ID. We're talking roughly 40 percent of the collected candidates (about 2,400 records).

Can we figure out what's going on with these?

gordonje commented 7 years ago

One example is KEVIN D MELTON who ran for assembly seat #78 in the 2016 primary and general elections.

Querying the FILERS_CD table like so:

SELECT *
FROM "FILERNAME_CD"
WHERE UPPER("NAML") || UPPER("NAMF") LIKE '%MELTON%KEVIN%';

There are five records that seem to map to this guy:

FILER_ID  XREF_FILER_ID  NAML    NAMF     FILER_TYPE                     STATUS  EFFECT_DT   ADR1                           
--------  -------------  ------  -------  -----------------------------  ------  ----------  -----------------------------  
1382941   1382941        MELTON  KEVIN    TREASURER/RESPONSIBLE OFFICER  ACTIVE  2016-02-16  4876 SANTA MONICA AVE #148     
1382943   1382943        MELTON  KEVIN    CANDIDATE/OFFICEHOLDER         ACTIVE  2016-02-16  4876 SANTA MONICA AVE #148     
1358738   1358738        MELTON  KEVIN D  CANDIDATE/OFFICEHOLDER         ACTIVE  2015-04-27  4876 SANTA MONICA AVENUE #148  
1358738   1358738        MELTON  KEVIN D  CANDIDATE/OFFICEHOLDER         ACTIVE  2015-04-27  4876 SANTA MONICA AVENUE #148  
1358738   1358738        MELTON  KEVIN D  CANDIDATE/OFFICEHOLDER         ACTIVE  2015-04-27  4876 SANTA MONICA AVENUE #148  

No errors when passing any one of these FILER_ID values into the query string for Campaign/Committees/Detail.aspx:

These web pages all include the message:

This committee has not electronically filed a Form 460/461/450 for this election cycle. For further information, click on prior sessions to see if historical filings are available. Also check for late contribution filings if a major filing deadline has not yet occurred for this election cycle.

Our derived calaccess_processed_candidatecommittee table only has one mapping between one of these candidate filer_ids and a committee filer_id:

id     candidate_filer_id  committee_filer_id  link_type_id  link_type_description              first_session  last_session  first_effective_date  last_effective_date  first_termination_date  last_termination_date  
-----  ------------------  ------------------  ------------  ---------------------------------  -------------  ------------  --------------------  -------------------  ----------------------  ---------------------  
24253  1358738             1382942             12011         CANDIDATE CONTROLS THIS COMMITTEE  2015           2015          2016-02-16            2016-02-16           (null)                  (null)                 

I also checked the FILER_LINKS_CD table from which we derive the above records. This appears to be the only candidate-to-committee mapping.

Here's where it starts to get a little weird: There actually are four Form 460 filings from this committee in 2016:

id       FILER_ID  FILING_ID  PERIOD_ID  FORM_ID  FILING_SEQUENCE  FILING_DATE  STMNT_TYPE  STMNT_STATUS  SESSION_ID  USER_ID    SPECIAL_AUDIT  FINE_AUDIT  RPT_START   RPT_END     RPT_DATE  FILING_TYPE  
-------  --------  ---------  ---------  -------  ---------------  -----------  ----------  ------------  ----------  ---------  -------------  ----------  ----------  ----------  --------  -----------  
1930059  1382942   2090713    (null)     F460     0                2016-10-07   10005       11003         2015        MPACHECO   11003          11003       2016-07-01  2016-09-24  (null)    (null)       
1902735  1382942   2072025    (null)     F460     0                2016-07-18   10005       11001         2015        MDIGRAZIA  11003          11001       2016-05-22  2016-06-30  (null)    (null)       
1884296  1382942   2052405    (null)     F460     0                2016-06-06   10005       11001         2015        MDIGRAZIA  11003          11001       2016-04-24  2016-05-21  (null)    (null)       
1867511  1382942   2045078    (null)     F460     0                2016-05-11   10005       11001         2015        MDIGRAZIA  11003          11001       2016-01-01  2016-04-23  (null)    (null)       

But there aren't any records with these filing_ids in CVR_CAMPAIGN_DISCLOSURE_CD, RCPT_CD or EXPN_CD

SELECT 'CVR_CAMPAIGN_DISCLOSURE_CD' AS tablename, COUNT(*) AS the_count
FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
WHERE "FILING_ID" IN (
    2090713,
    2072025,
    2052405,
    2045078
)
UNION
SELECT 'RCPT_CD' AS tablename, COUNT(*) AS the_count
FROM "RCPT_CD"
WHERE "FILING_ID" IN (
    2090713,
    2072025,
    2052405,
    2045078
)
UNION
SELECT 'EXPN_CD' AS tablename, COUNT(*) AS the_count
FROM "EXPN_CD"
WHERE "FILING_ID" IN (
    2090713,
    2072025,
    2052405,
    2045078
);

Which may be why they are suppressed by the website? Or are these being exclude because of some particular value in some particular column (I looked at FILER_FILINGS_CD.STMNT_STATUS, but it is not that.

But at least our derived processed models are consistent with the website in excluding these filings. Might be worth looking at other examples of scraped candidates who did not have hyperlinks.

gordonje commented 7 years ago

Couple more observations...

Pass the any of these Form 460 filing_ids into the query string for the CAL-ACCESS pdfgen tool, and you get a CGI error:

Also, if you look these filing_ids up in the FILINGS_CD table, all of them have FILING_TYPE value of 22005, which maps to something called "AMS". According to the CAL-ACCESS overview this stands for Agency Management Subsystems. Based on the workflow outlined in the diagram, I suspect AMS filings are all manually entered by staff members of the Secretary of State, Political Reform Division.

gordonje commented 7 years ago

The bottom line question, which we seem incapable of answering on our own, is "How much money did Kevin Melton raise and spend while running for the state assembly in 2016?"

According the FPPC's current manual, candidates are supposed to file semi-annual campaign statements regardless of whether or not they raise or spend money.

So what exactly happened here? These are the only scenarios we can imagine:

  1. Melton for State Assembly 2016 failed to honor this rule, though it seems like they made an attempt?
  2. Melton for State Assembly 2016 did it's part, but the CAL-ACCESS system failed to retain their data.
palewire commented 6 years ago

While we do not know why the SoS does not publish an ID on the website, we have fixed this issue on our backend by reconciling the scrape with Form 501s and other CAL-ACCESS tables.