palewire / django-calaccess-raw-data

A Django app to download, extract and load campaign finance and lobbying activity data from the California Secretary of State's CAL-ACCESS database
http://django-calaccess.californiacivicdata.org/
MIT License

Idea for correcting data #1507

Closed rkiddy closed 2 years ago

rkiddy commented 6 years ago

I have been examining some of the errors in the data. Since the SoS publishes the data, only ever adds to it, and will never correct a line once it is published, this app really should be able to correct the data.

I will see about proposing a PR for this. My idea now is to have a database of known errors to correct. The update method, while converting the TSV file to a CSV file, would apply the corrections as directed by this data. Here is a sample:

mysql> select * from import_corrections;
+------+------------------------------+----------+--------------+---------------+
| pk   | file_name                    | line_num | action       | payload       |
+------+------------------------------+----------+--------------+---------------+
|    1 | example/data/tsv/EXPN_CD.TSV |  1244682 | REMOVE_FIELD | 40 of 54      |
|    2 | example/data/tsv/EXPN_CD.TSV |  1761310 | REMOVE_FIELD | 8 of 54       |
|    3 | example/data/tsv/EXPN_CD.TSV |  2592070 | REMOVE_FIELD | 8 of 54       |
|    4 | example/data/tsv/EXPN_CD.TSV |  2805048 | REMOVE_FIELD | 21 of 54      |
|    5 | example/data/tsv/EXPN_CD.TSV |  2805052 | REMOVE_FIELD | 21 of 54      |
|    6 | example/data/tsv/EXPN_CD.TSV |  2806444 | REMOVE_FIELD | 21 of 54      |
|    6 | example/data/tsv/EXPN_CD.TSV |  2807178 | REMOVE_FIELD | 21 of 54      |
|    6 | example/data/tsv/EXPN_CD.TSV |  2807179 | REMOVE_FIELD | 21 of 54      |
|    6 | example/data/tsv/EXPN_CD.TSV |  2808470 | REMOVE_FIELD | 21 of 54      |
+------+------------------------------+----------+--------------+---------------+
9 rows in set (0.00 sec)

Obviously there may be other types of corrections needed, but I have not seen any yet. For now, removing an empty field seems to be the only correction needed. If other correction methods are needed, they should also be describable via an "action" and a "payload". The payload may, of course, carry different types of information for other actions. For example, we may need to correct a misspelling, and doing that would be simple enough.

The line "example/data/tsv/EXPN_CD.TSV 2808470 REMOVE_FIELD 21 of 54" specifically means: in file EXPN_CD.TSV, at line 2808470, remove field 21 if 54 fields are found.

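Each correction would be idempotent: because the field is removed only when the line has the stated number of fields, applying the correction again to an already-corrected line changes nothing.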
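To make the idea concrete, here is a minimal Python sketch of how a REMOVE_FIELD correction might be applied while reading a TSV file. It is not code from this project. The function names `remove_field` and `apply_corrections`, and the dictionary form of the corrections table, are hypothetical. It also assumes that line numbers and field numbers are both 1-based, which the sample above does not state.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical in-memory form of the import_corrections table:
# (file_name, line_num) -> (action, payload)
Corrections = Dict[Tuple[str, int], Tuple[str, str]]


def remove_field(fields: List[str], payload: str) -> List[str]:
    """Apply a REMOVE_FIELD payload such as "21 of 54" to a list of fields."""
    index_text, _, count_text = payload.partition(" of ")
    index, bad_count = int(index_text), int(count_text)
    # Act only when the line still has the known-bad field count.
    # This guard is what makes the correction idempotent.
    if len(fields) != bad_count:
        return fields
    # Assumption: the field number in the payload is 1-based.
    return fields[: index - 1] + fields[index:]


def apply_corrections(
    file_name: str, lines: Iterable[str], corrections: Corrections
) -> Iterator[str]:
    """Yield each line, corrected if a correction is registered for it."""
    # Assumption: line_num in the corrections table is 1-based.
    for line_num, line in enumerate(lines, start=1):
        correction = corrections.get((file_name, line_num))
        if correction is None:
            yield line
            continue
        action, payload = correction
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # preserve the original line ending
        fields = body.split("\t")
        if action == "REMOVE_FIELD":
            fields = remove_field(fields, payload)
        else:
            raise ValueError(f"Unknown correction action: {action}")
        yield "\t".join(fields) + ending
```

Other actions, such as a spelling fix, could be added as further branches on `action`, each with its own payload format.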

Well, let me know if you have suggestions.

palewire commented 6 years ago

Not a bad idea!

rkiddy commented 4 years ago

Hello. So, it has been a long time. I have been doing other things with this data, but I wanted to ask what activity is happening with this project lately. It looks as though there have been no changes for a while. So, is it working great? Is it doing the job for you all? Any news or thoughts on this project and its status?

much thanx - ray