sunlightlabs / echelon

library for analyzing it all
GNU General Public License v3.0
31 stars 4 forks source link

Annotate lobbying data json #2

Closed zmaril closed 9 years ago

zmaril commented 10 years ago

To make it easier for others to work with the resolution data, it is nessecary to annotate the orignal json with the resolved being ids. That means simply adding in new properities to the old json in the apporirate places.

zmaril commented 10 years ago

Currently blocking on #1 for the moment, or at least until the severity of that bug can be figured out.

zmaril commented 9 years ago

With this https://github.com/sunlightlabs/echelon/commit/af3c141833cfd5f40143199817e80576cec5c23a, I think I've got the files annotated. Need to sync up with @boblannon next week to talk about how to deliver the data. Still unsure if there is a bug with the resolution process, but this should be enough to get moving at least.

zmaril commented 9 years ago

I finally figured out what this bug was. When I went to annotate the original json data with being id's, there would be instances where the annotated id would be nil. A partial, thinned out example of one such file (which is represented as a nested map in clojure) can be seen below. Any place where a ;; .... appears indicates that I've removed data for clarity.

{;; ....
 "registrant"
 {;; ....
  "being_id" nil
   ;; ......
  },
 ;; .....
 "document_id" "7e553ed2-a295-4ba5-9914-ac17b9e747f8",
 ;; ....
 "client"
 {"being_id" nil,
 ;; ..... 
},
;;....
}

The "document_id" is the unique identifier for each file/record in the database. With it we can look up the corresponding record entity in the database and see what the data actually looks like. Using the pull api and a lookup ref, we pull out a ton of data. Additionally, I've defined a function called replace-record-types that walks over the result of a pull and replaces echelon specific refs with more useful information for our purposes.

(defn replace-record-types [dbc m]
  (letfn [(identier [arg]
            (if (and (coll? arg) (= :record/type (first arg)))
              (update-in arg [1] (comp :db/ident (partial d/entity dbc) :db/id))
              arg))
          (beer [arg]
            (if (and (coll? arg) (= :record/represents (first arg)))
              (update-in arg [1] (comp :being/id (partial d/entity dbc) :db/id))
              arg))]
    (->> m
         (clojure.walk/postwalk identier)
         (clojure.walk/postwalk beer))))

(let [form [:lobbying.form/document-id "7e553ed2-a295-4ba5-9914-ac17b9e747f8"]
      dbc (db (d/connect uri))]
  (replace-record-types dbc (d/pull dbc '[*] form)))

;;Result
{:lobbying.registration/new-client-for-existing-registrant false,
 :lobbying.form/registrant
 {:db/id 17592186102489,
  :record/type :lobbying.record/registrant,
  :lobbying.registrant/organization-name "American Society for Nutrition",
  :lobbying.registrant/description
  "Professional scientific society; Publisher of scientific journals",
  :lobbying.registrant/main-address
  {:db/id 17592186102490,
   :address/first-line "9650 Rockville Pike",
   :address/second-line "",
   :address/zipcode "20814",
   :address/city "Bethesda",
   :address/state "MD",
   :address/country "USA"},
  :lobbying.registrant/principal-place-of-business
  {:db/id 17592186102491,
   :address/zipcode "",
   :address/city "",
   :address/state "",
   :address/country "",
   :lobbying.registrant/self-employed-individual false},
  :parsed/name "american society for nutrition"},
 :lobbying.form/client-registrant-same true,
 :lobbying.form/client
 {:db/id 17592186102492,
  :record/type :lobbying.record/client,
  :record/represents "546aa1a9-327f-4267-90aa-7f0d29f35268",
  :lobbying.client/name "American Society for Nutrition",
  :lobbying.client/description "",
  :lobbying.client/main-address
  {:db/id 17592186102493,
   :address/first-line "",
   :address/zipcode "",
   :address/city "",
   :address/state "",
   :address/country "USA"},
  :lobbying.client/principal-place-of-business
  {:db/id 17592186102494,
   :address/zipcode "",
   :address/city "",
   :address/state "",
   :address/country ""},
  :parsed/name "american society for nutrition"},
 :lobbying.registration/senate-id "400340581",
 :lobbying.form/amendment false,
 :lobbying.form/source {:db/id 17592186045504},
 :lobbying.registration/activity
 {:db/id 17592186102495,
  :record/type :lobbying.record/activity,
  :record/represents "546a97cb-5056-4c69-99b9-e125d015c6df",
  :lobbying.activity/lobbyists
  [{:db/id 17592186102496,
    :data/position 0,
    :record/type :lobbying.record/lobbyist,
    :record/represents "546a97cb-f710-4351-8d84-fd41236a6b19",
    :lobbying.lobbyist/first-name "Mary Lee",
    :lobbying.lobbyist/last-name "Watts",
    :lobbying.lobbyist/suffix "",
    :lobbying.lobbyist/covered-official-position
    "Legislative Correspondent, U.S. Sen. Zell Miller (D-GA)"}],
  :lobbying.activity/general-details
  "Budget and appropriations for federal agencies: NIH, USDA, CDC\nLegislative issues related to nutrition labeling and programs"},
 :lobbying.form/filepath
 "/home/zmaril/new-data/sopr_html/2008/REG/7e553ed2-a295-4ba5-9914-ac17b9e747f8.json",
 :lobbying.registration/new-registrant true,
 :lobbying.registration/effective-date #inst "2008-09-05T00:00:00.000-00:00",
 :lobbying.registration/house-id "",
 :lobbying.form/contact
 {:db/id 17592186102497,
  :record/type :lobbying.record/contact,
  :record/represents "546a97cb-7937-4886-814f-6c549571ae03",
  :lobbying.contact/name "Mrs. Mary Lee Watts",
  :lobbying.contact/phone "3016347112",
  :lobbying.contact/email "mwatts@nutrition.org"},
 :db/id 17592186102488,
 :lobbying.form/signature-date #inst "2008-09-05T00:00:00.000-00:00",
 :lobbying.form/document-id "7e553ed2-a295-4ba5-9914-ac17b9e747f8",
 :record/type :lobbying.record/registration}

This produces a ton of information, that all seems to look valid. No glaring errors here. Let's take a look at all the changes to the registrant datomic entity that have occurred. This entity has the id 17592186102489 and we are interested in all changes that were made to the value of the :record/represents attribute. We'll use the history to see the various changes that were made:

(let [dbc (db (d/connect uri))]
  (->>
   (d/q '[:find ?tx ?v ?added
          :in $ ?e
          :where
          [?e :record/represents ?v ?tx ?added]]
        (d/history dbc)
        17592186102489)
   seq
   (sort-by first)))

;;Result
([13194139591378 17592186102485 true]
 [13194153775801 17592186102483 true]
 [13194153775801 17592186102485 false]
 [13194155029351 17592186102483 false])

What this tells us is that there were three transactions, 13194139591378,13194153775801,13194155029351, that made changes to the value. The first transaction set the record to represent ...85. The second transaction set the record to no longer represent ...85 but to represent ...83. And the third transaction just retracted the value entirely, thus causing the nil. Now we know that at one point the record did have a being and the record underwent resolution but something ended up turning it off for some reason.

Let's look at all the things that happended in that last transaction:

(let [dbc (db (d/connect uri))]
  (->> (d/q '[:find ?e ?a ?v ?op
              :in ?log ?t1
              :where
              [(tx-ids ?log ?t1 (inc ?t1)) [?tx]]
              [(tx-data ?log ?t1) [[?e ?a ?v _ ?op]]]
              ]
            (d/log (d/connect uri))
            13194155029351)
       seq
       (map #(update-in % [1] (partial ident-finder dbc)))
       (sort-by first)))

;;Result
([13194155029351 :db/txInstant #inst "2014-11-18T03:40:09.417-00:00" true]
 [17592186102483 :record/type 17592186045496 false]
 [17592186102483 :being/id "546a97cb-f456-4bd7-8697-f7af6238603d" false]
 [17592186102489 :record/represents 17592186102483 false])

So, for whatever reason, somehow only one retraction was issued during a merge without anything else happening. This screams race condition: something crazy was happening and nobody has any idea why. It took me a month and a half to realize what was going on though (I think). All of the merging is happening here. The merge functions are effectively closing over the same value of the database and generating all of the results at once, which means that, for example, two records could end up merging two different ways in the same step (regrettably, there is a certain amount of randomness with the parsing, and this is possible) where each record delete the other's beings. It's a weird bug and I think I know how to fix it (use reduce instead of pmap).

zmaril commented 9 years ago

Upon thinking about this, this still shouldn't happen. The union find we use should remove anything from the same step from deleting each other. Still no idea what is going on here.

zmaril commented 9 years ago

Now I think I have it. I was almost right before, a database value is getting closed over and not updated. It gets created here and contiually reused https://github.com/sunlightlabs/echelon/blob/master/src/echelon/core.clj#L139. This should be a simple fix to a big problem. Changing execute-merge-fn to something like this should do it:

(defn execute-merge-fn [conn f]
  (doseq [result (->> conn db f
                      (partition-all 1)
                      (pmap (partial d/transact conn)))]
    (try @result
         (catch Exception e
           (throw e)))))

On retrospect, it is weird to pass in both the connection and value of a database to a function like that.

boblannon commented 9 years ago

okay, so it sounds like this might be resolved? if so, when committing, mention the issue number so that GH will close the issue automatically and attach a link to the commit.

zmaril commented 9 years ago

This line was the reason for the weird transactions that only had a single datom in them. After I produced all the datoms for the merges and was loading a large number of records into the db, I started running into database timeouts. It was taking too long for the transaction to close and so the db was throwing timeout errors. Instead of debugging this properly I put in the partition-all function call which partitioned the transactions into their individual datoms and uploaded everything one by one. I didn't really think about this at the time, but this destroys any value of having datoms grouped by transactions and really wasn't a smart idea. I'm going back now and trying to find a better way to upload the transactions, many by trying to understand the size/volume of the transactions and how that caused timeouts.

zmaril commented 9 years ago

Preliminary results: the timeout errors came from trying to resolve entities with large number of associated records, in the several thousands at once. For example, it choked on merging ~2360 records which had the exact name "LOCKRIDGE GRINDAL NAUEN P.L.L.P.". After checking around again on the mailing list, I've found a property for datomic that allows me to modify the time before a transaction timeout. I've increased this by an order of magnitude and it seems to be making this much more possible.

zmaril commented 9 years ago

As expected, I had been wrecking the grouping of datoms within transactions. This change fixed that. The timeouts were addressed, the right way this time, by increasing the timeouts in the properties file. Investigating the missing being id's should now make more sense.

zmaril commented 9 years ago

Let's take a look at the unrepresented records further. Below starts with a document that we know has a missing being and ends with all the transactions that modified the part of the record (the client) that has a missing being. It uses both history and the transaction log to create this list.

(def document-without-represented-client
  "7e553ed2-a295-4ba5-9914-ac17b9e747f8")

(def unrepresented-client
  (ffirst (d/q '[:find ?client
                 :in $ ?document-id
                 :where
                 [?document         :lobbying.form/document-id ?document-id]
                 [?document         :lobbying.form/client ?client]]
               (db (d/connect uri))
               document-without-represented-client)))

(def modifying-transactions
  (->>
   (d/q '[:find ?tx
          :in $ ?e
          :where
          [?e _  _ ?tx]]
        (d/history (db (d/connect uri)))
        unrepresented-client)
   seq
   (map first)))

(let [dbc (db (d/connect uri))]
  (->> (d/q '[:find ?tx ?e ?a ?v ?op
              :in ?log [?t1 ...]
              :where
              [(tx-ids ?log ?t1 (inc ?t1)) [?tx]]
              [(tx-data ?log ?t1) [[?e ?a ?v _ ?op]]]
              ]
            (d/log (d/connect uri))
            modifying-transactions)
       seq
       (map #(update-in % [2] (partial ident-finder dbc)))
       (sort-by first)
       (breakout 2)))

This gives us a big data structure that I've broken down below. Note that each snippet is everything that happened in a transaction. The first level of the snippet is the entity that was affected, the second level is the attribute that was changed and the third level is the value for the attribute as well as whether the attribute was meant to be seen for the property. If this last value is true, that means the value is live for the entity and can be seen by queries. If false, this means that the value is retracted and won't be available through normal unhistorical queries.

The first transaction that modified the client entity was 13194139591378 and did the following

{17592186102492
 {:record/type {17592186045497 (true)},
  :parsed/name {"american society for nutrition" (true)},
  :lobbying.client/principal-place-of-business {17592186102494 (true)},
  :lobbying.client/main-address {17592186102493 (true)},
  :lobbying.client/name {"American Society for Nutrition" (true)},
  :record/represents {17592186102483 (true)},
  :lobbying.client/description {"" (true)}},
 17592186102489
 {:lobbying.registrant/organization-name {"American Society for Nutrition" (true)},
  :lobbying.registrant/principal-place-of-business {17592186102491 (true)},
  :record/represents {17592186102485 (true)},
  :parsed/name {"american society for nutrition" (true)},
  :lobbying.registrant/description {"Professional scientific society; Publisher of scientific journals" (true)},
  :lobbying.registrant/main-address {17592186102490 (true)},
  :record/type {17592186045498 (true)}},
 17592186102493 {:address/first-line {"" (true)}, :address/zipcode {"" (true)}, :address/city {"" (true)}, :address/state {"" (true)}, :address/country {"USA" (true)}},
 17592186102483 {:record/type {17592186045496 (true)}, :being/id {"546c1c2f-0488-4ed4-be28-dd432340c510" (true)}},
 17592186102484 {:record/type {17592186045496 (true)}, :being/id {"546c1c2f-535c-41dd-9a48-fac1bea02017" (true)}},
 17592186102487 {:record/type {17592186045496 (true)}, :being/id {"546c1c2f-7e3b-459b-96d0-446dc140f1d4" (true)}},
 17592186102491
 {:address/zipcode {"" (true)},
  :address/state {"" (true)},
  :lobbying.registrant/self-employed-individual {false (true)},
  :address/city {"" (true)},
  :address/country {"" (true)}},
 17592186102496
 {:data/position {0 (true)},
  :record/type {17592186045500 (true)},
  :lobbying.lobbyist/covered-official-position {"Legislative Correspondent, U.S. Sen. Zell Miller (D-GA)" (true)},
  :record/represents {17592186102487 (true)},
  :lobbying.lobbyist/last-name {"Watts" (true)},
  :lobbying.lobbyist/suffix {"" (true)},
  :lobbying.lobbyist/first-name {"Mary Lee" (true)}},
 17592186102485 {:record/type {17592186045496 (true)}, :being/id {"546c1c2f-07c8-4ef3-b9bb-8542f118b69d" (true)}},
 17592186102486 {:being/id {"546c1c2f-d881-4cd1-8b9a-023c690258b6" (true)}, :record/type {17592186045496 (true)}},
 17592186102490
 {:address/zipcode {"20814" (true)},
  :address/city {"Bethesda" (true)},
  :address/state {"MD" (true)},
  :address/first-line {"9650 Rockville Pike" (true)},
  :address/country {"USA" (true)},
  :address/second-line {"" (true)}},
 17592186102488
 {:lobbying.registration/new-client-for-existing-registrant {false (true)},
  :lobbying.form/registrant {17592186102489 (true)},
  :lobbying.form/client-registrant-same {true (true)},
  :lobbying.form/client {17592186102492 (true)},
  :lobbying.registration/senate-id {"400340581" (true)},
  :lobbying.form/amendment {false (true)},
  :lobbying.form/source {17592186045504 (true)},
  :lobbying.registration/activity {17592186102495 (true)},
  :lobbying.form/filepath {"/home/zmaril/new-data/sopr_html/2008/REG/7e553ed2-a295-4ba5-9914-ac17b9e747f8.json" (true)},
  :lobbying.registration/new-registrant {true (true)},
  :lobbying.registration/effective-date {#inst "2008-09-05T00:00:00.000-00:00" (true)},
  :lobbying.registration/house-id {"" (true)},
  :lobbying.form/contact {17592186102497 (true)},
  :lobbying.form/signature-date {#inst "2008-09-05T00:00:00.000-00:00" (true)},
  :lobbying.form/document-id {"7e553ed2-a295-4ba5-9914-ac17b9e747f8" (true)},
  :record/type {17592186045505 (true)}},
 17592186102495
 {:lobbying.activity/general-details
  {"Budget and appropriations for federal agencies: NIH, USDA, CDC\nLegislative issues related to nutrition labeling and programs" (true)},
  :lobbying.activity/lobbyists {17592186102496 (true)},
  :record/represents {17592186102486 (true)},
  :record/type {17592186045503 (true)}},
 17592186102497
 {:record/type {17592186045499 (true)},
  :lobbying.contact/phone {"3016347112" (true)},
  :lobbying.contact/email {"mwatts@nutrition.org" (true)},
  :lobbying.contact/name {"Mrs. Mary Lee Watts" (true)},
  :record/represents {17592186102484 (true)}},
 13194139591378 {:db/txInstant {#inst "2014-11-19T04:27:27.631-00:00" (true)}},
 17592186102494 {:address/zipcode {"" (true)}, :address/country {"" (true)}, :address/state {"" (true)}, :address/city {"" (true)}}}

The next transaction was 13194153859973 and did the following:

{17592186102492 {:record/represents {17592186102483 (false)}},
 17592188460118 {:record/represents {17592188460112 (false), 17592192068231 (true)}},
 13194153859973 {:db/txInstant {#inst "2014-11-24T06:23:12.053-00:00" (true)}},
 17592193900662 {:record/represents {17592193900655 (false)}},
 17592190831065 {:record/represents {17592192068231 (true), 17592190831054 (false)}},
 17592192613306 {:record/type {17592186045496 (false)}, :being/id {"546c270d-140e-4062-9ef4-6ba0710229a8" (false)}},
 17592193471478 {:record/represents {17592192068231 (true), 17592193471467 (false)}},
 17592186409585 {:record/type {17592186045496 (false)}, :being/id {"546c1c96-31c3-475e-8435-ae76bd081f4c" (false)}},
 17592186102489 {:record/represents {17592186102483 (false), 17592192068231 (true)}},
 17592193599752 {:being/id {"546c28b5-3ba1-457c-8221-ede714085ca7" (false)}, :record/type {17592186045496 (false)}},
 17592191728667 {:being/id {"546c2596-1cfb-4118-af21-f72fc8eb284c" (false)}, :record/type {17592186045496 (false)}},
 17592186102483 {:record/type {17592186045496 (false)}, :being/id {"546c1c2f-0488-4ed4-be28-dd432340c510" (false)}},
 17592188289590 {:record/represents {17592192068231 (true), 17592188289582 (false)}},
 17592193900655 {:record/type {17592186045496 (false)}, :being/id {"546c2937-3f0c-4236-a48c-fd1fc97d1994" (false)}},
 17592189480953 {:being/id {"546c21ed-0693-4ac2-81f4-5363c4f65fcb" (false)}, :record/type {17592186045496 (false)}},
 17592192613818 {:being/id {"546c270d-2057-4748-bfab-5615819b20e1" (false)}, :record/type {17592186045496 (false)}},
 17592190419480 {:record/represents {17592190419472 (false), 17592192068231 (true)}},
 17592188289582 {:being/id {"546c1ff5-0a43-49a6-af43-7f626fcd7b7a" (false)}, :record/type {17592186045496 (false)}},
 17592192613313 {:record/represents {17592192068231 (true), 17592192613306 (false)}},
 17592193464577 {:record/represents {17592192068231 (true), 17592193464571 (false)}},
 17592190419472 {:record/type {17592186045496 (false)}, :being/id {"546c2372-a8b0-4edd-afcd-0b04819ad9d4" (false)}},
 17592190419487 {:record/represents {17592190419472 (false)}},
 17592193464571 {:being/id {"546c287b-a1f1-44bb-aab0-34da4eb07e00" (false)}, :record/type {17592186045496 (false)}},
 17592189620796 {:record/represents {17592189620781 (false), 17592192068231 (true)}},
 17592188460112 {:record/type {17592186045496 (false)}, :being/id {"546c203d-f222-4cf8-90f7-41c23e03ae18" (false)}},
 17592189620781 {:record/type {17592186045496 (false)}, :being/id {"546c2226-955d-4e39-bdb9-066d5e1befbe" (false)}},
 17592193149485 {:record/type {17592186045496 (false)}, :being/id {"546c27f1-59c5-4505-bf3e-c2882419de56" (false)}},
 17592193149498 {:record/represents {17592192068231 (true), 17592193149485 (false)}},
 17592190831060 {:record/represents {17592190831054 (false)}},
 17592192613831 {:record/represents {17592192068231 (true), 17592192613818 (false)}},
 17592193599765 {:record/represents {17592193599752 (false), 17592192068231 (true)}},
 17592193464582 {:record/represents {17592193464571 (false)}},
 17592193900659 {:record/represents {17592192068231 (true), 17592193900655 (false)}},
 17592193471467 {:being/id {"546c287e-b2a1-4a49-9a2f-6a527225d331" (false)}, :record/type {17592186045496 (false)}},
 17592190831054 {:record/type {17592186045496 (false)}, :being/id {"546c241e-ca7d-4714-976c-6bd95d252a06" (false)}},
 17592189480959 {:record/represents {17592192068231 (true), 17592189480953 (false)}},
 17592188289597 {:record/represents {17592188289582 (false)}},
 17592191728679 {:record/represents {17592192068231 (true), 17592191728667 (false)}},
 17592189620789 {:record/represents {17592189620781 (false)}},
 17592188460123 {:record/represents {17592188460112 (false)}},
 17592186409595 {:record/represents {17592192068231 (true), 17592186409585 (false)}},
 17592193471473 {:record/represents {17592193471467 (false)}},
 17592192613310 {:record/represents {17592192613306 (false)}},
 17592191728690 {:record/represents {17592191728667 (false)}},
 17592189480964 {:record/represents {17592189480953 (false)}}}

The first transaction was the loading of the record while the second transaction was the merge step*. In the merge step, there are several entities that have the value of their :record/represents set to false without a new true value showing up. Looking further into this now.

*I need to add in information about the merge taking place and why it happened. I don't know why these records were being weaved together as a single being. Opened an issue on this.

zmaril commented 9 years ago

Might have found it. I had problems with shapes of nested vectors and was throwing out some records that should have been annotated correctly. Rerunning everything now.

zmaril commented 9 years ago

This seem to have drastically reduced the number of unannotated records. I didn't do a full count prior to the fix, but my impression was that most of the records didn't have beings. Now I can only find 32 records total that don't have annotations for being id's (still mastering jq though, so there might be some more).

As such, I'm going to say that, for most intents and purposes, all the json files are annotated and are ready to be used upstream.