geohotstan commented 3 months ago

36

How it works currently: The entry point is the run.py file. Run with python run.py You can check the current output at sample_house.md and sample_senate.md (should be reproducible when given the api key in member.py)

run.py first fetches all members of the current congress from https://api.congress.gov/v3 and creates a folder for each member inside /data with a json that details that member's descriptions. Then the disclosures from both House and Senate is scraped from (disclosures-clerk.house.gov) and (efdsearch.senate.gov) respectively for a given year defined in run.py, and the disclosures are parsed and added into the json files of the members (for the ones that are directly parsable) For the disclosures that contain pdfs/images, the images are saved inside the folder of the member.

The current outputted markdown files ignore image disclosures, hence why sample_house.md is empty. There are two TODOs:

[ ] extract disclosure in json form from images so that it can be rendered into markdown
[ ] parse out only cryptocurrency related holdings from json
[ ] cleanup

For the first TODO, the code is already implemented in extract.py. I think a VLM (visual language model) is suited for this task. I've tried plain OCR and that was really bad, and I'm pretty sure SOTA for this stuff is just VLMs. Doing a few shot prompt asking for parsed json should do the trick. Problem is, I don't have an OpenAI API key to test and run this. Not too sure what to do here.

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....

small note: The interchangable_names.json is because often these members have/use different names when registering for member of congress and when uploading their disclosures. For example, GONZALES, ERNEST his name from https://api.congress.gov/v3 is "GONZALES, TONY" while his disclosures use "GONZALES, ERNEST".

I decided to get around this by writing a checker in run.py that asks if the two names are the same person which requires you to manually input the interchangeable name into that json...

geohotstan commented 3 months ago

https://gist.github.com/geohotstan/18ceeb2ee6a5fd965ac61c99ee6b7839 Here's the sample_senate.md since github diff doesn't like big files

jlopp commented 3 months ago

Nice; making good progress!

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....

To be clear, I don't think the determination of which tickers / asset names are considered "crypto" needs to be dynamic. I'm perfectly happy with it being a statically set list, because that won't require frequent updates.

Problem is, I don't have an OpenAI API key to test and run this.

I will gladly set up an OpenAI account with API access if you think that will help accomplish this goal. Just let me know the best way to securely share an API key with you.

jlopp / bitcoin-politicians

[wip] bitcoin politician automation #37

36