Open geohotstan opened 3 months ago
https://gist.github.com/geohotstan/18ceeb2ee6a5fd965ac61c99ee6b7839
Here's the `sample_senate.md`, since the GitHub diff doesn't like big files.
Nice; making good progress!
> For the second TODO, it seems LLMs aren't the best for this... I tried GPT-4 by asking if ARKK is related to cryptocurrency and it wasn't sure (ARKK holds Coinbase in its portfolio), so I need to think of another way....
To be clear, I don't think the determination of which tickers / asset names are considered "crypto" needs to be dynamic. I'm perfectly happy with it being a statically set list, because that won't require frequent updates.
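A statically set list could be as small as the sketch below. This is only an illustration: the names `CRYPTO_TICKERS`, `CRYPTO_NAME_KEYWORDS` and `is_crypto_related` are hypothetical, and the entries are examples rather than a vetted list (ARKK is included because of its Coinbase holding mentioned above).

```python
from typing import Optional

# Hypothetical, hand-maintained list; entries are illustrative only.
CRYPTO_TICKERS = frozenset({"ARKK", "COIN", "GBTC"})

# Fallback for disclosures that give an asset name but no ticker.
CRYPTO_NAME_KEYWORDS = ("bitcoin", "ethereum", "coinbase")


def is_crypto_related(ticker: Optional[str], asset_name: str = "") -> bool:
    """Return True if the ticker is on the static list or the name matches a keyword."""
    if ticker and ticker.strip().upper() in CRYPTO_TICKERS:
        return True
    name = asset_name.lower()
    return any(keyword in name for keyword in CRYPTO_NAME_KEYWORDS)
```

Keeping the list in one place means indirect exposure such as ARKK is a one-line addition and does not need a model call.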
> Problem is, I don't have an OpenAI API key to test and run this.
I will gladly set up an OpenAI account with API access if you think that will help accomplish this goal. Just let me know the best way to securely share an API key with you.
How it works currently: The entry point is the `run.py` file. Run with `python run.py`. You can check the current output at `sample_house.md` and `sample_senate.md` (should be reproducible when given the API key in `member.py`).

`run.py` first fetches all members of the current Congress from `https://api.congress.gov/v3` and creates a folder for each member inside `/data` with a JSON file that describes that member. Then the disclosures from the House and Senate are scraped from disclosures-clerk.house.gov and efdsearch.senate.gov respectively, for a given year defined in `run.py`. The disclosures that are directly parsable are parsed and added to the members' JSON files. For the disclosures that contain PDFs/images, the images are saved inside the member's folder.

The current outputted markdown files ignore image disclosures, which is why `sample_house.md` is empty. There are two TODOs:

For the first TODO, the code is already implemented in `extract.py`. I think a VLM (visual language model) is suited for this task. I've tried plain OCR and that was really bad, and I'm pretty sure SOTA for this stuff is just VLMs. A few-shot prompt asking for parsed JSON should do the trick. Problem is, I don't have an OpenAI API key to test and run this. Not too sure what to do here.

For the second TODO, it seems LLMs aren't the best for this... I tried GPT-4 by asking if ARKK is related to cryptocurrency and it wasn't sure (ARKK holds Coinbase in its portfolio), so I need to think of another way....
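To make the per-member data layout described above concrete, here is a minimal sketch of the "one folder per member, one JSON per folder" idea. It is not the code in `run.py`: the `member.json` file name, the `id` and `disclosures` keys, and the helper names are all assumptions made for illustration, and no network calls are made.

```python
import json
from pathlib import Path


def save_member(data_root: Path, member: dict) -> Path:
    """Create the member's folder and JSON file if they don't exist yet."""
    folder = data_root / member["id"]  # "id" is a hypothetical key
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "member.json"  # hypothetical file name
    if not path.exists():
        # Start with the member's description and an empty disclosure list.
        record = {**member, "disclosures": []}
        path.write_text(json.dumps(record, indent=2))
    return path


def add_disclosure(data_root: Path, member_id: str, disclosure: dict) -> None:
    """Append one parsed disclosure to the member's JSON file."""
    path = data_root / member_id / "member.json"
    record = json.loads(path.read_text())
    record["disclosures"].append(disclosure)
    path.write_text(json.dumps(record, indent=2))
```

Because `save_member` leaves an existing file untouched, re-running the fetch step would not wipe disclosures that were already parsed. Images from unparsable disclosures could sit next to the JSON in the same folder.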
Small note: `interchangable_names.json` exists because members often use different names when registering as a member of Congress and when uploading their disclosures. For example, one member's name from `https://api.congress.gov/v3` is "GONZALES, TONY" while his disclosures use "GONZALES, ERNEST".

I decided to get around this by writing a checker in `run.py` that asks whether the two names are the same person, which requires you to manually input the interchangeable name into that JSON...
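For illustration, a lookup against such an alias file might look like the sketch below. The JSON shape (disclosure name mapped to the congress.gov name) and the `canonical_name` helper are assumptions, not the actual format of `interchangable_names.json`.

```python
import json

# Assumed shape: disclosure name -> name returned by api.congress.gov.
ALIASES_JSON = '{"GONZALES, ERNEST": "GONZALES, TONY"}'


def canonical_name(disclosure_name: str, aliases: dict) -> str:
    """Normalize case and whitespace, then map known aliases to one name."""
    key = " ".join(disclosure_name.upper().split())
    return aliases.get(key, key)


aliases = json.loads(ALIASES_JSON)
print(canonical_name("Gonzales,  Ernest", aliases))
```

Names that are not in the alias file fall through unchanged (apart from normalization), so the manual prompt would only be needed for new mismatches.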