crec_detail_urls.json

URL example: https://www.govinfo.gov/wssearch/getContentDetail?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1

The scraper extracts title, pdf_link, Category, Report Type, Report Number, Date, Committee, and Associated Legislation. Category, Report Type, Report Number, Date, Committee, and Associated Legislation are in the response metadata. The scraper stores those data as JSON in the crec_data.json file.
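A minimal sketch of that scrape step, assuming the getContentDetail endpoint returns JSON and that the fields above sit under a metadata-style key (the exact response shape and the output path are assumptions):

```python
import json
import requests

DETAIL_URL = (
    "https://www.govinfo.gov/wssearch/getContentDetail"
    "?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1"
)


def fetch_crec_detail(url: str) -> dict:
    """Fetch one detail URL and pull out the fields listed above.

    The keys mirror the field names from this issue; where each field
    lives in the govinfo response is an assumption.
    """
    data = requests.get(url, timeout=30).json()
    metadata = data.get("metadata", {})  # assumed key
    return {
        "title": data.get("title"),
        "pdf_link": data.get("pdf_link"),
        "Category": metadata.get("Category"),
        "Report Type": metadata.get("Report Type"),
        "Report Number": metadata.get("Report Number"),
        "Date": metadata.get("Date"),
        "Committee": metadata.get("Committee"),
        "Associated Legislation": metadata.get("Associated Legislation"),
    }


if __name__ == "__main__":
    records = [fetch_crec_detail(DETAIL_URL)]
    with open("crec_data.json", "w") as f:
        json.dump(records, f, indent=2)
```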
Run the Django command ./manage.py load_crec to store them in the database. When we run the Django command above, it calls the crec_loader function in common/crec_data.py, which stores the data into the CommitteeDocument table in the database.
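For reference, the loader side could look roughly like this; the import path and the CommitteeDocument field names are assumptions, since only the table name is given here:

```python
import json

from committee.models import CommitteeDocument  # assumed app/model path


def crec_loader(path="crec_data.json"):
    """Read crec_data.json and store each record in CommitteeDocument.

    Model field names are assumed; adjust to the real schema.
    """
    with open(path) as f:
        records = json.load(f)
    for record in records:
        CommitteeDocument.objects.update_or_create(
            report_number=record.get("Report Number"),  # assumed natural key
            defaults={
                "title": record.get("title"),
                "pdf_link": record.get("pdf_link"),
                "category": record.get("Category"),
                "report_type": record.get("Report Type"),
                "date": record.get("Date"),
                "committee": record.get("Committee"),
                "associated_legislation": record.get("Associated Legislation"),
            },
        )
```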
The scraper goes to https://www.whitehouse.gov/omb/statements-of-administration-policy/, gets the URLs on that page, and stores them into the ../server_py/flatgov/biden_data.json file.

Run the Django command ./manage.py biden_statements to store them in the database. When we run the Django command above, it calls the load_statements function in common/biden_statements.py, which stores the data into the Statement table in the database.
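Wiring the command to the loader is standard Django plumbing; a sketch, with the app location of the command assumed:

```python
# <app>/management/commands/biden_statements.py  (app name assumed)
from django.core.management.base import BaseCommand

from common.biden_statements import load_statements


class Command(BaseCommand):
    help = "Load Statements of Administration Policy into the Statement table"

    def handle(self, *args, **options):
        load_statements()
        self.stdout.write(self.style.SUCCESS("Loaded SAP statements"))
```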
./manage.py load_cbo

Before loading, the Django command automatically deletes all the CBO instances in the database. It stores the data into the CboReport table in the database.

We run the CRS scraper and the CBO scraper daily using the Celery scheduler.
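The daily runs can be expressed with Celery beat; a sketch, assuming the tasks are registered under paths like crs.tasks.run_crs_scraper and cbo.tasks.run_cbo_scraper (task names and times are assumptions):

```python
# settings.py — a sketch of the daily schedule; task paths are assumed
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "run-crs-scraper-daily": {
        "task": "crs.tasks.run_crs_scraper",    # assumed task path
        "schedule": crontab(hour=2, minute=0),  # 02:00 daily
    },
    "run-cbo-scraper-daily": {
        "task": "cbo.tasks.run_cbo_scraper",    # assumed task path
        "schedule": crontab(hour=3, minute=0),  # 03:00 daily
    },
}
```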
CREC and SAP scrapers were built with Scrapy.
We will need to integrate Scrapy with Django.
Here is the schema:
1. Client sends a request with a URL to crawl.
2. Django triggers Scrapy to run a spider to crawl that URL.
3. Django returns a response to tell the Client that crawling has started.
4. Scrapy completes crawling and saves the extracted data into the database.
5. Django fetches that data from the database and returns it to the Client.
In this way, we no longer need to store data in JSON files in Scrapy.
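One way to wire steps (1)-(3) is to run the spiders under scrapyd and have Django schedule a job through scrapyd's schedule.json endpoint; a sketch, assuming scrapyd is running locally and with placeholder project/spider names:

```python
# views.py — sketch of steps (1)-(3); project and spider names are placeholders
import requests
from django.http import JsonResponse
from django.views.decorators.http import require_POST

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # assumes scrapyd is running


@require_POST
def start_crawl(request):
    url = request.POST.get("url")
    if not url:
        return JsonResponse({"error": "url is required"}, status=400)
    # (2) ask scrapyd to run the spider against the requested URL
    resp = requests.post(
        SCRAPYD_URL,
        data={"project": "flatgov_scrapers", "spider": "crec", "url": url},
    )
    # (3) immediately tell the client that crawling has started
    return JsonResponse({"status": "started", "jobid": resp.json().get("jobid")})
```

Steps (4)-(5) then stay inside the spider's item pipeline (write to the database) and a normal Django view that reads the table.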
Notes:

- For the Scrapy scrapers, we also want to:
- Also, for the CREC scraper, I believe there is code that makes the crec_detail_urls.json. @ayeshamk?
- For the CBO scraper, we don't need to delete the data and recreate it; we can check each item by bill_number before adding it to the database (see the sketch after this list).
- CRS scraper: we need to store the latest URL from the CSV file while running the Celery task, to avoid duplicates.

In that way, we can avoid duplicates in Scrapy.
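For the CBO note above, a sketch of the check-before-insert approach, assuming CboReport has a bill_number field we can treat as the natural key (the other field names are assumptions):

```python
# Insert CBO items without the delete/recreate step.
from cbo.models import CboReport  # assumed import path


def save_cbo_item(item: dict):
    """Create the report only if this bill_number isn't stored yet."""
    obj, created = CboReport.objects.get_or_create(
        bill_number=item["bill_number"],
        defaults={
            "title": item.get("title"),      # assumed fields
            "pub_date": item.get("pub_date"),
            "link": item.get("link"),
        },
    )
    return obj, created
```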
We have a number of scrapers and processing tasks. We need to make sure they run efficiently and get only the latest changes, if possible. This is first an issue of analyzing the current scrapers and how they work. We need to know:
@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.
We now have the following scrapers and processors:
- crec_loader (Ayesha)