unitedstates / BillMap

Utilities and applications for the FlatGov project by Demand Progress

Organize Celery tasks #248

Closed aih closed 3 years ago

aih commented 3 years ago

We have a number of scrapers and processing tasks. We need to make sure they run efficiently and pick up only the latest changes where possible. The first step is to analyze the current scrapers and how they work. We need to know:

  1. How is the initial data loaded?
  2. How are updates made?

@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.

We now have the following scrapers and processors:

weinicookpad commented 3 years ago

CREC scraper

URL example: https://www.govinfo.gov/wssearch/getContentDetail?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1

Fields extracted: title, pdf_link, Category, Report Type, Report Number, Date, Committee, Associated Legislation

Category, Report Type, Report Number, Date, Committee, Associated Legislation are in the response metadata.
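As a reference, here is a minimal sketch of pulling those fields out of the getContentDetail response with requests. The exact key names in the JSON payload are assumptions and need to be verified against a live response.

```python
import requests

DETAIL_URL = (
    "https://www.govinfo.gov/wssearch/getContentDetail"
    "?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1"
)

WANTED = {"Category", "Report Type", "Report Number",
          "Date", "Committee", "Associated Legislation"}

def fetch_report_metadata(url: str = DETAIL_URL) -> dict:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    payload = response.json()

    # Assumed response shape: a title, a PDF link, and a list of name/value metadata rows.
    fields = {"title": payload.get("title"), "pdf_link": payload.get("pdfLink")}
    for row in payload.get("metadata", []):
        if row.get("name") in WANTED:
            fields[row["name"]] = row.get("value")
    return fields
```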

When we run the Django command above, it calls the crec_loader function in common/crec_data.py.

It stores the data in the CommitteeDocument table in the database.
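For illustration, a minimal sketch of what that management command might look like, assuming crec_loader takes no arguments and does the saving itself (the command name and path here are hypothetical, not the actual command referenced above):

```python
# common/management/commands/load_crec.py  (hypothetical name/path)
from django.core.management.base import BaseCommand

from common.crec_data import crec_loader


class Command(BaseCommand):
    help = "Load CREC committee documents into the CommitteeDocument table."

    def handle(self, *args, **options):
        # crec_loader does the scraping/parsing and saves CommitteeDocument rows.
        crec_loader()
        self.stdout.write(self.style.SUCCESS("CREC data loaded."))
```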

Statements of Administration Policy Scraper

When we run the Django command above, it calls the load_statements function in common/biden_statements.py.

It stores the data in the Statement table in the database.

CBO Scraper

Before running the scraper, the Django command automatically deletes all the CBO instances in the database.
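Roughly, the current behavior looks like the sketch below; the model and loader names are assumptions, since only the delete-then-reload pattern is described above.

```python
from django.core.management.base import BaseCommand

from common.models import CboReport            # assumed model name
from common.cbo_data import load_cbo_reports   # assumed loader


class Command(BaseCommand):
    help = "Reload all CBO cost estimates from scratch."

    def handle(self, *args, **options):
        CboReport.objects.all().delete()  # current approach: full wipe
        load_cbo_reports()                # then re-scrape and re-insert everything
```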

CRS Scraper

How daily updates work:

  1. We run the CRS and CBO scrapers daily using the Celery scheduler; a sketch of the beat schedule follows this list.

  2. The CREC and SAP scrapers were built with Scrapy.
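As one possible setup for item 1, here is a hedged sketch of a celery beat schedule that runs the two scrapers once a day. The module path and task names are assumptions.

```python
# billmap/celery.py (hypothetical module path)
from celery import Celery
from celery.schedules import crontab

app = Celery("billmap")
app.config_from_object("django.conf:settings", namespace="CELERY")

# Run the CRS and CBO scrapers once a day via celery beat.
app.conf.beat_schedule = {
    "scrape-crs-daily": {
        "task": "scrapers.tasks.run_crs_scraper",   # assumed task name
        "schedule": crontab(hour=2, minute=0),
    },
    "scrape-cbo-daily": {
        "task": "scrapers.tasks.run_cbo_scraper",   # assumed task name
        "schedule": crontab(hour=3, minute=0),
    },
}
```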

We will need to integrate Scrapy with Django.

Here is the flow:

  1. Client sends a request with a URL to crawl.

  2. Django triggers Scrapy to run a spider to crawl that URL.

  3. Django returns a response to tell the Client that crawling has started.

  4. Scrapy completes crawling and saves the extracted data into the database.

  5. Django fetches that data from the database and returns it to the Client.

This way, we no longer need to store data in JSON files from Scrapy.
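A rough sketch of how steps 2 and 4 could look, with a Celery task launching the spider and a Scrapy item pipeline writing directly into the Django model. The spider name, task, and field names are assumptions.

```python
import subprocess

from celery import shared_task


@shared_task
def run_crec_spider():
    # Step 2: Django (via Celery) launches the Scrapy spider in a subprocess.
    # Assumes the worker's working directory is the Scrapy project root.
    subprocess.run(["scrapy", "crawl", "crec"], check=True)


# In the Scrapy project's pipelines.py (assumes django.setup() has run for this process):
class DjangoWriterPipeline:
    """Step 4: save each scraped item into the CommitteeDocument table."""

    def process_item(self, item, spider):
        from common.models import CommitteeDocument

        # Field names are assumptions; upsert keyed on the report number.
        CommitteeDocument.objects.update_or_create(
            report_number=item.get("report_number"),
            defaults={"title": item.get("title"), "pdf_link": item.get("pdf_link")},
        )
        return item
```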

aih commented 3 years ago

Notes:

For the Scrapy scrapers, we also want to:

aih commented 3 years ago

Also, for the CREC scraper, I believe there is code that generates crec_detail_urls.json.

@ayeshamk ?

kapphire commented 3 years ago

For the CBO scraper, we don't need to delete the data and recreate it. We can check by bill_number whether an item already exists before adding it to the database.
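For example, a hedged sketch using Django's update_or_create keyed on bill_number; the model and field names are assumptions.

```python
from common.models import CboReport  # assumed model name

def save_cbo_item(item: dict):
    # Upsert on bill_number instead of wiping the table each run.
    CboReport.objects.update_or_create(
        bill_number=item["bill_number"],
        defaults={
            "title": item.get("title"),
            "pub_date": item.get("pub_date"),
            "pdf_link": item.get("pdf_link"),
        },
    )
```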

CRS scraper: we need to store the latest URL from the CSV file while running the Celery task, to avoid duplicates.

In the same way, we can avoid duplicates in the Scrapy scrapers.
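A minimal sketch of that bookkeeping, assuming the CSV is ordered newest-first and has a url column; the state file name and column names are assumptions.

```python
import csv

STATE_FILE = "crs_last_url.txt"  # hypothetical marker file for the last processed URL

def load_new_rows(csv_path: str):
    try:
        with open(STATE_FILE) as f:
            last_seen = f.read().strip()
    except FileNotFoundError:
        last_seen = None

    new_rows = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["url"] == last_seen:
                break          # everything after this was handled in a previous run
            new_rows.append(row)

    if new_rows:
        with open(STATE_FILE, "w") as f:
            f.write(new_rows[0]["url"])  # newest URL becomes the new marker
    return new_rows
```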