unitedstates / BillMap

Utilities and applications for the FlatGov project by Demand Progress

Organize Celery tasks #248

Closed aih closed 3 years ago

aih commented 3 years ago

We have a number of scrapers and processing tasks. We need to make sure they run efficiently and pick up only the latest changes where possible. The first step is to analyze the current scrapers and how they work. We need to know:

  1. How is the initial data loaded?
  2. How are updates made?

@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.

We now have the following scrapers and processors:

weinicookpad commented 3 years ago

CREC scraper

URL example: https://www.govinfo.gov/wssearch/getContentDetail?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1

Fields extracted: title, pdf_link, Category, Report Type, Report Number, Date, Committee, Associated Legislation

Category, Report Type, Report Number, Date, Committee, Associated Legislation are in the response metadata.
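As a reference, here is a minimal sketch of pulling those fields out of the getContentDetail response with requests. The exact key names in the JSON payload are assumptions and need to be verified against a live response.

```python
import requests

DETAIL_URL = (
    "https://www.govinfo.gov/wssearch/getContentDetail"
    "?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1"
)

WANTED = {"Category", "Report Type", "Report Number",
          "Date", "Committee", "Associated Legislation"}

def fetch_report_metadata(url: str = DETAIL_URL) -> dict:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    payload = response.json()

    # Assumed response shape: a title, a PDF link, and a list of name/value metadata rows.
    fields = {"title": payload.get("title"), "pdf_link": payload.get("pdfLink")}
    for row in payload.get("metadata", []):
        if row.get("name") in WANTED:
            fields[row["name"]] = row.get("value")
    return fields
```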

When we run the Django command above, it calls the crec_loader function in common/crec_data.py.

It stores the data in the CommitteeDocument table in the database.
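For illustration, a minimal sketch of what that management command might look like, assuming crec_loader takes no arguments and does the saving itself (the command name and path here are hypothetical, not the actual command referenced above):

```python
# common/management/commands/load_crec.py  (hypothetical name/path)
from django.core.management.base import BaseCommand

from common.crec_data import crec_loader


class Command(BaseCommand):
    help = "Load CREC committee documents into the CommitteeDocument table."

    def handle(self, *args, **options):
        # crec_loader does the scraping/parsing and saves CommitteeDocument rows.
        crec_loader()
        self.stdout.write(self.style.SUCCESS("CREC data loaded."))
```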

Statements of Administration Policy Scraper

When we run the Django command above, it calls the load_statements function in common/biden_statements.py.

It stores the data in the Statement table in the database.

CBO Scraper

Before running the scraper, the Django command automatically deletes all the CBO instances in the database.
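Roughly, the current behavior looks like the sketch below; the model and loader names are assumptions, since only the delete-then-reload pattern is described above.

```python
from django.core.management.base import BaseCommand

from common.models import CboReport            # assumed model name
from common.cbo_data import load_cbo_reports   # assumed loader


class Command(BaseCommand):
    help = "Reload all CBO cost estimates from scratch."

    def handle(self, *args, **options):
        CboReport.objects.all().delete()  # current approach: full wipe
        load_cbo_reports()                # then re-scrape and re-insert everything
```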

CRS Scraper

How daily updates work:

  1. We run the CRS and CBO scrapers daily using the Celery scheduler; a sketch of the beat schedule follows this list.

  2. The CREC and SAP scrapers were built with Scrapy.
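As one possible setup for item 1, here is a hedged sketch of a celery beat schedule that runs the two scrapers once a day. The module path and task names are assumptions.

```python
# billmap/celery.py (hypothetical module path)
from celery import Celery
from celery.schedules import crontab

app = Celery("billmap")
app.config_from_object("django.conf:settings", namespace="CELERY")

# Run the CRS and CBO scrapers once a day via celery beat.
app.conf.beat_schedule = {
    "scrape-crs-daily": {
        "task": "scrapers.tasks.run_crs_scraper",   # assumed task name
        "schedule": crontab(hour=2, minute=0),
    },
    "scrape-cbo-daily": {
        "task": "scrapers.tasks.run_cbo_scraper",   # assumed task name
        "schedule": crontab(hour=3, minute=0),
    },
}
```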

We will need to integrate Scrapy with Django.

Here is the flow:

  1. Client sends a request with a URL to crawl.

  2. Django triggers Scrapy to run a spider to crawl that URL.

  3. Django returns a response to tell the Client that crawling has started.

  4. Scrapy completes crawling and saves the extracted data into the database.

  5. Django fetches that data from the database and returns it to the Client.

This way, we no longer need to store data in JSON files from Scrapy.
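A rough sketch of how steps 2 and 4 could look, with a Celery task launching the spider and a Scrapy item pipeline writing directly into the Django model. The spider name, task, and field names are assumptions.

```python
import subprocess

from celery import shared_task


@shared_task
def run_crec_spider():
    # Step 2: Django (via Celery) launches the Scrapy spider in a subprocess.
    # Assumes the worker's working directory is the Scrapy project root.
    subprocess.run(["scrapy", "crawl", "crec"], check=True)


# In the Scrapy project's pipelines.py (assumes django.setup() has run for this process):
class DjangoWriterPipeline:
    """Step 4: save each scraped item into the CommitteeDocument table."""

    def process_item(self, item, spider):
        from common.models import CommitteeDocument

        # Field names are assumptions; upsert keyed on the report number.
        CommitteeDocument.objects.update_or_create(
            report_number=item.get("report_number"),
            defaults={"title": item.get("title"), "pdf_link": item.get("pdf_link")},
        )
        return item
```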

aih commented 3 years ago

Notes:

For the Scrapy scrapers, we also want to:

aih commented 3 years ago

Also, for the CREC scraper, I believe there is code that generates crec_detail_urls.json.

@ayeshamk ?

kapphire commented 3 years ago

For the CBO scraper, we don't need to delete the data and recreate it. We can check by bill_number whether an item already exists before adding it to the database.
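For example, a hedged sketch using Django's update_or_create keyed on bill_number; the model and field names are assumptions.

```python
from common.models import CboReport  # assumed model name

def save_cbo_item(item: dict):
    # Upsert on bill_number instead of wiping the table each run.
    CboReport.objects.update_or_create(
        bill_number=item["bill_number"],
        defaults={
            "title": item.get("title"),
            "pub_date": item.get("pub_date"),
            "pdf_link": item.get("pdf_link"),
        },
    )
```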

CRS scraper: we need to store the latest URL from the CSV file while running the Celery task, to avoid duplicates.

In the same way, we can avoid duplicates in the Scrapy scrapers.
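A minimal sketch of that bookkeeping, assuming the CSV is ordered newest-first and has a url column; the state file name and column names are assumptions.

```python
import csv

STATE_FILE = "crs_last_url.txt"  # hypothetical marker file for the last processed URL

def load_new_rows(csv_path: str):
    try:
        with open(STATE_FILE) as f:
            last_seen = f.read().strip()
    except FileNotFoundError:
        last_seen = None

    new_rows = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["url"] == last_seen:
                break          # everything after this was handled in a previous run
            new_rows.append(row)

    if new_rows:
        with open(STATE_FILE, "w") as f:
            f.write(new_rows[0]["url"])  # newest URL becomes the new marker
    return new_rows
```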