This adds a second scraping function, `message-queue-scraper`, which is triggered by a new message on the `cases-to-scrape` queue. This second function is necessary because of the 10-minute limit on Azure Functions -- previously, when the first function, `http-scraper`, hit a day that contained 100 cases, it would time out. Now `http-scraper` hands the scraping work off to `message-queue-scraper` when it hits a large number of cases.
Changes in this PR:
New logic in `http-scraper`: for each day, if there are 10 or fewer cases, just scrape as usual. If there are more than 10 cases, write messages to the queue containing all the info needed to scrape those cases. Each message contains one batch of case URLs, with the batch size set in configuration. (Be sure to add `"cases_batch_size": 50` to your `local.settings.json`.)
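The batching step could be sketched roughly as below. This is a minimal illustration, not the repo's actual code: the function names, the message schema, and the 10-case threshold being hard-coded are all assumptions here.

```python
import json


def make_batches(case_urls, batch_size):
    # Split a day's case URLs into chunks of at most batch_size URLs each.
    return [case_urls[i:i + batch_size] for i in range(0, len(case_urls), batch_size)]


def handle_day(day, case_urls, batch_size=50, threshold=10):
    """Hypothetical per-day dispatch: scrape small days inline,
    enqueue batched messages for large days."""
    if len(case_urls) <= threshold:
        # Few cases: scrape inline as before, nothing queued.
        return {"scrape_inline": case_urls, "messages": []}
    # Many cases: one queue message per batch, carrying what the
    # queue-triggered function needs to scrape independently.
    messages = [
        json.dumps({"day": day, "case_urls": batch})
        for batch in make_batches(case_urls, batch_size)
    ]
    return {"scrape_inline": [], "messages": messages}
```

With a batch size of 50, a 120-case day would produce three queue messages instead of one long-running inline scrape.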
`message-queue-scraper` does issue 2 search requests before actually scraping the list of case URLs, because Odyssey requires a new session to do that.
`http-scraper/__init__.py` is now just one function (i.e. the `scrape` function has been merged into the main function). This was necessary in order to keep the message-queue output binding in scope. Also, because of this merge, the loop `for date in <date-range>` had to change to `for day in <date-range>` to avoid a naming conflict with Python's native `datetime` library.
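The rename matters because a loop variable called `date` would shadow `datetime.date` for the rest of the merged function. A minimal sketch of the fixed iteration (the date-range helper is an illustration, not the repo's exact code):

```python
from datetime import date, timedelta


def days_in_range(start: date, end: date):
    """Yield each day from start to end, inclusive. Naming the loop
    variable `day` leaves `datetime.date` usable inside the loop."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


for day in days_in_range(date(2021, 1, 1), date(2021, 1, 3)):
    # `date` still refers to datetime.date here, so calls like
    # date.today() or isinstance checks keep working:
    assert isinstance(day, date)
```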
Added `azure-cosmos` to `requirements.txt` -- necessary for `blob-parser` to write to Cosmos DB.