openaustralia / planningalerts

Find out and have your say about what's being built and knocked down in your area.
https://www.planningalerts.org.au

Systemic changes to handle older data being pulled in so that we can track changes to more DAs #1382

Closed jamezpolley closed 2 years ago

jamezpolley commented 5 years ago

Tracking changes to applications has several benefits; the most immediate one is that if an Authority updates the details of a DA shortly after they upload it, we'll catch and display the update. In particular, this makes it more likely that we'll be able to scrape notification start/end dates and decision dates, as those are often not known at the time we first scrape the DA but are typically added within a few days or weeks.

It also means we have the potential to track the final outcome of a DA (approved/rejected), as well as tracking it through the process.
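A minimal sketch of what that change tracking could look like (the table names, field list and function here are all made up for illustration, not how PlanningAlerts actually stores data): compare each re-scraped record against what we already have, log any field-level differences, and update the stored record.

```python
import sqlite3

def setup_tables(conn):
    # Hypothetical schema: one row per application plus a change log.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applications "
        "(council_reference TEXT PRIMARY KEY, description TEXT, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS application_changes "
        "(council_reference TEXT, field TEXT, old_value TEXT, new_value TEXT)"
    )

# The fields we'd watch for changes between scrapes.
TRACKED_FIELDS = ["description", "status"]

def upsert_application(conn, record):
    """Insert a scraped application, or log any fields that changed since the
    last scrape. Returns a list of (reference, field, old, new) changes."""
    row = conn.execute(
        "SELECT description, status FROM applications WHERE council_reference = ?",
        (record["council_reference"],),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO applications (council_reference, description, status) "
            "VALUES (?, ?, ?)",
            (record["council_reference"], record["description"], record["status"]),
        )
        return []
    changes = [
        (record["council_reference"], field, old, record[field])
        for field, old in zip(TRACKED_FIELDS, row)
        if old != record[field]
    ]
    for reference, field, old, new in changes:
        conn.execute(
            "INSERT INTO application_changes "
            "(council_reference, field, old_value, new_value) VALUES (?, ?, ?, ?)",
            (reference, field, old, new),
        )
        # field names come only from TRACKED_FIELDS, so interpolation is safe here
        conn.execute(
            f"UPDATE applications SET {field} = ? WHERE council_reference = ?",
            (new, reference),
        )
    return changes
```

With something like this, the change log is what would let us show an application's history and its final outcome, rather than just its state at first scrape.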

However, most scrapers currently only look at a short window of time in the immediate past. In many cases this is because Authority websites make it hard to find older DAs. But even when the Authority site makes it easy to scrape the older DAs, there's never been any reason to do so, as we would have ignored the applications we'd already seen anyway.

With #1317 implemented, this has changed and there's now value in us going back and re-scraping those DAs. If we don't, we'll end up with alerts with out-of-date information - eg, we'll never be able to report on whether something was approved/rejected if it takes more than the 7/14/30 days that our scraper is looking at, and we'll miss any updates that happen outside that window.

In order to do this completely, we'll have to look at each individual Authority and determine if it's possible to get at the older data. I'm not going to try to address that here; it's going to have to be case-by-case for each authority.

This ticket is to track the PlanningAlerts changes we might need to make to handle the older data, including documentation changes we might want to make to give instructions on what we expect from a scraper.

The first question is whether scrapers should re-scrape older data on every run or only periodically.

If periodically, how do we do that? I'm guessing something like having the scraper track when it last did a full scrape (in a separate table in sqlite perhaps?) and do a full scrape once every week or once per month or something.
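The "separate table in sqlite" idea could be as simple as this (a sketch; `scrape_metadata`, the key name and the 30-day interval are all assumptions, not existing PlanningAlerts conventions):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# How often a scraper should do a full pass over older DAs (assumed value).
FULL_SCRAPE_INTERVAL_DAYS = 30

def ensure_metadata_table(conn):
    # Side table holding bookkeeping values alongside the scraped data.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scrape_metadata (key TEXT PRIMARY KEY, value TEXT)"
    )

def due_for_full_scrape(conn, now=None):
    """True if we've never done a full scrape, or the last one is too old."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT value FROM scrape_metadata WHERE key = 'last_full_scrape'"
    ).fetchone()
    if row is None:
        return True
    last = datetime.fromisoformat(row[0])
    return now - last >= timedelta(days=FULL_SCRAPE_INTERVAL_DAYS)

def record_full_scrape(conn, now=None):
    now = now or datetime.now(timezone.utc)
    conn.execute(
        "INSERT OR REPLACE INTO scrape_metadata (key, value) "
        "VALUES ('last_full_scrape', ?)",
        (now.isoformat(),),
    )
```

At the top of each run the scraper would check `due_for_full_scrape()` and either do its normal short-window pass or the full historical pass, calling `record_full_scrape()` afterwards.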

If every time, that's going to mean that each scraper takes considerably longer to run and it's going to result in considerably more traffic on both our side and the Authority's site.

If we do want to scrape older data, we're going to have instances where a large number of older DAs are pulled in. We've seen that already with new scrapers being added and with broken scrapers being repaired. This can create confusion when we send out emails telling someone there are hundreds of new applications in their area, or when we tell them there's a new application on their property that they didn't know about.
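One way to avoid that confusion might be to store backlogged applications silently and only email alerts for genuinely recent ones. A sketch, where the 90-day cutoff and the `date_received` field name are assumptions for illustration:

```python
from datetime import date, timedelta

# Don't email alerts for applications received more than this long ago (assumed).
ALERT_CUTOFF_DAYS = 90

def should_email_alert(application, today=None):
    """True if this application is recent enough to be worth alerting on.
    Older applications would still be stored, just not emailed out."""
    today = today or date.today()
    received = application.get("date_received")
    if received is None:
        # No date available: fall back to alerting, as happens today.
        return True
    return (today - received) <= timedelta(days=ALERT_CUTOFF_DAYS)
```

A refinement would be to still email for *changes* to old applications (e.g. a decision being made), just not for their initial bulk import.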

If we do scrape the older data, we'll want to start tracking some data we haven't really looked at before: for instance, most Authority websites have some kind of "Status" field which shows where in its lifecycle the application is, and many will have some way of showing the date when the DA entered that state.
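Since every Authority writes its "Status" field differently, we'd probably want to normalise the free-text values into a small shared vocabulary. A sketch, where both the mapping and the vocabulary are invented examples, not real council strings:

```python
# Hypothetical mapping from council free-text statuses to a shared vocabulary.
STATUS_MAP = {
    "under assessment": "under_assessment",
    "on exhibition": "on_notification",
    "determined - approved": "approved",
    "determined - refused": "refused",
    "withdrawn": "withdrawn",
}

def normalise_status(raw_status):
    """Map a council's free-text status to a known value, or None if we
    don't recognise it (unrecognised values could be logged for review)."""
    if raw_status is None:
        return None
    return STATUS_MAP.get(raw_status.strip().lower())
```

The status date would then be stored alongside the normalised status, giving a per-application timeline once we're re-scraping regularly.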

If we don't scrape older data, we'll lose a big chunk of the commercial value of the data, as we'll be missing most final outcomes and lots of state changes. However, if we can only get older data, or the extended data, from a minority of sites, it may not be worth our time even trying to do this.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because there has been no activity on it for a year. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!

mlandauer commented 2 years ago

I have very mixed feelings about handling historic data. The perfectionist in me wants all the data in the system to be the best it can be. The pragmatist in me accepts that all the data is inherently flawed and wrong to different degrees, either by the way it's been entered at the council or the way it's scraped. So, for that reason we should not try to update old stuff.

So far, the pragmatist has won out.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because there has been no activity on it for about six months. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!