Closed jamesturk closed 2 years ago
Great news!
I'll drop these in as comments here, happy to convert that to an OSPEP later if you want. For the most part event scrapers are in pretty good shape, I don't feel that there are any deal-breaking holes in the current structure.
bills under consideration
placeholder agenda item, but maybe there's a better way to do this? I do like the ability to link them to specific agenda items, so maybe the current way is best. One approach might be a utility method event.add_bill
that creates a bill agenda item if it doesn't exist and adds the bill to it.openstates.exceptions.ScrapeError: no objects returned
when sessions are out, but it can be finnicky un-filter them at the right time when we do expect return vals.In some states this is easy because we just scrape what's in some list, but in many states we have to submit a form with a specific date range. To speed up scraping we've tried to pick sensible defaults (60 to 90 days in both directions depending on the state), but it's pretty inconsistent right now. AR is a good example
Rather than repeat variations of that code 50 times, we should have some unified utility function there, that takes either a (start, end) tuple, or a (start, days_before, days_after) tuple with a default start of today. My logic on days before/days after being separate is that states usually post stuff like minutes and agendas back to the site within a month past events, but we might want to grab events 90 days in the future without needlessly scraping older stuff.
Thanks for this! Near complete agreement on these, updated the proposal for 1, 3, and 7.
A couple of questions on the others:
For point 2, I see the point & agree but any thoughts on how you'd want to do this in a backwards compatible way? Or would we want to break compat here and add start_date & start_time as separate attrs & update the scrapers to work with that?
I think I might have to punt on number 4 for now, I like the idea but figuring out how that vote matching would work might be more worthy of a mini EP of its own.
For point 5, configurable validation of the classifications might best be left to its own EP as well. But I'd be glad to add this initial subset. Any objection to that for now, also thoughts on them being lower case to match the way classifications work elsewhere?
Finally, I'll take a look at point 6, probably doesn't require a whole separate EP but I'll see if there's a fairly straightforward fix for that. I've opened an issue here (https://github.com/openstates/issues/issues/480) to investigate that.
Great feedback, thanks Tim. I like the proposal and compromises James suggested above.
- Datetimeoptional works fine for things like bill actions...
Forgive me for being a bit unfamiliar, but what kinds of events are being scraped without at least a date? I wouldn't expect events to all have an end date, but I would expect them to at least have a start date. Or are we talking about events that have a start date, but no start time? I'm not clear on the real-world tradeoffs in this part of the discussion
Based on feedback, I've added the concept of an upstream_id
that can be scraped and explicitly noted that we'll be using soft deletes within the import (no changes on the scraper side for that).
I had a realization here while I was out, perhaps my DateTimeOptional issue is just with the JSON output.
Just as a quick example, start a WV events scrape, let it knock through the first committee, and search the output files for Subcommittee Report on Committee Substitute SB413
The actual event is the "Senate Agriculture and Rural Development Committee" on 2021-09-02
Due to an issue parsing the time provided by the state, it only has a date. But it's coming through the JSON as "start_date": "2021-09-02T04:00:00+00:00" (midnight EST). Rather than something like "2021-09-02"
If it's working fine in the database driven copy, then this isn't really an OS issue and I can either handle it with an extras key or do something on our end.
I'll take a look at that, I don't think it should be coming through like that in the JSON. Considering that a bug for now.
Also going to merge this as draft since work is now in progress, #34 will stay open for continued discussion & improvement.
I looked into the date vs datetime issue, and if a datetime.date (or string) is given to start_time, it does indeed come out properly in the JSON (as just the Y-m-d format) whereas in WV it looks like no matter what it is getting a datetime.datetime, which will print with the full time. I think this is the right behavior, and is what is used across all the JSON output, but we can discuss that on another issue if we want to talk about potential changes there.
Oh, that makes sense. I think i'd just been naively always using datetime and assuming it would smart trim, but this solution is better. I'll note to self/staff that one next time we messing with those.
Other quick response I'd missed:
For point 5, configurable validation of the classifications might best be left to its own EP as well. But I'd be glad to add this initial subset. Any objection to that for now, also thoughts on them being lower case to match the way classifications work elsewhere?
Sounds good to me!
Currently a very lightweight proposal to just restore them to the API, etc.
@showerst curious if you have any thoughts on changes to the schema/etc. that you'd like to see as part of us working on events again