Draft #7: Events - Githubissues

jamesturk commented 3 years ago

Currently a very lightweight proposal to just restore them to the API, etc.

@showerst curious if you have any thoughts on changes to the schema/etc. that you'd like to see as part of us working on events again

showerst commented 3 years ago

Great news!

I'll drop these in as comments here, happy to convert that to an OSPEP later if you want. For the most part event scrapers are in pretty good shape, I don't feel that there are any deal-breaking holes in the current structure.

The biggest one, we should standardize how windowing works. I'll break that out below.
Datetimeoptional works fine for things like bill actions, but it's a little more error prone for events. We have client-side logic to say "if an event is scheduled between midnight and 5 am local time don't show the time component" but an explicit "This event does not have a time component" would be nice. Sometimes you want to flag those 1am special session vote-a-thons or whatever.
Right now bills can only be linked to event_agenda items. This is cumbersome when an event doesn't have an itemized agenda, but does have bills. We've been creating a bills under consideration placeholder agenda item, but maybe there's a better way to do this? I do like the ability to link them to specific agenda items, so maybe the current way is best. One approach might be a utility method event.add_bill that creates a bill agenda item if it doesn't exist and adds the bill to it.
Since votes (can) have an explicit id now, maybe we should add a method to link them up? I'm not 100% that there are states that make this possible, but i'd check on CO and MA as possibilities.
It would be great to extend the event.add_document method with some kind of category variable. "Agenda", "Minutes", "Bill Draft", "Amendment", "Budget", "Testimony", "Transcript" and "Witness List" come to mind offhand. As with action classifications my ideal here is a preset list but the ability to turn off that validation and add an arbitrary string if you really want to.
This is more of a general scraper thing and maybe not germane here, but a way to return "No results, but this is not a scraper error" would be great for event scrapers. I end up filtering out tons of openstates.exceptions.ScrapeError: no objects returned when sessions are out, but it can be finnicky un-filter them at the right time when we do expect return vals.
add_media_item is getting more useful over time as states put this stuff online. It would be helpful to expand it to have a category: "video", "audio", "hosted video", "hosted audio" or something like that. The hosted vs non would be to discriminate from a raw stream that can be downloaded or embedded directly, vs some page with a player on it. I'd even create a special category for youtube since that's a common special case.

Windowing

In some states this is easy because we just scrape what's in some list, but in many states we have to submit a form with a specific date range. To speed up scraping we've tried to pick sensible defaults (60 to 90 days in both directions depending on the state), but it's pretty inconsistent right now. AR is a good example

Rather than repeat variations of that code 50 times, we should have some unified utility function there, that takes either a (start, end) tuple, or a (start, days_before, days_after) tuple with a default start of today. My logic on days before/days after being separate is that states usually post stuff like minutes and agendas back to the site within a month past events, but we might want to grab events 90 days in the future without needlessly scraping older stuff.

jamesturk commented 3 years ago

Thanks for this! Near complete agreement on these, updated the proposal for 1, 3, and 7.

A couple of questions on the others:

For point 2, I see the point & agree but any thoughts on how you'd want to do this in a backwards compatible way? Or would we want to break compat here and add start_date & start_time as separate attrs & update the scrapers to work with that?

I think I might have to punt on number 4 for now, I like the idea but figuring out how that vote matching would work might be more worthy of a mini EP of its own.

For point 5, configurable validation of the classifications might best be left to its own EP as well. But I'd be glad to add this initial subset. Any objection to that for now, also thoughts on them being lower case to match the way classifications work elsewhere?

Finally, I'll take a look at point 6, probably doesn't require a whole separate EP but I'll see if there's a fairly straightforward fix for that. I've opened an issue here (https://github.com/openstates/issues/issues/480) to investigate that.

jessemortenson commented 3 years ago

Great feedback, thanks Tim. I like the proposal and compromises James suggested above.

Datetimeoptional works fine for things like bill actions...

Forgive me for being a bit unfamiliar, but what kinds of events are being scraped without at least a date? I wouldn't expect events to all have an end date, but I would expect them to at least have a start date. Or are we talking about events that have a start date, but no start time? I'm not clear on the real-world tradeoffs in this part of the discussion

jamesturk commented 3 years ago

Based on feedback, I've added the concept of an upstream_id that can be scraped and explicitly noted that we'll be using soft deletes within the import (no changes on the scraper side for that).

showerst commented 2 years ago

I had a realization here while I was out, perhaps my DateTimeOptional issue is just with the JSON output.

Just as a quick example, start a WV events scrape, let it knock through the first committee, and search the output files for Subcommittee Report on Committee Substitute SB413 The actual event is the "Senate Agriculture and Rural Development Committee" on 2021-09-02

Due to an issue parsing the time provided by the state, it only has a date. But it's coming through the JSON as "start_date": "2021-09-02T04:00:00+00:00" (midnight EST). Rather than something like "2021-09-02"

If it's working fine in the database driven copy, then this isn't really an OS issue and I can either handle it with an extras key or do something on our end.

jamesturk commented 2 years ago

I'll take a look at that, I don't think it should be coming through like that in the JSON. Considering that a bug for now.

jamesturk commented 2 years ago

Also going to merge this as draft since work is now in progress, #34 will stay open for continued discussion & improvement.

jamesturk commented 2 years ago

I looked into the date vs datetime issue, and if a datetime.date (or string) is given to start_time, it does indeed come out properly in the JSON (as just the Y-m-d format) whereas in WV it looks like no matter what it is getting a datetime.datetime, which will print with the full time. I think this is the right behavior, and is what is used across all the JSON output, but we can discuss that on another issue if we want to talk about potential changes there.

showerst commented 2 years ago

Oh, that makes sense. I think i'd just been naively always using datetime and assuming it would smart trim, but this solution is better. I'll note to self/staff that one next time we messing with those.

showerst commented 2 years ago

Other quick response I'd missed:

For point 5, configurable validation of the classifications might best be left to its own EP as well. But I'd be glad to add this initial subset. Any objection to that for now, also thoughts on them being lower case to match the way classifications work elsewhere?

Sounds good to me!

openstates / enhancement-proposals

Draft #7: Events #28

Windowing