Open waldoj opened 6 years ago
Note that the URL defined in `ThumbnailUri` just 404s.
Also note that this event is scheduled to go for just short of 24 hours, so that range doesn't look real reliable.
Uh. It looks like there's no bulk downloads? Just streaming?
This is very doable. Here's a helpful script to grab the video—turns out you can just concat all of the MP4s together!
I think the necessary process here is to build on the existing infrastructure:
- An `rs-machine` script that retrieves upcoming schedules and stores them in MySQL, in the `meetings` table (that is, matching those records with the existing schedule data)
- An `rs-machine` script that gets the URL for those videos, when they're done, stores the URL in SQS, and starts the `rs-video-processor` instance
- An `rs-video-processor` routine that grabs queued videos
- Storing the `files.id` of the resulting video in `meetings`, to close the loop

The URL for a given video must be constructed from the data file: it's `http://sg001-harmony.sliq.net/00304/Harmony/en/PowerBrowser/PowerBrowserV2/YYYYMMDD/-1/ID`, with `YYYYMMDD` and `ID` available as fields in the data file.
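A rough sketch of that URL construction, in Python. The function name and its parameters are made up for illustration; the data file's actual field names aren't shown here, so the date and ID are simply passed in as arguments.

```python
from datetime import date

# Base path taken from the URL pattern described above.
BASE_URL = (
    "http://sg001-harmony.sliq.net/00304/Harmony/en/"
    "PowerBrowser/PowerBrowserV2"
)


def build_video_page_url(meeting_date: date, content_id: int) -> str:
    """Build the player-page URL: BASE_URL/YYYYMMDD/-1/ID.

    `meeting_date` and `content_id` are hypothetical names; they stand in
    for whatever the date and ID fields are called in the data file.
    """
    return f"{BASE_URL}/{meeting_date:%Y%m%d}/-1/{content_id}"


if __name__ == "__main__":
    # 1234 is a placeholder ID, not a real meeting.
    print(build_video_page_url(date(2017, 12, 18), 1234))
```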
The actual URL to retrieve video from is in the page body, in `script` tags, e.g.:
var availableStreams = [{"GlobalEssenceFormatId":4,"IsLive":false,"Enabled":true,"AudioOnly":false,"VideoIndex":null,"AudioIndex":null,"StreamFormatId":12,"Url":"http://sg002-livein01.sliq.net/00304-vod/_definst_/2017/12/18/Appropriations_2017-12-18-09.00.00_2115_12.mp4/playlist.m3u8","Lang":"","StreamAssemblerList":null,"PreRoll":0.0,"Duration":9662,"Id":2239,"Tag":"Video"}];
So get the `Url` value from that, hack off `/playlist.m3u8`, and iterate from there.
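Something like this could do the extraction, assuming the page always embeds the list as `var availableStreams = [...]` the way the example above does. The function name is invented for this sketch, and fetching the page is left out; it just takes the HTML as a string.

```python
import json
import re

# Matches the start of the assignment; the JSON array begins right after it.
ASSIGNMENT_RE = re.compile(r"var\s+availableStreams\s*=\s*")
PLAYLIST_SUFFIX = "/playlist.m3u8"


def extract_stream_base_urls(page_html: str) -> list[str]:
    """Return each stream's Url with the trailing /playlist.m3u8 removed."""
    match = ASSIGNMENT_RE.search(page_html)
    if match is None:
        return []

    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the ";" and the rest of the script).
    streams, _end = json.JSONDecoder().raw_decode(page_html, match.end())

    base_urls = []
    for stream in streams:
        url = stream.get("Url") or ""
        if url.endswith(PLAYLIST_SUFFIX):
            url = url[: -len(PLAYLIST_SUFFIX)]
        if url:
            base_urls.append(url)
    return base_urls
```

Using `raw_decode` rather than a regex for the whole array avoids guessing where the JSON ends.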
I skimmed through this test video, helpfully recorded by legislative staff. Here are the three types of chyrons that I saw:
Large and small, basically; I think the second two are just variations of the same thing. (One has `Secretary of Finance` written under the caption text.) The large one has text running under the seal. It's a fair bet that the purpose of this test run was to identify those problems, and that they won't be an issue in production.
So, really, just two types of chyrons. I think the best test will be to check for the presence of two blue pixels. If the bottom left pixel is blue, then it's a short chyron, and crop accordingly. If it isn't, but a pixel above and to the left is blue, then it's a large chyron and, again, crop accordingly.
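Here's a minimal sketch of that two-pixel test. Everything specific in it is an assumption: the frame is taken to be a plain grid of RGB tuples (`frame[y][x]`), the "blue" thresholds are guesses, and the coordinates of the second probe pixel would have to be measured from real frames.

```python
from typing import Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]        # (red, green, blue), 0-255 each
Frame = Sequence[Sequence[Pixel]]   # indexed as frame[y][x]


def is_blue(pixel: Pixel, min_blue: int = 120, margin: int = 40) -> bool:
    """Guess at 'blue': a strong blue channel that dominates red and green."""
    red, green, blue = pixel
    return blue >= min_blue and blue - red >= margin and blue - green >= margin


def classify_chyron(frame: Frame, large_probe: Tuple[int, int]) -> Optional[str]:
    """Return "short", "large", or None for a single frame.

    `large_probe` is a hypothetical (x, y) position that only the large
    chyron covers; it is not taken from any real measurement.
    """
    bottom_left = frame[len(frame) - 1][0]
    if is_blue(bottom_left):
        return "short"

    x, y = large_probe
    if is_blue(frame[y][x]):
        return "large"

    return None
```

The caller would then pick a crop box based on the returned label before handing the region to OCR.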
I'm dubious that the format of the top text is established at this point, but it should be pretty easy to extract. Bill number and patron. The bottom text could be useful as a sanity check, in case of an OCR error for the top text, since that's the bill's catch line.
Huh. Here's a completely different approach to chyron-text placement, from today's test video. (There's no real video just yet.)
Looks like the tick-tock can be grabbed from the page source itself, defined as `dataModel`.
Oh, lawd...new chyron styles for the Senate.
The bill text is all stretchy, the chyrons are smaller, and the video is flipped horizontally, for some reason? Ugh.
There's now a streaming video interface for committee meetings! For the House, it's completely different than the one for floor video, but for the Senate, it's Granicus. But the good news is that the House vendor has JSON representations of the data. So this view of the next week's scheduled videos also includes this JSON representation. (There's also monthly JSON.)
At this moment, the governor is speaking in Appropriations, and this is the JSON representation:
Seems to me that there are three things to be done: