pyvideo / old-pyvideo-data

DEPRECATED: Video data for Python related conferences

missing conference data #20

Closed codersquid closed 8 years ago

codersquid commented 8 years ago

We are missing some data from pyvideo. I don't see pyohio or pygotham data, for example.

willkg commented 8 years ago

How curious.

Can someone make a list of the pyvideo urls of conferences that are missing from the pyvideo-data data? Anyone can do this; just add a comment to this issue.

Once that's done, I can go tinker with the extraction script and see what happened.

cldershem commented 8 years ago

I have been working on this data since it was announced that pyvideo.org was leaving us. I have also observed that at least some videos are missing from pyvideo-data (as of my last pull of it a while back). To work around this issue I scraped all the data from the API and found that all of the missing videos (at least the ones I noticed) were there.

From a quick search, this repo contains about 2800 videos while PyVideo.org claims 3439 videos, leaving ~700 videos unaccounted for. That is odd, because I also believe there are about 700 videos that did not have any form of url (source, mp4, flv, etc) outside of the Rackspace CDN. There also appears to be quite a bit of overlap between those videos and the ones missing in this data. I'm not sure why their data would be missing from this repository, but I am certain that all of the data is available via the API.

I have a sqlite db with all of the data from the API, and it's in the middle of a big cleanup to correct various pieces. I also have a copy of nearly all the 700 videos mentioned above (112 were bad urls or hosted on archive.org), because I feared that if pyvideo shut down we might not be able to find copies of them.

I can share the data I have, but as I mentioned, it's in the middle of being cleaned up (flagging missing data, fixing broken urls, expanding shortened urls, etc) and therefore doesn't 100% match the data that is available on the API or currently in this repo.

willkg commented 8 years ago

@cldershem Great! For this issue, we need a list of which conferences are on pyvideo that aren't on pyvideo-data. Can you produce that list from what you have? Making a bullet list in the comments on this issue would be fine.

khorn commented 8 years ago

All PyTexas videos appear to be missing from this repo:

Note that PyTexas 2015 exists on pyvideo.org but does not have any videos associated with it, as I don't think the metadata was ever submitted. So that's not "missing" per se.

codersquid commented 8 years ago

@khorn it looks like the metadata for PyTexas 2015 is there but in draft mode. Is someone from the conference responsible for approving the videos to be posted?

codersquid commented 8 years ago

@khorn forget I asked. I lost track of who added the data and thought it was Carl.

It looks like I added the data and never finished cleaning up the descriptions scraped from youtube and their conference page. Maybe I should just flip them all out of draft mode at this point.

khorn commented 8 years ago

@codersquid I don't know whether there was ever any "approval" given for posting the videos, but as I'm one of the organizers and sit on the board of the PyTexas Foundation, I suppose I can give it.

Feel free to make them "live" (as opposed to "draft") whenever you're ready. If there's some problem we can fix it later (just tell me where to send the "bug" request).

codersquid commented 8 years ago

@khorn thanks kindly. They are "live" on pyvideo now and will get pulled here when the extraction script is rerun.

khorn commented 8 years ago

I'm happy to (well maybe not happy to, but willing to) make a manual list of whatever other conferences are missing, but I think I'll see whether @cldershem's API-based approach bears fruit first.

khorn commented 8 years ago

@codersquid Awesome, thanks!

codersquid commented 8 years ago

@khorn you're welcome.

@cldershem we've got an api for fetching all the categories, http://pyvideo.org/api/v2/category/, and the directory names in this repo under data correspond to the category slug names.
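
Since the directory names under data match the category slugs, finding the missing conferences comes down to a set difference. Here is a rough sketch of that comparison, not a tested tool. The response shape in `fetch_category_slugs` (a JSON object with an "objects" list whose items carry a "slug" key) is an assumption about the API, pagination is not handled, and both function names and the slugs in the usage note are made up for illustration.

```python
import json
import os
import urllib.request


def fetch_category_slugs(url="http://pyvideo.org/api/v2/category/"):
    """Return the category slugs reported by the API.

    Assumption: the endpoint returns JSON shaped like
    {"objects": [{"slug": "..."}, ...]}. Adjust if the real
    payload differs or is split across several pages.
    """
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return [obj["slug"] for obj in payload["objects"]]


def missing_categories(api_slugs, data_dir):
    """Return the slugs that have no matching directory under data_dir."""
    present = {
        name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    }
    # Anything the API knows about but the repo lacks is "missing".
    return sorted(set(api_slugs) - present)
```

With a checkout of this repo, something like `missing_categories(fetch_category_slugs(), "data")` would give the list to paste into a comment here. The second function also works on a hand-written list of slugs, e.g. `missing_categories(["pyohio-2012", "pytexas-2014"], "data")`, if the API shape turns out to be different.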

willkg commented 8 years ago

@codersquid mentioned on IRC that we had found and fixed an issue in steve back in October and that it's probably the case that I ran the extractor script with a version of steve that didn't have that fix.

I rebuilt my virtualenv with that version of steve and re-extracted. The missing categories are:

willkg commented 8 years ago

Thank you everyone!