pyvideo / data

Python related videos and metadata powering PyVideo.
https://pyvideo.org
Creative Commons Zero v1.0 Universal

PyCon US 2018 #494

Closed Daniel-at-github closed 6 years ago

Daniel-at-github commented 6 years ago

May 9-17 Cleveland

 - title: PyCon US 2018 
   dir: pycon-us-2018
   youtube_list: https://www.youtube.com/channel/UCsX05-2sVSH7Nx3zuk3NYuQ/videos
   language: en
   related_urls:
   - label: schedule
     url: https://us.pycon.org/2018/schedule/talks/

jonemo commented 6 years ago

I asked on Twitter if they have the talk info in a structured format somewhere: https://twitter.com/jonemo/status/996113914743013376 🤞

logston commented 6 years ago

Hey @jonemo, that's great but our youtube scraping tool may do the trick.

Want to give it a go and let us know? https://github.com/pyvideo/data/blob/master/tools/youtube.py

jonemo commented 6 years ago

@logston: I've done the Youtube scraping route twice for smaller conferences and would rather not do it for a big event like Pycon (~100 talks + keynotes + lightning talks + tutorials). Frankly, I also don't think that Youtube scraping alone is a viable approach to shrink the backlog.

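A few specific reasons why access to the data from the conference management system would be preferable over the Youtube data for Pycon:

  • The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)
  • Talk languages are not in the Youtube data (it looks like Pycon had some Spanish language talks this year)
  • Talk dates are not in the Youtube data
  • Speaker names have to be pulled out of talk titles or descriptions (easy to automate and easy to automate incorrectly for edge cases like multiple speakers and speaker names with special characters)
  • Links to abstract pages are missing from Youtube and non-trivial to scrape and then correlate
  • Afaik the Speakerdeck links are available in the system, but not in the Youtube description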

I hope that explains why I decided to try asking for the raw data first. If there is no response or a negative response and nobody else has done the Youtube scraping yet, I'll probably do it on Wednesday night.

One more thought: My understanding is that several regional Pycon conferences use the same software for managing the talk schedule as Pycon US. If we can figure out a way to get the raw data from Pycon US, we might be able to use their process as playbook with the other conferences that use the same software.

Daniel-at-github commented 6 years ago
  • The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)

Not a problem to scrape it.

  • Talk languages are not in the Youtube data (it looks like Pycon had some Spanish language talks this year)

Yes. Manual labor without conference data.

  • Talk dates are not in the Youtube data

Yes. They are not perfect, but they are good enough. It's too much effort as it is.

  • Speaker names have to be pulled out of talk titles or descriptions (easy to automate and easy to automate incorrectly for edge cases like multiple speakers and speaker names with special characters)

Done

  • Links to abstract pages are missing from Youtube and non-trivial to scrape and then correlate

Yes. Manual labor without conference data.

  • Afaik the Speakerdeck links are available in the system, but not in the Youtube description

Yes. Manual labor without conference data. Generic links are provided to ease access to it and (hopefully) future contributions.
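
The speaker-name point above can be illustrated with a minimal sketch. It assumes the video titles follow a "Speakers - Talk title - PyCon 2018" pattern, which is only a guess based on the file names in this thread; `split_title` is a hypothetical helper, not part of youtube.py.

```python
import re


def split_title(raw_title):
    """Split 'Speakers - Talk title - PyCon 2018' into (speakers, title).

    Assumption: fields are separated by ' - ' (with spaces), so hyphenated
    names such as 'Keith-Magee' are left intact.
    """
    parts = [part.strip() for part in raw_title.split(" - ")]
    if len(parts) < 3:
        # Unexpected shape: return no speakers so the record gets reviewed by hand.
        return [], raw_title.strip()
    # The last field is the event name; anything in the middle is the talk title.
    title = " - ".join(parts[1:-1])
    # Several speakers may be joined by ',', '&' or the word 'and'.
    speakers = [
        name
        for name in re.split(r"\s*(?:,|&|\band\b)\s*", parts[0])
        if name
    ]
    return speakers, title


print(split_title("Adam Englander - Practical API Security - PyCon 2018"))
print(split_title("Lightning Talks"))
```

Titles that do not fit the pattern come back with an empty speaker list, which keeps the edge cases visible instead of guessing at them.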

Daniel-at-github commented 6 years ago

Done the "automatic" part in #495. Tomorrow I will try to continue. If anyone wants to correct a chunk of the talks, just say which files.

logston commented 6 years ago

Thanks a bunch to you both for your input here, @Daniel-at-github, @jonemo

Daniel-at-github commented 6 years ago

Now working in the files:

pycon-us-2018/videos/adam-englander-practical-api-security-pycon-2018.json
...
pycon-us-2018/videos/dustin-ingram-inside-the-cheeseshop-how-python-packaging-works-pycon-2018.json

jonemo commented 6 years ago

Apologies for extrapolating from my own experience to others. Looks like everyone is happy with the Youtube scraping process!

jonemo commented 6 years ago

The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)

Not a problem to scrape it.

@Daniel-at-github: Can you add instructions for how this works to the docs for future newbies like myself? Right now the readme at https://github.com/pyvideo/data/blob/master/tools/youtube.py only explains how to scrape a playlist.

Daniel-at-github commented 6 years ago

Can you add instructions for how this works to the docs for future newbies like myself?

When youtube.py was created I had made myself a hackish script to download the videos, and I never released it so as not to be redundant (and partly to avoid the shame of ugly code). I wasn't aware that youtube.py only scraped lists. I could tidy up my script and publish it as a second option.

Looks like everyone is happy with the Youtube scraping process!

Not so happy, it's a ton of work. I don't want to discourage you from getting the raw data; it's only that with PyCon US I want to be more responsive. I'm also a bit skeptical about getting the data (and I hope that I'm wrong): I tried to get the raw data for pydata.org (#435) and talked with pinax/symposion about embedding a pyvideo record in it, without luck. You are right when you say that Youtube scraping alone isn't a viable approach to shrink the backlog.

Other ways to get the data:

Summarizing: You are right @jonemo, this is not the efficient way, but to stay meaningful I think that PyCon US has to be "fast". And I'm thankful that you want to make it more efficient and less work.

Daniel-at-github commented 6 years ago

Now working in the files:

elizabeth-wickes-hard-shouldn-t-be-hardship-supporting-absolute-novices-to-python-pycon-2018.json
...
irina-truong-adapting-from-spark-to-dask-what-to-expect-pycon-2018.json

Daniel-at-github commented 6 years ago

Now working in the files:

jack-diederich-howto-write-a-function-pycon-2018.json
...
mridul-seth-eric-ma-network-analysis-made-simple-part-ii-pycon-2018.json

redapple commented 6 years ago

My usual process when contributing to https://github.com/pyvideo/data

My thoughts on the quality of the metadata per talk:

Daniel-at-github commented 6 years ago

@jonemo it seems that PyCon should have code in its website to enter the video data into the schedule. If you want to go this route, it seems possible to add it to symposion (the upstream conference project).

@redapple after finishing this issue I will tidy up my (unrelated) version of youtube.py and we could see how (or if) to remix it. Scraping is not my strong suit; for merging nosql datasets, maybe jmespath is useful, and I was wondering if this was already done in one of the existing Python ETL tools (I've seen openrefine but was searching for something Python related). I hadn't seen dedupe; I have to take a look at it.

jonemo commented 6 years ago

From reading through the symposion source code I found this: https://us.pycon.org/2018/schedule/conference.json

Sadly, it seems like the video_url, slides_url, and asset_urls fields remain unused. The conf_url field might be a nice addition for the related_urls list.
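
A rough sketch of how conf_url could feed the related_urls list, assuming each entry in conference.json is a flat object carrying the fields named above. The sample entry, its URL, and the helper `related_urls_from_entry` are made up for illustration; the real file's layout has not been checked here.

```python
def related_urls_from_entry(entry):
    """Build a related_urls list (label/url pairs) from one schedule entry.

    Assumption: `entry` is a dict with optional 'conf_url' and 'slides_url'
    keys; empty or missing values are skipped.
    """
    urls = []
    if entry.get("conf_url"):
        urls.append({"label": "schedule", "url": entry["conf_url"]})
    if entry.get("slides_url"):
        urls.append({"label": "slides", "url": entry["slides_url"]})
    return urls


# Hypothetical entry; only the field names come from the discussion above.
sample_entry = {
    "name": "Example talk",
    "kind": "talk",
    "conf_url": "https://example.org/2018/schedule/presentation/1/",
    "video_url": "",
    "slides_url": "",
    "asset_urls": [],
}

print(related_urls_from_entry(sample_entry))
```

The label/url shape mirrors the related_urls block in the event description at the top of this issue.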

Daniel-at-github commented 6 years ago

@jonemo I will continue to push through manually, but when it's done we could merge the two datasets, to retrieve conf_url as a related_url, the date (I have to see what time zone it is in) and the tutorial tag (field kind). But, as @redapple said, I think that this isn't a trivial task:

what consumes most of my time when working on an issue for https://github.com/pyvideo/data is the joining of YouTube videos and talks metadata
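
One way the join described above could be approached is fuzzy matching on talk titles. This is a minimal sketch using the standard library's difflib, not the process anyone in this thread actually used; `match_titles` and the 0.8 cutoff are assumptions, and unmatched titles are left as None for manual review.

```python
import difflib


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())


def match_titles(video_titles, schedule_titles, cutoff=0.8):
    """Map each video title to the closest schedule title, or None.

    Assumption: titles on both sides are close enough that a similarity
    ratio of `cutoff` or more means "same talk".
    """
    lookup = {normalize(title): title for title in schedule_titles}
    matches = {}
    for video_title in video_titles:
        close = difflib.get_close_matches(
            normalize(video_title), list(lookup), n=1, cutoff=cutoff
        )
        matches[video_title] = lookup[close[0]] if close else None
    return matches


schedule = ["Practical API security", "HOWTO Write a Function"]
videos = ["Practical API Security", "Howto write a function"]
for video, talk in match_titles(videos, schedule).items():
    print(video, "->", talk)
```

Anything that comes back as None, or that matches the wrong talk at a lower cutoff, still needs the manual pass mentioned above.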

Daniel-at-github commented 6 years ago

Now working in the files:

nathaniel-j-smith-trio-async-concurrency-for-mere-mortals-pycon-2018.json
...
russell-keith-magee-building-a-cross-platform-native-app-with-beeware-pycon-2018.json

Daniel-at-github commented 6 years ago

Now working on finishing the files.

Daniel-at-github commented 6 years ago

Done. Anyone to review?

Daniel-at-github commented 6 years ago

When youtube.py was created I had made myself a hackish script to download the videos, and I never released it so as not to be redundant (and partly to avoid the shame of ugly code). I wasn't aware that youtube.py only scraped lists. I could tidy up my script and publish it as a second option.

I published the script at https://github.com/Daniel-at-github/pyvideo_scrape