Daniel-at-github closed this issue 6 years ago
I asked on Twitter if they have the talk info in a structured format somewhere: https://twitter.com/jonemo/status/996113914743013376 🤞
Hey @jonemo, that's great, but our YouTube scraping tool may do the trick.
Want to give it a go and let us know? https://github.com/pyvideo/data/blob/master/tools/youtube.py
@logston: I've gone the YouTube scraping route twice for smaller conferences and would rather not do it for a big event like PyCon (~100 talks + keynotes + lightning talks + tutorials). Frankly, I also don't think that YouTube scraping alone is a viable approach to shrink the backlog.
A few specific reasons why access to the data from the conference management system would be preferable to the YouTube data for PyCon:
I hope that explains why I decided to try asking for the raw data first. If there is no response or a negative response and nobody else has done the Youtube scraping yet, I'll probably do it on Wednesday night.
One more thought: My understanding is that several regional Pycon conferences use the same software for managing the talk schedule as Pycon US. If we can figure out a way to get the raw data from Pycon US, we might be able to use their process as playbook with the other conferences that use the same software.
- The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)
Not a problem to scrape it.
- Talk languages are not in the YouTube data (it looks like PyCon had some Spanish-language talks this year)
Yes. Manual labor without conference data.
- Talk dates are not in the YouTube data
Yes. They are not perfect, but they are good enough; it's too much effort as it is.
- Speaker names have to be pulled out of talk titles or descriptions (easy to automate, and easy to automate incorrectly for edge cases like multiple speakers and speaker names with special characters)
Done
- Links to abstract pages are missing from Youtube and non-trivial to scrape and then correlate
Yes. Manual labor without conference data.
- AFAIK the Speakerdeck links are available in the system, but not in the YouTube description
Yes. Manual labor without conference data. Generic links are provided to ease access and (hopefully) encourage future contributions.
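The speaker-name bullet above is exactly the kind of thing that is easy to automate incorrectly. A minimal sketch of the split, assuming the "Speakers - Talk Title - PyCon 2018" upload format (the function name and the comma/"and" separators are assumptions, not anything from the existing tooling):

```python
import re

def split_title(youtube_title):
    """Split a "Speakers - Talk Title - PyCon 2018" YouTube title
    into a speaker list and the bare talk title."""
    # Drop the trailing "- PyCon 2018" suffix if present.
    base = re.sub(r"\s*-\s*PyCon 2018\s*$", "", youtube_title)
    # The first " - " separates speakers from the title; the title
    # itself may contain dashes, so split only once.
    speakers_part, _, title = base.partition(" - ")
    # Multiple speakers typically show up as "A, B" or "A and B".
    speakers = [s.strip() for s in re.split(r",|\band\b", speakers_part) if s.strip()]
    return speakers, title.strip()
```

Names with unusual punctuation or titles that themselves contain " - " would still need manual review, which is the point of the caveat above.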
I've done the "automatic" part of #495. Tomorrow I will try to continue. If anyone wants to correct a chunk of the talks, just say which files.
Thanks a bunch to you both for your input here, @Daniel-at-github, @jonemo
Now working on the files:
pycon-us-2018/videos/adam-englander-practical-api-security-pycon-2018.json
...
pycon-us-2018/videos/dustin-ingram-inside-the-cheeseshop-how-python-packaging-works-pycon-2018.json
Apologies for extrapolating from my own experience to others. Looks like everyone is happy with the YouTube scraping process!
The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)
Not a problem to scrape it.
@Daniel-at-github: Can you add instructions for how this works to the docs for future newbies like myself? Right now the documentation at https://github.com/pyvideo/data/blob/master/tools/youtube.py only explains how to scrape a playlist.
Can you add instructions for how this works to the docs for future newbies like myself?
When youtube.py was created, I made myself a hackish script to download the videos and never released it, to avoid being redundant (and somewhat out of shame over the ugly code). I wasn't aware that youtube.py only scraped playlists. I could tidy up my script and publish it as a second option.
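For reference, a single-video mode would only need the documented `videos` endpoint of the YouTube Data API v3 instead of `playlistItems`. A minimal sketch of building such a request (the helper name is made up, and you'd still need your own API key):

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def video_request_url(video_ids, api_key):
    """Build a YouTube Data API v3 request for individual video IDs
    (up to 50 per call) instead of a playlist."""
    params = {
        "part": "snippet,contentDetails",  # title/description + duration
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)
```

Fetching and parsing the JSON response would then follow the same shape as the playlist path in youtube.py.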
Looks like everyone is happy with the Youtube scraping process!
Not so happy; it's a ton of work. I don't want to discourage you from getting the raw data, only that with PyCon US I want to be more responsive. And I'm a bit skeptical about getting the data (I hope I'm wrong): I tried to get the raw data for pydata.org in #435, and talked with pinax/symposion about embedding a PyVideo record in it, without luck. You are right that YouTube scraping alone isn't a viable approach to shrink the backlog.
Other ways to get the data:
Summarizing: you are right @jonemo, this is not the most efficient way, but to stay relevant I think PyCon US has to be "fast". And I'm thankful that you want to make it more efficient and less work.
Now working on the files:
elizabeth-wickes-hard-shouldn-t-be-hardship-supporting-absolute-novices-to-python-pycon-2018.json
...
irina-truong-adapting-from-spark-to-dask-what-to-expect-pycon-2018.json
Now working on the files:
jack-diederich-howto-write-a-function-pycon-2018.json
...
mridul-seth-eric-ma-network-analysis-made-simple-part-ii-pycon-2018.json
My usual process when contributing to https://github.com/pyvideo/data:
Scraping the conference website for talk details (and the schedule with dates) is not too difficult, but that's probably just me being biased, since I did quite a lot of scraping while working for Scrapinghub and on https://github.com/scrapy/scrapy itself. I tend to re-use and adapt the same base scraper for all the conference websites I've scraped for PyVideo; the easiest ones to adapt to are the PyData conferences around the world. (I pushed some of my scrapers to https://github.com/redapple/pyvideo-contrib, but that was really just so I don't lose them.)
Scraping the YouTube playlist: I've also adapted the youtube.py script a bit, mainly to add the duration. I think we can safely add optional command-line options for everyone, for extra details we can grab from YouTube, based on what @Daniel-at-github and I have used locally; @Daniel-at-github, we should probably sync on that.
What consumes most of my time when working on an issue for https://github.com/pyvideo/data is joining the YouTube videos with the talk metadata. I have also been reusing a simple script to match on title and speaker data, but I'd really like a generic tool to do the matching. I've started a project to do record linkage in a visual way, with https://github.com/dedupeio/dedupe in mind as the main workhorse, but I don't have anything to show right now; if anyone wants to join the effort, ping me. In a few words, the goal is:
cleaning the merged dataset:
My thoughts on the quality of the metadata per talk:
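As a stopgap before a proper dedupe-based tool, the title-matching step described above can be sketched with the stdlib alone; the normalization and the 0.5 cutoff here are arbitrary choices for illustration, not anything from the existing scripts:

```python
import difflib

def normalize(title):
    """Lowercase, replace punctuation with spaces, split into words."""
    return "".join(c.lower() if c.isalnum() else " " for c in title).split()

def best_match(youtube_title, conference_titles, cutoff=0.5):
    """Return the conference talk whose title best matches a YouTube
    title, or None when nothing clears the similarity cutoff."""
    scored = [
        (difflib.SequenceMatcher(None, normalize(youtube_title), normalize(t)).ratio(), t)
        for t in conference_titles
    ]
    score, title = max(scored)
    return title if score >= cutoff else None
```

Matching on word sequences rather than raw strings keeps speaker prefixes and "PyCon 2018" suffixes from drowning out the title itself; dedupe would do the same job far more robustly.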
@jonemo it seems that PyCon should have a field on the web side to add the video data to the schedule. If you want to go this route, it seems possible to add it to symposion (the upstream conference project).
@redapple after finishing this issue I will tidy up my (unrelated) version of youtube.py and we could see how (or if) to remix them.
Scraping is not my strong suit. For merging NoSQL datasets, maybe jmespath is useful, and I was wondering whether this was already done in one of the existing Python ETL tools (I've seen OpenRefine, but was searching for something Python-related). I hadn't seen dedupe; I'll have to take a look at it.
From reading through the symposion source code I found this: https://us.pycon.org/2018/schedule/conference.json
Sadly, it seems like the `video_url`, `slides_url`, and `asset_urls` fields remain unused. The `conf_url` field might be a nice addition for the `related_urls` list.
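If the export does become usable, pulling those URLs out could look roughly like this. The top-level `schedule` key and the per-talk `name` field are assumptions about the export layout, based only on the fields visible in the thread:

```python
import json

def related_urls_from_schedule(schedule_json):
    """Collect the extra URLs per talk from a conference.json export.

    Only conf_url and slides_url are considered here; empty or
    missing values are skipped."""
    result = {}
    for talk in json.loads(schedule_json)["schedule"]:
        urls = [talk[key] for key in ("conf_url", "slides_url") if talk.get(key)]
        result[talk["name"]] = urls
    return result
```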
@jonemo I will continue to push through manually, but when it's done we could merge the two datasets: retrieving `conf_url` as a `related_url`, the date (I have to see what time zone it is in), and the tutorial tag (the `kind` field).
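On the time-zone question: PyCon US 2018 was in Cleveland, so naive schedule timestamps are presumably US Eastern time. A small sketch of normalizing them to UTC (the ISO timestamp format is an assumption about the export):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def schedule_time_to_utc(naive_iso):
    """Interpret a naive schedule timestamp as US Eastern time and
    convert it to UTC for the PyVideo record."""
    local = datetime.fromisoformat(naive_iso).replace(tzinfo=ZoneInfo("America/New_York"))
    return local.astimezone(timezone.utc)
```

Using a named zone rather than a fixed offset keeps the conversion correct across the DST boundary.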
But, as @redapple said, I think this isn't a trivial task:
what consumes most of my time when working on an issue for https://github.com/pyvideo/data is the joining of YouTube videos and talks metadata
Now working on the files:
nathaniel-j-smith-trio-async-concurrency-for-mere-mortals-pycon-2018.json
...
russell-keith-magee-building-a-cross-platform-native-app-with-beeware-pycon-2018.json
Now working on finishing the files.
Done. Anyone up for a review?
When youtube.py was created, I made myself a hackish script to download the videos and never released it, to avoid being redundant (and somewhat out of shame over the ugly code). I wasn't aware that youtube.py only scraped playlists. I could tidy up my script and publish it as a second option.
I published the script at https://github.com/Daniel-at-github/pyvideo_scrape
May 9–17, Cleveland