Daniel-at-github closed this issue 6 years ago
I asked on Twitter if they have the talk info in a structured format somewhere: https://twitter.com/jonemo/status/996113914743013376 🤞
Hey @jonemo, that's great, but our YouTube scraping tool may do the trick.
Want to give it a go and let us know? https://github.com/pyvideo/data/blob/master/tools/youtube.py
@logston: I've gone the YouTube scraping route twice for smaller conferences and would rather not do it for a big event like PyCon (~100 talks + keynotes + lightning talks + tutorials). Frankly, I also don't think that YouTube scraping alone is a viable approach to shrink the backlog.
A few specific reasons why access to the data from the conference management system would be preferable to the YouTube data for PyCon:
I hope that explains why I decided to try asking for the raw data first. If there is no response or a negative response and nobody else has done the Youtube scraping yet, I'll probably do it on Wednesday night.
One more thought: My understanding is that several regional Pycon conferences use the same software for managing the talk schedule as Pycon US. If we can figure out a way to get the raw data from Pycon US, we might be able to use their process as playbook with the other conferences that use the same software.
- The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)
Not a problem to scrape it.
- Talk languages are not in the YouTube data (it looks like PyCon had some Spanish-language talks this year)
Yes. Manual labor without conference data.
- Talk dates are not in the YouTube data
Yes. They are not perfect, but they are good enough; it's too much effort as it is.
- Speaker names have to be pulled out of talk titles or descriptions (easy to automate, and easy to automate incorrectly for edge cases like multiple speakers and speaker names with special characters)
Done
- Links to abstract pages are missing from Youtube and non-trivial to scrape and then correlate
Yes. Manual labor without conference data.
- AFAIK the Speakerdeck links are available in the system, but not in the YouTube description
Yes. Manual labor without conference data. Generic links are provided to ease access and (hopefully) encourage future contributions.
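The speaker-name bullet above is exactly the kind of thing that is easy to automate incorrectly. A minimal sketch of the split, assuming the "Speakers - Talk Title - PyCon 2018" upload format (the function name and the comma/"and" separators are assumptions, not anything from the existing tooling):

```python
import re

def split_title(youtube_title):
    """Split a "Speakers - Talk Title - PyCon 2018" YouTube title
    into a speaker list and the bare talk title."""
    # Drop the trailing "- PyCon 2018" suffix if present.
    base = re.sub(r"\s*-\s*PyCon 2018\s*$", "", youtube_title)
    # The first " - " separates speakers from the title; the title
    # itself may contain dashes, so split only once.
    speakers_part, _, title = base.partition(" - ")
    # Multiple speakers typically show up as "A, B" or "A and B".
    speakers = [s.strip() for s in re.split(r",|\band\b", speakers_part) if s.strip()]
    return speakers, title.strip()
```

Names with unusual punctuation or titles that themselves contain " - " would still need manual review, which is the point of the caveat above.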
I've done the "automatic" part of #495. Tomorrow I will try to continue. If anyone wants to correct a chunk of the talks, just say which files.
Thanks a bunch to you both for your input here, @Daniel-at-github, @jonemo
Now working on the files:
pycon-us-2018/videos/adam-englander-practical-api-security-pycon-2018.json
...
pycon-us-2018/videos/dustin-ingram-inside-the-cheeseshop-how-python-packaging-works-pycon-2018.json
Apologies for extrapolating from my own experience to others. Looks like everyone is happy with the YouTube scraping process!
The talks aren't even in a playlist yet (but I'm sure there is a nifty way to quickly create one)
Not a problem to scrape it.
@Daniel-at-github: Can you add instructions for how this works to the docs for future newbies like myself? Right now the documentation at https://github.com/pyvideo/data/blob/master/tools/youtube.py only explains how to scrape a playlist.
Can you add instructions for how this works to the docs for future newbies like myself?
When youtube.py was created, I made myself a hackish script to download the videos and never released it, to avoid being redundant (and somewhat out of shame over the ugly code). I wasn't aware that youtube.py only scraped playlists. I could tidy up my script and publish it as a second option.
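For reference, a single-video mode would only need the documented `videos` endpoint of the YouTube Data API v3 instead of `playlistItems`. A minimal sketch of building such a request (the helper name is made up, and you'd still need your own API key):

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def video_request_url(video_ids, api_key):
    """Build a YouTube Data API v3 request for individual video IDs
    (up to 50 per call) instead of a playlist."""
    params = {
        "part": "snippet,contentDetails",  # title/description + duration
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)
```

Fetching and parsing the JSON response would then follow the same shape as the playlist path in youtube.py.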
Looks like everyone is happy with the Youtube scraping process!
Not so happy; it's a ton of work. I don't want to discourage you from getting the raw data, only that with PyCon US I want to be more responsive. And I'm a bit skeptical about getting the data (I hope I'm wrong): I tried to get the raw data for pydata.org in #435, and talked with pinax/symposion about embedding a PyVideo record in it, without luck. You are right that YouTube scraping alone isn't a viable approach to shrink the backlog.
Other ways to get the data:
Summarizing: you are right @jonemo, this is not the most efficient way, but to stay relevant I think PyCon US has to be "fast". And I'm thankful that you want to make it more efficient and less work.
Now working on the files:
elizabeth-wickes-hard-shouldn-t-be-hardship-supporting-absolute-novices-to-python-pycon-2018.json
...
irina-truong-adapting-from-spark-to-dask-what-to-expect-pycon-2018.json
Now working on the files:
jack-diederich-howto-write-a-function-pycon-2018.json
...
mridul-seth-eric-ma-network-analysis-made-simple-part-ii-pycon-2018.json
My usual process when contributing to https://github.com/pyvideo/data:
Scraping the conference website for talk details (and the schedule with dates) is not too difficult, but that's probably just me being biased, since I did quite a lot of scraping while working for Scrapinghub and on https://github.com/scrapy/scrapy itself. I tend to re-use and adapt the same base scraper for all the conference websites I've scraped for PyVideo; the easiest ones to adapt to are the PyData conferences around the world. (I pushed some of my scrapers to https://github.com/redapple/pyvideo-contrib, but that was really just so I don't lose them.)
Scraping the YouTube playlist: I've also adapted the youtube.py script a bit, mainly to add the duration. I think we can safely add optional command-line options for everyone, for extra details we can grab from YouTube, based on what @Daniel-at-github and I have used locally; @Daniel-at-github, we should probably sync on that.
What consumes most of my time when working on an issue for https://github.com/pyvideo/data is joining the YouTube videos with the talk metadata. I have also been reusing a simple script to match on title and speaker data, but I'd really like a generic tool to do the matching. I've started a project to do record linkage in a visual way, with https://github.com/dedupeio/dedupe in mind as the main workhorse, but I don't have anything to show right now; if anyone wants to join the effort, ping me. In a few words, the goal is:
cleaning the merged dataset:
My thoughts on the quality of the metadata per talk:
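As a stopgap before a proper dedupe-based tool, the title-matching step described above can be sketched with the stdlib alone; the normalization and the 0.5 cutoff here are arbitrary choices for illustration, not anything from the existing scripts:

```python
import difflib

def normalize(title):
    """Lowercase, replace punctuation with spaces, split into words."""
    return "".join(c.lower() if c.isalnum() else " " for c in title).split()

def best_match(youtube_title, conference_titles, cutoff=0.5):
    """Return the conference talk whose title best matches a YouTube
    title, or None when nothing clears the similarity cutoff."""
    scored = [
        (difflib.SequenceMatcher(None, normalize(youtube_title), normalize(t)).ratio(), t)
        for t in conference_titles
    ]
    score, title = max(scored)
    return title if score >= cutoff else None
```

Matching on word sequences rather than raw strings keeps speaker prefixes and "PyCon 2018" suffixes from drowning out the title itself; dedupe would do the same job far more robustly.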
@jonemo it seems that PyCon should have a field on the web side to add the video data to the schedule. If you want to go this route, it seems possible to add it to symposion (the upstream conference project).
@redapple after finishing this issue I will tidy up my (unrelated) version of youtube.py and we could see how (or if) to remix them.
Scraping is not my strong suit. For merging NoSQL datasets, maybe jmespath is useful, and I was wondering whether this was already done in one of the existing Python ETL tools (I've seen OpenRefine, but was searching for something Python-related). I hadn't seen dedupe; I'll have to take a look at it.
From reading through the symposion source code I found this: https://us.pycon.org/2018/schedule/conference.json
Sadly, it seems like the `video_url`, `slides_url`, and `asset_urls` fields remain unused. The `conf_url` field might be a nice addition for the `related_urls` list.
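If the export does become usable, pulling those URLs out could look roughly like this. The top-level `schedule` key and the per-talk `name` field are assumptions about the export layout, based only on the fields visible in the thread:

```python
import json

def related_urls_from_schedule(schedule_json):
    """Collect the extra URLs per talk from a conference.json export.

    Only conf_url and slides_url are considered here; empty or
    missing values are skipped."""
    result = {}
    for talk in json.loads(schedule_json)["schedule"]:
        urls = [talk[key] for key in ("conf_url", "slides_url") if talk.get(key)]
        result[talk["name"]] = urls
    return result
```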
@jonemo I will continue to push through manually, but when it's done we could merge the two datasets: retrieving `conf_url` as a `related_url`, the date (I have to see what time zone it is in), and the tutorial tag (the `kind` field).
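On the time-zone question: PyCon US 2018 was in Cleveland, so naive schedule timestamps are presumably US Eastern time. A small sketch of normalizing them to UTC (the ISO timestamp format is an assumption about the export):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def schedule_time_to_utc(naive_iso):
    """Interpret a naive schedule timestamp as US Eastern time and
    convert it to UTC for the PyVideo record."""
    local = datetime.fromisoformat(naive_iso).replace(tzinfo=ZoneInfo("America/New_York"))
    return local.astimezone(timezone.utc)
```

Using a named zone rather than a fixed offset keeps the conversion correct across the DST boundary.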
But, as @redapple said, I think this isn't a trivial task:
what consumes most of my time when working on an issue for https://github.com/pyvideo/data is the joining of YouTube videos and talks metadata
Now working on the files:
nathaniel-j-smith-trio-async-concurrency-for-mere-mortals-pycon-2018.json
...
russell-keith-magee-building-a-cross-platform-native-app-with-beeware-pycon-2018.json
Now working on finishing the files.
Done. Anyone up for a review?
When youtube.py was created, I made myself a hackish script to download the videos and never released it, to avoid being redundant (and somewhat out of shame over the ugly code). I wasn't aware that youtube.py only scraped playlists. I could tidy up my script and publish it as a second option.
I published the script at https://github.com/Daniel-at-github/pyvideo_scrape
May 9–17, Cleveland