simmeringratchet / LDSGeneralConferenceDownloader

A script to download General Conference talks for offline listening
MIT License

Feature/fix url and add better ordering with numbers #3

Open GatorQue opened 4 years ago

GatorQue commented 4 years ago

Thank you for creating the General Conference Downloader tool. I fixed a few issues and added a new feature. Please accept this pull request or comment on what you would like me to change.

GatorQue commented 4 years ago

While testing the full range I ran into a problem trying to download MP3 files from 2016. I'm going to investigate and try to find a fix for this.

GatorQue commented 4 years ago

OK it should be fixed now, waiting to see how it does on older talks.

jdshaeffer commented 4 years ago

any update on this?

GatorQue commented 4 years ago

I haven't heard anything from the original author but I have been told that if you use this branch it works great.

clarkshaeffer commented 4 years ago

Hi, I'm experiencing a problem with your branch:

Problem with http request (https://www.churchofjesuschrist.org/languages: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
The given language (eng) is not available. Please choose one of the following:

I'm running Python 3.8.0 on macOS 10.14.6. Anything helps! I need more of President Nelson in my life!!

GatorQue commented 4 years ago

Thank you for reaching out. I did a quick Google search on your error and came up with the following:
https://stackoverflow.com/questions/22027418/openssl-python-requests-error-certificate-verify-failed

It suggests that you type this in:
pip install certifi

You might also search for the Install Certificates program described here:
https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate
/Applications/Python\ 3.8/Install\ Certificates.command

The issue is that some of the "root" certificates on your computer are missing, so Python is unable to validate the SSL connection to the church's website. Installing these "root" certificates will enable you to run the program.
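If it helps to see where Python is looking for those root certificates, here is a minimal diagnostic sketch. It uses only the standard library `ssl` module; the helper name `describe_ca_sources` is made up for illustration, and the `certifi` lookup only reports something if that package happens to be installed.

```python
import os
import ssl


def describe_ca_sources():
    """Report where this Python looks for root certificates and whether they exist."""
    paths = ssl.get_default_verify_paths()
    report = {
        "cafile": paths.cafile,
        "cafile_exists": bool(paths.cafile and os.path.isfile(paths.cafile)),
        "capath": paths.capath,
        "capath_exists": bool(paths.capath and os.path.isdir(paths.capath)),
        "certifi_bundle": None,
    }
    try:
        # Optional: certifi ships its own CA bundle; where() returns its path.
        import certifi
        report["certifi_bundle"] = certifi.where()
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for key, value in describe_ca_sources().items():
        print(f"{key}: {value}")
```

If neither the CA file nor the CA directory exists, that would be consistent with the "unable to get local issuer certificate" error above, though other causes are possible.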

clarkshaeffer commented 4 years ago

Worked like a charm! Thanks!

rafaelmx commented 4 years ago

Hello, I'm a bit embarrassed that I have to ask this but... How do I run this script? I'm totally new to this but I'm super excited to get this working. So far I've done the following:

1) Installed Python 3.8 on Windows 10
2) Downloaded the original files and extracted them in a folder I created (C:\Users\rsanc\Documents\Rafa\LDS-GC)
3) Modified the three files GatorQue edited (I didn't know how to download them with the changes, so I made the changes one by one).

And this is what I think I'm not doing correctly:

4) I opened a terminal and navigated to the folder with the extracted files of the script.
5) I typed on the terminal window python pip install -r requirements.txt and didn't get any message (previously I didn't use "python" at the beginning but got an error message, with no message I assumed I was doing it right)
6) I typed on the terminal window python gen_conf_downloader.py. Nothing. No message. I also tried python gen_conf_downloader.py -s 2018 -d C:\GC with no success.

I tried steps 4 through 6 both before and after modifying the files, with the same results. Any help?

GatorQue commented 4 years ago

Hello! Welcome! I'm not 100% sure, but I suspect that your terminal window doesn't know how to find Python. Do you remember if you checked the "Add Python 3.8 to PATH" checkbox on the first screen? If not, can you try uninstalling Python 3.8 and reinstalling it again and make sure this checkbox is checked? I think once you do this the command "pip install -r requirements.txt" should work as expected (you will see a bunch of things downloaded and installed probably) and "python gen_conf_downloader.py -s 2018" should work.
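One way to confirm whether the terminal can find Python and pip is a small check like the one below. This is only a sketch: `locate_commands` is a hypothetical helper, and it relies on the standard library's `shutil.which`, which searches the same PATH the terminal uses. It needs some working Python to run, so it is most useful after the reinstall.

```python
import shutil


def locate_commands(commands=("python", "pip")):
    """Map each command name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in locate_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A "not found on PATH" result for `python` or `pip` would point to the installer checkbox described above.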

rafaelmx commented 4 years ago

Wow... it worked! Just as you imagined, I didn't check the "Add Python 3.8 to PATH" option, so I uninstalled it and installed it again. It worked perfectly. Thank you very much for your help. I'm impressed with the result.

One question, what will happen with the current files the next time I download new audios? For instance, I downloaded just from 2018 and 2019, what if I want to download from 2016? Will this script skip those already downloaded?

GatorQue commented 4 years ago

During the download the Python script usually makes a cache of all the HTML pages it downloads. This enables it to avoid downloading those pages again. As long as you don't remove the cache directory, I think it should work as you expect. I believe it does recreate the playlist files though, since those are usually affected by new downloads. There are playlist files created by topic, speaker, and session, if I recall correctly.
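To illustrate the general idea of an on-disk page cache, here is a minimal sketch. It is not the script's actual code: the function name `fetch_cached`, the hashed file naming, and the `fetch` callback are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def fetch_cached(url, cache_dir, fetch):
    """Return the page text for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cached = cache_dir / name
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = fetch(url)
    cached.write_text(text, encoding="utf-8")
    return text
```

With something like this in place, a second run over the same years only touches the network for pages that are not already in the cache directory.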

Jacobobber1087 commented 7 months ago

There have been a few changes to the church website, is there any chance this gets an update?

GatorQue commented 7 months ago

@Jacobobber1087, I have been keeping this tool updated under my GitHub fork of this project. Have you given that a try? https://github.com/GatorQue/LDSGeneralConferenceDownloader/releases I use it for myself after every conference. If my version isn't working, I will be happy to look into it.

Jacobobber1087 commented 7 months ago

@GatorQue Oh ok, thank you! It seems to work, but the destination folder is empty after it completes, do you know what could cause this?

GatorQue commented 7 months ago

@Jacobobber1087, I see the same results. Let me look into what is causing this and post a new version. Something must have changed in the format of the HTML to prevent the program from working right.

GatorQue commented 7 months ago

@Jacobobber1087 - It seems that the church has hidden the MP3 download link behind the "Options" side panel, which only seems to load when you click on the "Options" button (3 dots) and then click on the Download arrow. There is JavaScript code which loads the Options side panel, and the Download arrow loads the link somehow. I haven't found a good way to do that with my current approach. To fix this, I will need to see if I can find a Python-based web browser that is capable of running the JavaScript needed to make the MP3 media link appear. I will keep looking into this, but it isn't going to be an easy fix like I was hoping.

Jacobobber1087 commented 7 months ago

@GatorQue Yeah, I was very curious how you were getting around the Javascript in previous versions of this haha... I ended up writing an automation in Microsoft Power Automate Desktop that uses Firefox to iterate through the sites and manually click to the download link. It technically worked but it took forever and was super clunky. Is there any way to interact with Javascript through a script that you know of?

GatorQue commented 7 months ago

@Jacobobber1087, Great question. In the recent past the MP3 URL could be found in the giant BASE64 content in the initial HTML download. This changed sometime after October 2023. From my research today, I have found that if the following JavaScript lines can be executed after the page loads, they should produce a DOM that includes an element (the last one mentioned) whose href value is what we want for the MP3 file:

document.querySelector('[title="Options"]').click()
document.querySelector('button[data-testid="download-menu-button"]').click()
document.querySelector("a[data-testid=\"download-link-0\"]").href

As far as tools are concerned, I have initially looked at splash, a docker image with Qt5 WebKit and an HTTP API for performing queries (usually paired with the Python scrapy-splash package). I have also discovered requests-html, which uses a headless chromium install downloaded using the pyppeteer Python package (but since that package has been abandoned, the download fails). There is also the Python package Selenium, which also uses a headless chromium to perform web scrapes and which I haven't done anything with yet.

I think if we can combine the above JavaScript lines somehow with a headless install of chromium, we might be able to retrieve the information we need. Another approach would be to identify WHAT/HOW the JavaScript downloads and modifies the DOM to create the "This Page (MP3)" download reference element when we click on the Options and Download arrows. Yet another approach might be to "predict" the media URL by guessing the filename that would be used from the information in the initial HTML, but I suspect that might be a less stable (though certainly faster) approach. Thoughts?
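As a rough sketch of how the three selectors quoted above could be driven from Python with Selenium: the code below passes them to `execute_script` and polls until each element shows up. This is not the fork's implementation. The function names, the polling loop, and the timeout values are assumptions, the `--headless=new` flag assumes a recent Chrome, and the selectors may stop matching whenever the site changes.

```python
import time

# Selectors taken from the JavaScript lines in the thread.
OPTIONS_SELECTOR = '[title="Options"]'
DOWNLOAD_MENU_SELECTOR = 'button[data-testid="download-menu-button"]'
DOWNLOAD_LINK_SELECTOR = 'a[data-testid="download-link-0"]'

# Small scripts run in the page; arguments[0] is the CSS selector.
CLICK_JS = (
    "var el = document.querySelector(arguments[0]);"
    "if (el) { el.click(); return true; } return false;"
)
HREF_JS = (
    "var el = document.querySelector(arguments[0]);"
    "return el ? el.href : null;"
)


def _poll(driver, script, selector, timeout, interval):
    """Run script until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(script, selector)
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def find_mp3_url(driver, talk_url, timeout=10.0, interval=0.5):
    """Open the talk page, click Options and Download, return the link href or None."""
    driver.get(talk_url)
    for selector in (OPTIONS_SELECTOR, DOWNLOAD_MENU_SELECTOR):
        if not _poll(driver, CLICK_JS, selector, timeout, interval):
            return None
    return _poll(driver, HREF_JS, DOWNLOAD_LINK_SELECTOR, timeout, interval)


def make_headless_chrome():
    """Create a headless Chrome driver (requires the selenium package and Chrome)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

Usage would look roughly like `driver = make_headless_chrome()`, then `find_mp3_url(driver, talk_url)`, then `driver.quit()`. Keeping the driver as a parameter means the lookup logic can be exercised with a stand-in object, without a browser.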

Jacobobber1087 commented 7 months ago

@GatorQue Ok cool. I hadn't heard of a headless browser before, that seems like a really good solution. Would the browser need to be in the foreground? I assume not if you're sending requests through JS? Predicting the URL would be tricky, they use titles for some General Authorities (but not all) and you would have to know the mp3 bitrate. If this information is in the HTML that could work really well. How did you get the list of the links to each conference? I had to do that manually because of how the church groups the conferences on their website. I wonder if there is any way to access /assets/general-conference/ on the media2.ldscdn.org site? It doesn't allow a direct visit, maybe wget?

GatorQue commented 7 months ago

@Jacobobber1087, A headless browser means it doesn't provide a GUI/window. This means requests must be sent some other way, usually through some REST API or other technique. Splash uses a custom REST API which allows for injecting some additional JavaScript commands to be processed after the page loads (which I haven't gotten to work fully yet).

As far as getting the list of conferences, I perform an HTTP GET request for /study/general-conference and parse the HTML using several regular expressions to extract each conference, its sessions, and its talks into Python tuple objects. Feel free to look at the gen_conf_downloader.py file in my repository for more details.

I will need to re-review the talk and conference HTML to see if enough information could be extracted to predict the media2 URL to use to get the MP3 file. As far as mp3 bitrate, we could just have it try a few different bitrates until it finds one that works.

Unfortunately, there is no way to "browse" for a list of all files on the media2.ldscdn.org site that I have found. Perhaps there is a hidden index file that would give the complete list, but I haven't seen evidence of this yet. The wget program wouldn't likely yield any different results against the media2 website. I did a wget against the talk and it practically started downloading all conference years and talks, since they are all interlinked together, so I gave up since I want people to be able to limit the conferences they wish to download. I didn't let it run long enough to see if the mp3 files could be discovered, but I suspect it wouldn't because of the JavaScript menu factor.
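The "try a few bitrates until one works" idea could look something like the sketch below. It is only an illustration: the URL template is supplied by the caller because the real media URL pattern is not established in this thread, and `url_exists` and `first_working_url` are hypothetical helper names. The existence check sends an HTTP HEAD request with the standard library and assumes the server answers HEAD requests.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError/HTTPError and socket timeouts.
        return False


def first_working_url(template, bitrates, exists=url_exists):
    """Fill template with each bitrate in order; return the first URL that exists."""
    for bitrate in bitrates:
        candidate = template.format(bitrate=bitrate)
        if exists(candidate):
            return candidate
    return None
```

The `exists` parameter is there so the probing order can be checked without network access, for example by passing a function that looks URLs up in a set.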

GatorQue commented 7 months ago

@Jacobobber1087, I am happy to report that using Selenium was successful in obtaining the media URL. The results are cached to a file, which is enabled by default now, such that future downloads will be faster. I am doing some more testing but should have an updated release posted soon.

Jacobobber1087 commented 4 months ago

@GatorQue Sorry for the late reply. I am currently serving as a missionary for the church so I do not have reliable access to a computer. I will look forward to the next release.