Temporal filename "yt_videos_list_temp.txt"

shailshouryya / yt-videos-list

Create and **automatically** update a list of all videos on a YouTube channel (in txt/csv/md form) via YouTube bot with end-to-end web scraping - no API tokens required. Multi-threaded support for YouTube videos list updates.

Apache License 2.0

Temporal filename "yt_videos_list_temp.txt" #7

Closed tfmotu closed 3 years ago

tfmotu commented 3 years ago

Hi, first of all, thanks for this tool. I previously used another well-known tool to download this information; it was fast and worked well, but as you probably know, it no longer works. yt_videos_list fits well in my environment. Now to the issue (if that is even the right word). I have a list of YouTube channels and users, and I need to download all video URLs from them. I use multiprocessing to reduce the execution time. The "problem" is that the temporary file name is always the same, yt_videos_list_temp.*, so if you run multiple threads in parallel there are errors. Of course I can work around this in other ways, but I think it would be good if the temporary file name were based on the YouTube channel ID or user ID. Thanks again.

shailshouryya commented 3 years ago

Hello, thanks for filing an issue!

I actually ran into this problem recently as well but didn't think of changing it since I wasn't sure if it would be useful, but with the use case you brought up I think it's definitely a good change (and also one I didn't think of).

I'll push a fix shortly that should fix it, and then I'll merge it into the master branch if you think it looks okay. 🙂

shailshouryya commented 3 years ago

Updated the code to create the temp file using the YouTube channel's name as the temporary file name instead of using yt_videos_list_temp.extension.
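A minimal sketch of the idea, not the package's actual code: derive the temp file name from the channel so that parallel runs never write to the same path. The helper name `temp_file_name` and the character-sanitizing rule are assumptions made for illustration only.

```python
import re


def temp_file_name(channel_name: str, extension: str) -> str:
    """Hypothetical helper: build a temp file name that is unique per channel."""
    # Assumed rule: collapse anything outside letters, digits, "_" and "-"
    # into a single underscore so the result is safe to use as a file name.
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", channel_name).strip("_")
    return f"temp_{safe}.{extension}"


# Two channels scraped at the same time now get two different temp files.
print(temp_file_name("channel one", "txt"))
print(temp_file_name("channel two", "txt"))
```

Because the name depends only on the channel, two runs on the same channel would still collide; this sketch only separates runs on different channels.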

Can you retry the multiprocessing code you have using the updated changes by running the following? @asiergda

git clone https://github.com/Shail-Shouryya/yt_videos_list.git
cd yt_videos_list\python  # Windows
cd yt_videos_list/python  # MacOS/Linux
git checkout issue7
pip install .    # Windows
python           # Windows
pip3 install .   # MacOS/Linux
python3          # MacOS/Linux

from yt_videos_list import ListCreator

lc = ListCreator()
lc.create_list_for(url='the_channel_url')

If you're interested, you can compare the current changes to the code before here, and/or if there's anything you'd like to add go ahead and make a pull request for that branch - or just add a comment here!

Please let me know if this update fixes the problem, and if it doesn't, I'll work on fixing any further problems as soon as possible. 🤓

EDIT: see #8 for the changes - merging the changes into master leads to no difference in the comparison at the comparison link mentioned earlier in this comment

tfmotu commented 3 years ago

Hi Shail-Shouryya, thanks for the reply. I think it is OK:

...snip
No new videos were found since the last scroll. Waiting another 1.0 seconds to see if more videos can be loaded....
Reached end of page!
It took 45.21135544800006 seconds to find all 731 videos from https://www.youtube.com/channel/blablabla/videos

Opening a temp txt file and writing NEW video information to the file....
Finished writing to temp_blablabla.txt
0 NEW videos written to temp_blablabla.txt
Closing temp_blablabla.txt
Successfully completed write, renamed temp_blablabla.txt to temp_blablabla.txt
It took 53.34290599799999 seconds to write the 0 NEW videos to the pre-existing temp_blablabla.txt

This program took 118.052043777 seconds to complete.
...snip

I will test more thoroughly with all the channels and users. In this case I only used 2 channels and 2 users, but as I said, I think it is correct. If everything works, I will no longer need my workaround. As a workaround, my current script creates a folder for each channel and user, copies the needed files (scripts/functions) into that folder, and then runs the code from there, using 4 threads. Thank you!
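The folder-per-channel workaround described above could be sketched roughly as follows. This is an illustration under assumptions, not the reporter's script: `working_directory` and `scrape_in_own_folder` are hypothetical names, and `scrape` stands for any callable that writes its output into the current directory.

```python
import contextlib
import os
from pathlib import Path


@contextlib.contextmanager
def working_directory(path):
    """Create `path` if needed and make it the current directory for a while."""
    previous = Path.cwd()
    Path(path).mkdir(parents=True, exist_ok=True)
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always restore the original directory, even if the scrape fails.
        os.chdir(previous)


def scrape_in_own_folder(base_dir, name, url, scrape):
    """Run scrape(url) inside base_dir/name so output files cannot collide.

    With this package, `scrape` could be something like
    lambda url: ListCreator().create_list_for(url=url)
    """
    with working_directory(Path(base_dir) / name):
        return scrape(url)
```

One caveat with this design: `os.chdir` changes the directory for the whole process, so a sketch like this is only safe with one process per channel (for example via `multiprocessing`), not with several threads sharing one process.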

shailshouryya commented 3 years ago

Awesome, happy I could help simplify it!

The workaround you mentioned makes sense, but I'm glad you pointed out this problem since changing the output file name was a pretty simple change in the back end and should help avoid naming clashes if anyone decides to run more than 1 instance of this at the same time.

I'll leave this issue open just in case something doesn't work properly or some other related problem pops up, and you can update the issue with any successes/failures after you test them.

Can you add your test results (it worked, didn't work, something else happened) here after you test them? Feel free to take your time 🙂

tfmotu commented 3 years ago

Hi Shail-Shouryya, sadly, if I use the same folder with multithreading, the script sometimes fails. Here is one error: ...snip selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: //yt-formatted-string[@class='style-scope ytd-channel-name'] ...snip

But as I said, I have solved this with a different approach: creating a folder per channel or user and running the code in those folders using multithreading. In that case the script has never failed.

The numbers: for this job I use a Raspberry Pi 4B (4 GB RAM). geckodriver must be built for ARM, since there is currently no release for ARM architectures.

geckodriver version 0.28.0
Mozilla Firefox 78.4.1esr

At this time the application downloads video information from almost one hundred channels/users, and it processes info for more than 31 thousand videos. The script takes about an hour and a half to complete the job. These are the exact numbers from the last execution:

Channels/Users: 89
Video Info: 31806
Time: 5466.3 sec

Thanks! P.S.: Please close this enhancement if you want; it is fine with me.

shailshouryya commented 3 years ago

Hi @asiergda,

Sorry for the long delay, but I updated the package with bugfixes for the Issues filed recently (#3 & #4) along with significantly more robust threading support in release 0.5.0, so try updating the package and seeing if that also helps! 🤓

pip3 install -U yt_videos_list # macOS
pip  install -U yt_videos_list # Windows

Regarding your specific error

selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: //yt-formatted-string[@class='style-scope ytd-channel-name']

I never ran into that error, and I tested ~160-170 times

with each test running on the same channel on 2 threads
with one thread running with reverse_chronological=True
and the other with reverse_chronological=False

Based on the other similar problems (such as here), that error pops up when the page hasn't loaded all elements before your program tries to find an element. The yt_videos_list program doesn't explicitly have a wait timer for elements to load, but perhaps if you still get the error with the new 0.5.0 release I'll add that in (looking at this issue again now, I think I should have done that in 0.5.0 itself 😅).

To see the support the package now provides for multi-threading, take a look at the Scraping multiple channels from a file simultaneously with multi-threading section in the Python README and see if that helps!

To get a more detailed look at everything 0.5.0 covers, take a look at the 0.5.0 release page.

I realize this took a long time to get updated, but I was testing to make sure the multi-threading support was robust and the logging was more descriptive. The Python README also has significantly more info, so take a look at everything else there too!

shailshouryya commented 3 years ago

UPDATE:

addressed the selenium.common.exceptions.NoSuchElementException exception in release 0.5.1.

The program now waits for elements on the YouTube channel to load before proceeding, so if the channel doesn't load the elements in 9 seconds (time limit for waiting), the program sys.exit()s out. This should directly address the error you were seeing! There's no real workaround for this since YouTube might throttle your IP address if it detects constant automated activity, so you might just need to wait a bit before rerunning the program.
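The wait-then-exit behavior described above can be illustrated with a small polling loop. This is a standard-library sketch under assumptions, not the code shipped in 0.5.1: `wait_for` and `require` are hypothetical names, and `locate` stands for any function that returns the element, or None while it is still missing. With Selenium itself, the usual tool for this is `WebDriverWait` combined with an expected condition.

```python
import sys
import time


def wait_for(locate, timeout=9.0, poll_interval=0.5):
    """Call locate() until it returns something other than None or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        found = locate()
        if found is not None:
            return found
        if time.monotonic() >= deadline:
            return None  # gave up: the element never appeared in time
        time.sleep(poll_interval)


def require(locate, timeout=9.0, poll_interval=0.5):
    """Return the element, or exit the program if it does not load in time."""
    element = wait_for(locate, timeout, poll_interval)
    if element is None:
        sys.exit("Channel elements did not load in time; try again later.")
    return element
```

The 9 second default mirrors the time limit mentioned above; the 0.5 second polling interval is an arbitrary choice for the sketch.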

All the changes made in release 0.5.0 to explicitly check for and avoid threading errors should also help!

tfmotu commented 3 years ago

Updated, thanks!

shailshouryya commented 3 years ago

Closing this issue since no further problems seem to have popped up in the past 2 weeks. @asiergda please don't hesitate to reopen this issue or file a new issue if another problem comes up!