Re-downloading updated courses results to download all lessons even if already been downloaded

r0oth3x49 / udemy-dl

A cross-platform python based utility to download courses from udemy for personal offline use.

MIT License


Re-downloading updated courses results to download all lessons even if already been downloaded #414

Closed travis-south closed 4 years ago

travis-south commented 5 years ago

I downloaded a course last month. Say the course was updated yesterday and I want to download only the updates. When I tried, it correctly identified which lessons were already downloaded, but once it reached the newly added lesson, it downloaded that lesson and then re-downloaded every lesson after it as well.

I think this happens because already-downloaded lessons are matched against the course's lessons by filename only. Since the downloaded files carry a number prefix, inserting a new lesson bumps the number of every lesson from that point to the end, so none of those filenames match anymore.

I'm not sure what the best way to avoid or fix this is. Maybe strip the prefix when comparing filenames/titles?

Thanks.
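A minimal sketch of the prefix-stripping comparison proposed above. It assumes filenames look like "003 Title.mp4" (digits, a separator, then the title); `strip_prefix` and `lessons_to_download` are hypothetical helpers, not functions from udemy-dl.

```python
import re

# Assumed naming scheme: a numeric prefix, a separator, then the title,
# e.g. "012 Introduction.mp4" or "012-Introduction.mp4".
PREFIX_RE = re.compile(r"^\d+[\s._-]+")


def strip_prefix(filename):
    """Return the filename without its leading lesson number."""
    return PREFIX_RE.sub("", filename, count=1)


def lessons_to_download(remote_names, local_names):
    """Return remote lessons whose title (ignoring the prefix) is not on disk."""
    have = {strip_prefix(name) for name in local_names}
    return [name for name in remote_names if strip_prefix(name) not in have]


local = ["001 Intro.mp4", "002 Setup.mp4", "003 Wrap-up.mp4"]
remote = ["001 Intro.mp4", "002 Setup.mp4", "003 New lesson.mp4", "004 Wrap-up.mp4"]
print(lessons_to_download(remote, local))  # ['003 New lesson.mp4']
```

Matching on the title alone means two lessons with identical titles would be treated as one, so this is only a starting point.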

r0oth3x49 commented 5 years ago

@travis-south I will fix the issue after reproducing it.

teamcrisis commented 4 years ago

I'm having this same issue. Seems related to the number prefixing.

Maybe prefix with the format [chapter]-[auto increment]?

r0oth3x49 commented 4 years ago

> I'm having this same issue. Seems related to the number prefixing.
>
> Maybe prefix with the format [chapter]-[auto increment]?

This idea seems good to me. I will check it, thanks for the suggestion.

Nightreaver commented 4 years ago

Or just restart the lesson numbering at 001 in every chapter.
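To illustrate the per-chapter numbering idea, here is a small sketch. The "<chapter>-<lecture> <title>" layout and the `numbered_names` helper are assumptions made for this example, not udemy-dl's actual naming code.

```python
def numbered_names(chapters):
    """Build "<chapter>-<lecture> <title>" names, restarting the count per chapter.

    `chapters` is a list of (chapter_title, [lecture_title, ...]) pairs.
    """
    names = []
    for chapter_no, (_chapter_title, lectures) in enumerate(chapters, start=1):
        for lecture_no, title in enumerate(lectures, start=1):
            names.append(f"{chapter_no:02d}-{lecture_no:03d} {title}")
    return names


old = [("Basics", ["Intro", "Setup"]), ("Advanced", ["Tuning"])]
new = [("Basics", ["Intro", "New lesson", "Setup"]), ("Advanced", ["Tuning"])]

# Names that exist after the update but not before it.
print(sorted(set(numbered_names(new)) - set(numbered_names(old))))
```

With this scheme a lesson added to the first chapter renames only the files after it in that chapter; "02-001 Tuning" keeps its name. Lessons later in the same chapter still shift, and inserting a whole new chapter would still renumber the chapters after it.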

smart-lemon commented 4 years ago

What I did was use a script to download the courses and then remove duplicates:

import os

PATH_OF_DATA = "/Volumes/PenDriveWithCourses/"

CMD_PRE = "/usr/local/bin/python3 /Users/YourWsp/Documents/GIT/udemy-dl/udemy-dl.py https://yourcompanysthing.udemy.com/"
CMD_POST = "/ -k /Users/YourWsp/Documents/cookies.txt --skip-sub -q 480 -o "

# fdupes: -r recurses into subfolders, -d deletes duplicates, -N keeps the first file without prompting
CMD_CLEANUP = "fdupes -N -i -r -d "

# Go through the drive and iterate over the downloaded course folders
for foldername in os.listdir(PATH_OF_DATA):
    print("+++++++++++++++++++++++++++ Directory is " + foldername + " ++++++++++++++++++++++++++++++++++++++")
    # Re-run udemy-dl for this course; the folder name is appended to the base URL
    CMD = CMD_PRE + foldername + CMD_POST + PATH_OF_DATA
    os.system(CMD)

    # Remove duplicates
    CMD_NEXT = CMD_CLEANUP + PATH_OF_DATA + foldername + "/"
    os.system(CMD_NEXT)

Every couple of months I check for updates and delete the old (duplicate) content.

smart-lemon commented 4 years ago

I have another solution to it:

In _shared.py, use this function to check whether the file has already been downloaded:

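import os

def check_if_already_exists(filepath):
    """Return True if a file with the same name, ignoring the number prefix, exists."""
    dirname = os.path.dirname(filepath)
    dlfilename = os.path.basename(filepath)

    # Nothing can exist yet if the target directory has not been created
    if not os.path.isdir(dirname):
        return False

    # Drop the first 4 characters (assumes a 3-digit prefix plus one separator)
    shortened_filename = dlfilename[4:]
    for filename in os.listdir(dirname):
        local_filename = filename[4:]
        if local_filename == shortened_filename:
            return True

    print("Not found : " + shortened_filename + ", Downloading ...")
    return False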

Usage: just before downloading, check whether the file already exists:

+ if check_if_already_exists(filepath):
+     retVal = {"status": "True", "msg": "already downloaded, with a different name"}
+     return retVal

  if os.path.isfile(filepath):
      retVal = {"status": "True", "msg": "already downloaded"}
      return retVal

Then use the fdupes tool to delete dupes:

fdupes -N -i -r -d /path/to/files

It's lame but it works. The right way would be to compare the file sizes as well (not just the names).
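A rough sketch of that name-plus-size check. `already_downloaded` and its `expected_size` parameter are hypothetical: the sketch assumes the remote size in bytes is known from somewhere, for example an HTTP Content-Length header, and it reuses the same fixed 4-character prefix assumption as the `[4:]` slice above.

```python
import os


def already_downloaded(filepath, expected_size, prefix_len=4):
    """Return True if a file with the same title and byte size is already on disk.

    expected_size is a hypothetical input: the remote file size in bytes.
    prefix_len assumes a three-digit prefix plus one separator character.
    """
    dirname = os.path.dirname(filepath) or "."
    wanted = os.path.basename(filepath)[prefix_len:]
    if not os.path.isdir(dirname):
        return False
    for name in os.listdir(dirname):
        candidate = os.path.join(dirname, name)
        if name[prefix_len:] == wanted and os.path.isfile(candidate):
            # Same title under a different number: accept only if the size matches.
            if os.path.getsize(candidate) == expected_size:
                return True
    return False
```

Equal size does not prove equal content, so this narrows the false positives of a name-only check rather than eliminating them.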

Nightreaver commented 4 years ago

Well, your solution has issues: a teacher can re-upload a new file under the old name, and you will miss it. From what I have seen, Udemy currently doesn't deliver any hashes, so a better solution would be to create a "manifest" file that stores lecture counts and maybe hashes, check whether any chapter has been altered, then re-download the files and check whether they match.

But even that would have to re-download all lectures to make sure you don't have old ones. It would at least be easier to check for dupes right away.
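One way the manifest idea could look, as a sketch only. It assumes one folder per chapter inside the course directory; the function names, the JSON layout, and the `remote_counts` mapping (chapter name to lecture count, built from whatever the course listing reports) are all hypothetical.

```python
import hashlib
import json
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(course_dir):
    """Map each chapter folder to its lecture count and per-file SHA-256 hashes."""
    manifest = {}
    for chapter in sorted(os.listdir(course_dir)):
        chapter_path = os.path.join(course_dir, chapter)
        if not os.path.isdir(chapter_path):
            continue  # skip loose files such as the manifest itself
        files = sorted(
            name for name in os.listdir(chapter_path)
            if os.path.isfile(os.path.join(chapter_path, name))
        )
        manifest[chapter] = {
            "lecture_count": len(files),
            "sha256": {n: file_sha256(os.path.join(chapter_path, n)) for n in files},
        }
    return manifest


def save_manifest(course_dir, manifest, name="manifest.json"):
    """Write the manifest next to the chapter folders."""
    with open(os.path.join(course_dir, name), "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2)


def changed_chapters(old_manifest, remote_counts):
    """Return chapters whose remote lecture count differs from the stored one."""
    return sorted(
        chapter for chapter, count in remote_counts.items()
        if old_manifest.get(chapter, {}).get("lecture_count") != count
    )
```

The lecture count only catches additions and removals. A file replaced under the same name still needs a re-download before its hash can be compared with the stored one, which is the limitation noted above.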

r0oth3x49 commented 4 years ago

I have some ideas in mind. I will check and push the updates soon.

xd003 commented 4 years ago

Does this issue still exist?

Nightreaver commented 4 years ago

yes

dportabella commented 4 years ago

coursera-dl has a "--resume" option. Maybe you can reuse the same approach: https://github.com/coursera-dl/coursera-dl#resuming-downloads

r0oth3x49 commented 4 years ago

> in coursera-dl they have a "--resume" option. Maybe you can reuse the same approach: https://github.com/coursera-dl/coursera-dl#resuming-downloads

Resume capability is already there.

The issue arises when a course chapter or lecture gets updated in any way (rename, new video file with the same name, chapter addition).

I have managed to tackle the case of a new chapter or lecture being added, but for existing ones (rename, new video with the same name) I'm still checking whether we can implement it.

I will also check whether the Udemy API provides some sort of dates that indicate when something was updated.

r0oth3x49 commented 4 years ago

@all I will re-open the issue when I plan to work on it with some tricks. Currently I don't see anything in the Udemy API that can be used to track whether videos, chapters, or the course have been updated. I'm closing the issue as a future enhancement and will see if I can implement some sort of fix for it. In the meantime, PRs and suggestions are welcome.