iamkroot / ilc-scraper

A scraper for Impartus Lecture Capture videos for BITS Hyderabad
67 stars 30 forks source link

[BUG] Destination Path issue on windows when subjects name have "/ " in it. #56

Closed brahma19 closed 3 years ago

brahma19 commented 3 years ago

Context

Additional context Subject name contains "/" as it is common between two tracks. "Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517)"

Describe the bug Code is deriving the destination folder name from the subject name and since it has forward slash in the name it is breaking. https://github.com/iamkroot/ilc-scraper/blob/master/ilc_scrape.py#L277

Screenshots If applicable, add screenshots to help explain your problem.

Program output

Click to see output

``` Couldn't find C:\Users\<>\Downloads\imp_config.json. Will skip for now. Traceback (most recent call last): File "pathlib.py", line 1273, in mkdir FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\Users\\<>\\Documents\\WILP\\lectures\\Financial Analytics (S2-20_PDBACZG517 \\ S2-20_PDFTCZG517) WILP Session' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "ilc_scrape.py", line 347, in File "ilc_scrape.py", line 278, in main File "pathlib.py", line 1278, in mkdir File "pathlib.py", line 1273, in mkdir FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\Users\\<>\\Documents\\WILP\\lectures\\Financial Analytics (S2-20_PDBACZG517 \\ S2-20_PDFTCZG517) WILP Session' ```

brahma19 commented 3 years ago

Will it work without breaking anything else

working_dir: Path = args.dest / subject_name.replace("/","_") ?

https://github.com/iamkroot/ilc-scraper/blob/master/ilc_scrape.py#L277

iamkroot commented 3 years ago

It would be better to just do the change in subject name itself, and working_dir will automatically become correct.

brahma19 commented 3 years ago

Sure, let me know if below are ok for unit testing? will open PR. `>>> import json

x = '{"subjectName": "Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517)","sessionName": "WILP Session"}' y = json.loads(x) print(y) {'subjectName': 'Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517)', 'sessionName': 'WILP Session'} lectures = y subject_name = "{subjectName} {sessionName}".format(lectures) print(subject_name) Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517) WILP Session subject_name = "{subjectName} {sessionName}".format(lectures).replace("/","_") print(subject_name) Financial Analytics (S2-20PDBACZG517 S2-20_PDFTCZG517) WILP Session`

iamkroot commented 3 years ago

That works file, but I would prefer just using an exiting implementation of the same. Since we need to consider that there are many other invalid characters for Windows. Just google "file path sanitizer", pick a good one, and add it to utils.py

brahma19 commented 3 years ago

Looks like this https://pypi.org/project/pathvalidate/ is already doing what is needed here.

brahma19 commented 3 years ago

Using pathvalidate: The default target platform is universal. i.e. the sanitized file name is valid for any platform. Let me know if its ok. `└─$ python3

import json from pathvalidate import sanitize_filename as cleanpath x = '{"subjectName": "Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517)","sessionName": "WILP Session"}' y = json.loads(x) print(y) {'subjectName': 'Financial Analytics (S2-20_PDBACZG517 / S2-20_PDFTCZG517)', 'sessionName': 'WILP Session'} lectures = y subject_name = cleanpath("{subjectName} {sessionName}".format(**lectures)) print(subject_name) Financial Analytics (S2-20_PDBACZG517 S2-20_PDFTCZG517) WILP Session `

adding another test case with additional characters cleaning PS ";" is allowed in the folder name on windows: `>>> x = '{"subjectName": "Financial Analytics (S2-20_PDBACZG517 ? ; : \\ S2-20_PDFTCZG517)","sessionName": "WILP Session"}'

y = json.loads(x) print(y) {'subjectName': 'Financial Analytics (S2-20_PDBACZG517 ? ; : \ S2-20_PDFTCZG517)', 'sessionName': 'WILP Session'} lectures = y subject_name = cleanpath("{subjectName} {sessionName}".format(**lectures)) print(subject_name) Financial Analytics (S2-20_PDBACZG517 ; S2-20_PDFTCZG517) WILP Session `

iamkroot commented 3 years ago

That seems fine, but I'd prefer not renaming the import. The original name is much clearer. Please go ahead and make a PR! :)

brahma19 commented 3 years ago

Sure, was renaming to avoid ambiguity as the method says "sanitize_filename" but this use case is to sanitize the path. PR is available at https://github.com/iamkroot/ilc-scraper/pull/57

brahma19 commented 3 years ago

@iamkroot I have verified the fix and able to download the lectures now. PFB the snapshot for reference. Please proceed with version upgrade and release. image

iamkroot commented 3 years ago

Here's the built binary: https://github.com/iamkroot/ilc-scraper/suites/2922654366/artifacts/65597225 Please verify that this is also working properly.

brahma19 commented 3 years ago

Standalone Executable is working as expected. image