neomatrix369 / learning-path-index

A repo with data files, assets and code supporting and powering the Learning Path Index Project
MIT License

Error handling to scrape_journey and option to add url #78

Open asvcode opened 3 weeks ago

asvcode commented 3 weeks ago

This pull request contains two updates:

1) Running `scrape_journey` out of the box raises an IndexError in the details and link sections. This update adds error handling so the extraction no longer crashes:

(lpi) C:\PillView\learning-path-index\app\course-scraper\src>python -m scrapers.google_cloud_skill_boost.scrape_journey
Traceback (most recent call last):
  File "C:\Users\avird\anaconda3\envs\lpi\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\avird\anaconda3\envs\lpi\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\PillView\learning-path-index\app\course-scraper\src\scrapers\google_cloud_skill_boost\scrape_journey.py", line 46, in <module>
    ml_learning_path = extract_ml_learning_path()
  File "C:\PillView\learning-path-index\app\course-scraper\src\scrapers\google_cloud_skill_boost\scrape_journey.py", line 30, in extract_ml_learning_path
    "details": journey.xpath(pages.GCSBLearningJourneyPage.journey_details)[
IndexError: list index out of range
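The crash happens because `journey.xpath(...)` can return an empty list, and indexing `[0]` on it raises IndexError. A minimal sketch of the kind of guard this PR adds (the helper name `first_or_default` is hypothetical; the actual fix in `scrape_journey.py` may be structured differently):

```python
def first_or_default(results, default="N/A"):
    """Return the first XPath match, or a default when the page yields none.

    Hypothetical helper illustrating the error-handling approach: instead of
    results[0] (which raises IndexError on an empty list), fall back to a
    default value so the scraper keeps going when a section is missing.
    """
    return results[0] if results else default
```

Used at the failing line, `first_or_default(journey.xpath(pages.GCSBLearningJourneyPage.journey_details))` would yield `"N/A"` instead of crashing when the details section is absent.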

2) Adds the option to enter the training URL when running the command, i.e. running `python -m scrapers.scrape_journey` now prompts: `Please enter the GCSB Journey URL:`
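A sketch of what such a prompt might look like (the function name and the `reader` parameter are illustrative, not the PR's actual code; only the prompt text comes from the PR):

```python
def prompt_for_journey_url(reader=input):
    """Ask the user for the GCSB Journey URL and strip stray whitespace.

    `reader` defaults to the built-in input(); it is injectable here purely
    so the prompt can be exercised without a terminal.
    """
    return reader("Please enter the GCSB Journey URL: ").strip()
```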

Summary by Sourcery

Implement error handling in the scrape_journey function to prevent IndexError and add functionality for user input of the GCSB Journey URL.

Bug Fixes:

- Handle the IndexError raised when journey details and links cannot be extracted from the page.

Enhancements:

- Prompt the user to enter the GCSB Journey URL when running the script.

sourcery-ai[bot] commented 3 weeks ago

Reviewer's Guide by Sourcery

This pull request implements error handling for the scrape_journey function and adds an option for users to input the URL when running the script. The changes focus on improving the robustness of the data extraction process and enhancing user interaction.

File-Level Changes

Implemented error handling for data extraction
  • Added try-except blocks to handle potential IndexErrors when extracting journey details and links
  • Provided default values for cases where data is missing or cannot be extracted
  • Improved robustness of title and description extraction with fallback values
app/course-scraper/src/scrapers/google_cloud_skill_boost/scrape_journey.py
Added user input for GCSB Journey URL
  • Implemented a prompt for users to enter the GCSB Journey URL when running the script
  • Modified the main execution block to use the user-provided URL
app/course-scraper/src/scrapers/google_cloud_skill_boost/scrape_journey.py
Improved CSV writing process
  • Added a check to ensure data is not empty before writing to CSV
  • Implemented error handling for the CSV writing process
  • Added feedback messages for successful writing and error cases
app/course-scraper/src/scrapers/google_cloud_skill_boost/scrape_journey.py
Refactored function signature and execution flow
  • Modified extract_ml_learning_path function to accept GCSB_JOURNEY_URL as a parameter
  • Moved the main execution logic into an `if __name__ == "__main__":` block
  • Removed global variable usage for ml_learning_path
app/course-scraper/src/scrapers/google_cloud_skill_boost/scrape_journey.py
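The refactor described above can be sketched as follows (function bodies and the `prompt` parameter are placeholders; only the signature change, the prompt text, and the removal of the global are taken from the guide):

```python
def extract_ml_learning_path(journey_url: str) -> list[dict]:
    """Hypothetical refactored signature: the URL is a parameter, not a global.

    Placeholder body; the real scraper fetches and parses the page at
    journey_url via XPath.
    """
    return [{"url": journey_url}]

def main(prompt=input):
    # In the actual script this runs under `if __name__ == "__main__":`;
    # it prompts for the URL and passes it in, replacing the old
    # module-level ml_learning_path global.
    journey_url = prompt("Please enter the GCSB Journey URL: ")
    return extract_ml_learning_path(journey_url)
```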
