recodehive / Scrape-ML

For new data generation in the Semi-supervised-sequence-learning-Project, we have written a Python script to fetch 📊 data from the 💻 IMDb website 🌐 and convert it into txt files.
https://scrape-ml.streamlit.app/
MIT License
85 stars 116 forks

Exception handling and bug removal #211

Closed MYlab10 closed 2 months ago

MYlab10 commented 2 months ago

Related Issue

[Using an efficient data structure to reduce memory usage and adding exception handling code]

Description

[Removed a bug that caused OSError and PermissionError, and added handling for the case where the output directory already exists so that no exception is raised. The added snippet:

    import os
    os.makedirs('data_scrapped', exist_ok=True)
    df.to_csv('data_scrapped/data_rotten_tomatoes.csv', index=False)
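The directory handling described above can be sketched without third-party packages. This is a minimal illustration, not the PR's actual code: it swaps pandas' `to_csv` for the standard `csv` module, the helper name `save_rows` is hypothetical, and the demo writes into a temporary directory instead of the real `data_scrapped` folder.

```python
import csv
import os
import tempfile


def save_rows(rows, out_dir, filename):
    """Write rows to out_dir/filename as CSV; return the path, or None on failure."""
    try:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, filename)
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
    except OSError as exc:  # PermissionError is a subclass of OSError
        print(f"Could not write {filename!r}: {exc}")
        return None
    return path


# Demo in a throwaway directory: the second call finds the directory
# already present and still succeeds.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data_scrapped")
    rows = [["title", "review"], ["Example Movie", "Great."]]
    first = save_rows(rows, target, "data_rotten_tomatoes.csv")
    second = save_rows(rows, target, "data_rotten_tomatoes.csv")
    assert first == second
```

Note that `exist_ok=True` only covers an existing directory. If a regular file already occupies that path, `os.makedirs` still raises `FileExistsError`, which the `except OSError` branch also catches.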

Also added exception handling for cases where the movie title or review text does not exist:

    def getReviewText(review_url):
        '''Returns the user review text given the review soup.'''
        # find() returns None when no matching tag exists
        tag = review_url.find('p', attrs={'class': 'review-text'})
        if tag:
            return tag.get_text(strip=True)  # strip=True removes surrounding whitespace
        return None  # Handle case where review text is not found

    def getMovieTitle(review_url):
        '''Returns the movie title from the review soup.'''
        tag = review_url.find('title')
        if tag:
            # tag.get_text() also works for an empty <title>, where
            # list(tag.children)[0] would raise IndexError
            title_text = tag.get_text()
            movie_title = title_text.split(' - Movie Reviews | Rotten Tomatoes')[0]
            return movie_title
        return None  # Handle case where title is not found
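Because both helpers return None instead of raising, the calling code has to skip incomplete pages. A minimal sketch of that caller-side check follows. The function `collect_rows` is hypothetical, and plain dicts with lambda extractors stand in for the parsed pages and for `getMovieTitle` / `getReviewText`, so the example runs without Beautiful Soup.

```python
def collect_rows(pages, get_title, get_text):
    """Keep only pages where both the title and the review text were found."""
    rows = []
    skipped = 0
    for page in pages:
        title = get_title(page)
        text = get_text(page)
        if title is None or text is None:
            skipped += 1  # incomplete page: count it and move on
            continue
        rows.append({"title": title, "review": text})
    return rows, skipped


# Stand-in pages (hypothetical data); a missing key plays the role of a missing tag.
pages = [
    {"title": "Example Movie", "review": "Great."},
    {"title": "No Review Movie"},
    {"review": "Orphan review."},
]
rows, skipped = collect_rows(
    pages,
    get_title=lambda page: page.get("title"),
    get_text=lambda page: page.get("review"),
)
print(rows, skipped)
```

Counting the skipped pages, rather than dropping them silently, makes it easier to notice when a site layout change causes most extractions to fail.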

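To use less memory, used set instead of dict.fromkeys() to remove duplicate links (note that a set, unlike dict.fromkeys(), does not preserve the original order of the links):

    # remove duplicate links
    unique_movie_links = list(set(tag['href'] for tag in movie_tags))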
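The two de-duplication approaches differ in more than memory use, which the short comparison below illustrates. The `hrefs` list is made-up sample data, and no memory measurement is attempted here.

```python
# Hypothetical link list with duplicates, in page order.
hrefs = ["/m/alpha", "/m/beta", "/m/alpha", "/m/gamma", "/m/beta"]

# dict.fromkeys keeps the first occurrence of each link, in original order.
ordered_unique = list(dict.fromkeys(hrefs))

# set also removes duplicates, but its iteration order is arbitrary.
unordered_unique = list(set(hrefs))

print(ordered_unique)            # ['/m/alpha', '/m/beta', '/m/gamma']
print(sorted(unordered_unique))  # same links once sorted
```

If later steps rely on the links appearing in page order, `dict.fromkeys()` or an explicit `sorted(...)` on the set result keeps the output deterministic.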

Added pip install textblob to resolve the exception ModuleNotFoundError: No module named 'textblob'.]
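A missing dependency such as textblob can also be detected up front, before the script fails partway through. The sketch below uses the standard library's `importlib.util.find_spec`. The helper names `is_installed` and `missing_modules` are hypothetical, and the check is meant for top-level module names only.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(names):
    """Return the subset of names that cannot be found in this environment."""
    return [name for name in names if not is_installed(name)]


missing = missing_modules(["json", "textblob"])
if missing:
    # Tell the user what to install instead of crashing with ModuleNotFoundError.
    print("Missing packages, try: pip install " + " ".join(missing))
```

Listing the package in the project's requirements file would complement the one-off `pip install`, although whether this repository keeps such a file is not stated here.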

Type of PR

Screenshots / videos (if applicable)

[Attach any relevant screenshots or videos demonstrating the changes] image

Checklist:

Additional context:

[I would also like to add more documentation to the code snippets to help others understand the code better]