For new data generation in the Semi-supervised-sequence-learning-Project, we have written a Python script that fetches data from the IMDb website and converts it into txt files.
Related Issue
[Use a more efficient data structure to reduce memory usage and add exception handling code]
Description
[Fixed a bug that caused OSError and PermissionError, and added handling for the case where the output directory already exists, by adding this snippet:

    import os

    # Create the output directory; exist_ok=True avoids an error if it already exists
    os.makedirs('data_scrapped', exist_ok=True)
    df.to_csv('data_scrapped/data_rotten_tomatoes.csv', index=False)

Also added handling for cases where the movie title or the review text doesn't exist:

    def getReviewText(review_url):
        '''Returns the user review text given the review soup.'''
        # Find the review paragraph by its CSS class
        tag = review_url.find('p', attrs={'class': 'review-text'})
        if tag:
            return tag.get_text(strip=True)  # strip=True removes extra whitespace
        return None  # Handle case where review text is not found

    def getMovieTitle(review_url):
        '''Returns the movie title from the review soup.'''
        tag = review_url.find('title')
        if tag:
            title_tag = list(tag.children)[0].get_text()
            movie_title = title_tag.split(' - Movie Reviews | Rotten Tomatoes')[0]
            return movie_title
        return None  # Handle case where title is not found

To use less memory, a set is used instead of dict.fromkeys() to remove duplicates:

    # remove duplicate links
    unique_movie_links = list(set(tag['href'] for tag in movie_tags))

To fix the "ModuleNotFoundError: No module named 'textblob'" exception, added pip install textblob.]
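The directory handling and de-duplication changes described above can be illustrated with a small standard-library sketch. The helper names `ensure_output_dir`, `unique_links_unordered` and `unique_links_ordered`, as well as the sample links, are hypothetical and not part of this PR. The sketch also shows one trade-off: a `set` does not keep the original link order, while `dict.fromkeys()` does.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create `path` if it is missing; a repeated call does not raise."""
    os.makedirs(path, exist_ok=True)
    return path


def unique_links_unordered(links):
    """De-duplicate with a set (the order of the result is not guaranteed)."""
    return list(set(links))


def unique_links_ordered(links):
    """De-duplicate with dict.fromkeys(), keeping first-seen order."""
    return list(dict.fromkeys(links))


if __name__ == "__main__":
    # Hypothetical sample data standing in for scraped hrefs.
    links = ["/m/a", "/m/b", "/m/a", "/m/c", "/m/b"]

    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "data_scrapped")
        ensure_output_dir(out_dir)
        ensure_output_dir(out_dir)  # second call is a no-op, no exception
        print(os.path.isdir(out_dir))

    print(sorted(unique_links_unordered(links)))
    print(unique_links_ordered(links))
```

Note that `exist_ok=True` only suppresses the error raised when the directory already exists; a permission problem on the parent directory, or a regular file already occupying that path, would still raise an exception.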
Type of PR
[X] Bug fix
[X] Feature enhancement
[ ] Documentation update
[ ] Other (specify): ___
Screenshots / videos (if applicable)
[Attach any relevant screenshots or videos demonstrating the changes]
Checklist:
[X] I have performed a self-review of my code.
[X] I have read and followed the Contribution Guidelines.
[X] I have tested the changes thoroughly before submitting this pull request.
[X] I have provided relevant issue numbers, screenshots, and videos after making the changes.
[X] I have commented my code, particularly in hard-to-understand areas.
Additional context:
[I would also like to add more documentation to the code snippets to help others understand the code better.]