samedtorunn / opinion_mining

Opinion Mining Tool for SWE599

Increase the post numbers to be analyzed in the data collection step #42

Open samedtorunn opened 1 year ago

samedtorunn commented 1 year ago

In the demo, only a few posts were available for sentiment analysis even though there are many posts about the queried keyword. Check whether there is a problem with the fetching mechanism, and fetch all posts that are relevant to the analysis.

samedtorunn commented 1 year ago

The Python Reddit API Wrapper (PRAW) has certain limitations that have been reducing the efficiency of our data extraction process:

Post Fetching Limit: PRAW allows fetching only 100 posts at a time. If the script is run twice, it returns the same set of 100 posts, without an option to get a different set.

Request Rate Limit: PRAW allows a maximum of 30 requests per minute. This also hampers our ability to fetch more data within a short period.

No Specific Time-Filtered Search: PRAW does not provide an option to search within a specific time range. The system therefore has to fetch 100 posts first and then apply the time filter, which is inefficient and can lose data, because posts inside the desired range that are not among the 100 fetched are never seen.
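The fetch-then-filter step described above could look roughly like the sketch below. It assumes each fetched post exposes a `created_utc` epoch timestamp, as PRAW submissions do; the function name `filter_by_time` and the stand-in post objects are hypothetical and are not taken from this repository.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def filter_by_time(posts, start, end):
    """Keep posts whose created_utc lies in the half-open range [start, end)."""
    start_ts = start.timestamp()
    end_ts = end.timestamp()
    return [p for p in posts if start_ts <= p.created_utc < end_ts]


def utc(year, month, day):
    """Build a timezone-aware UTC datetime."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Stand-in objects (hypothetical data) in place of already fetched posts.
posts = [
    SimpleNamespace(id="a", created_utc=utc(2023, 1, 5).timestamp()),
    SimpleNamespace(id="b", created_utc=utc(2023, 3, 1).timestamp()),
    SimpleNamespace(id="c", created_utc=utc(2023, 1, 31).timestamp()),
]

window = filter_by_time(posts, utc(2023, 1, 1), utc(2023, 2, 1))
print([p.id for p in window])  # ['a', 'c']
```

Filtering on the client side like this cannot recover posts that were never returned in the first place, which is the data loss mentioned above.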

A possible workaround could involve the use of an alternative API called PushShift, which seems to be less restrictive:

Higher Post Fetch Limit: It allows fetching up to 500 posts at a time.

Time Filter Available: It supports time-filtered searches, which is a significant advantage over PRAW.

However, PushShift has its own limitation: it provides post titles but not the content.
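As a rough illustration of such a time-filtered query, the sketch below only builds a request URL and sends nothing. The endpoint and the parameter names `q`, `after`, `before` and `size` are assumptions based on Pushshift's historically public submission search and should be checked against its current documentation and access rules; `build_search_url` and the keyword are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint; it may have changed or may require authorization.
PUSHSHIFT_SUBMISSION_SEARCH = "https://api.pushshift.io/reddit/search/submission"


def build_search_url(keyword, start, end, size=500):
    """Build a keyword search URL limited to the period between start and end."""
    params = {
        "q": keyword,
        "after": int(start.timestamp()),   # epoch seconds, lower bound
        "before": int(end.timestamp()),    # epoch seconds, upper bound
        "size": size,                      # 500 is the limit cited in this thread
    }
    return f"{PUSHSHIFT_SUBMISSION_SEARCH}?{urlencode(params)}"


url = build_search_url(
    "electric cars",  # hypothetical keyword
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 2, 1, tzinfo=timezone.utc),
)
print(url)
```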

One potential solution could involve using PushShift to fetch post IDs, and then using PRAW to fetch the content for each post. However, we would still need to respect the 30 requests per minute limit imposed by PRAW. This might result in pausing after every 29-30 posts, or introducing a few minutes of delay to let the API fetch data gradually.
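The pausing behaviour described above can be sketched as a simple fixed-window throttle. This is a minimal sketch, not the project's implementation: `fetch_with_rate_limit` and `fetch_one` are hypothetical names, and the limit of 30 calls per minute is the figure quoted in this thread. The `sleep` and `clock` parameters are injectable only so that the demonstration runs instantly.

```python
import time


def fetch_with_rate_limit(post_ids, fetch_one, max_per_minute=30,
                          sleep=time.sleep, clock=time.monotonic):
    """Call fetch_one(post_id) for every ID, making at most
    max_per_minute calls in each fixed 60-second window."""
    results = {}
    window_start = clock()
    calls_in_window = 0
    for post_id in post_ids:
        if calls_in_window >= max_per_minute:
            # Budget used up: wait for the rest of the window, then start a new one.
            elapsed = clock() - window_start
            if elapsed < 60:
                sleep(60 - elapsed)
            window_start = clock()
            calls_in_window = 0
        results[post_id] = fetch_one(post_id)
        calls_in_window += 1
    return results


# Demonstration with a fake clock and a fake fetcher (no network access).
fake_now = [0.0]
pauses = []


def fake_sleep(seconds):
    pauses.append(seconds)
    fake_now[0] += seconds


def fake_fetch(post_id):
    fake_now[0] += 0.1  # pretend each request takes 0.1 s
    return f"body of {post_id}"


ids = [f"id{i}" for i in range(65)]
bodies = fetch_with_rate_limit(ids, fake_fetch,
                               sleep=fake_sleep, clock=lambda: fake_now[0])
print(len(bodies), len(pauses))  # 65 2
```

With a configured `praw.Reddit` instance (here assumed to be called `reddit`), `fetch_one` could be something like `lambda pid: reddit.submission(id=pid).selftext`, so that the IDs come from PushShift and the post content comes from PRAW.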

samedtorunn commented 1 year ago

I tried Twigly's API manually (Twigly Websites).

It did not fetch enough posts; in fact, it returned far fewer posts for the same search.