ohn0 / youtube-livechat-scraper

grab youtube live chat data from existing VODs
MIT License

State Retention in `LiveChatScraper` Causes Unexpected Behavior #18

Closed repollo closed 1 year ago

repollo commented 1 year ago

Description:

The LiveChatScraper class retains state between successive scrapes, leading to unexpected behavior and potential data corruption. This state retention manifests especially when scraping multiple videos in succession.

Steps to Reproduce:

  1. Instantiate a LiveChatScraper object.
  2. Call the scrape method on a YouTube video with a live chat.
  3. Without creating a new LiveChatScraper object, call the scrape method on another video.
  4. Observe the unexpected behavior and potential errors.

Expected Behavior:

Each call to the scrape method should behave as if it's the first time, with no retained state from previous calls.

Observed Behavior:

State is retained between calls to the scrape method, leading to potential errors and data corruption.

Workaround:

A manual call to the reset method of the LiveChatScraper object can be made between successive scrapes to clear the state. However, this is not intuitive and can easily be missed, leading to issues.

def reset(self):
    self.contentSet = []
    self.currentOffsetTimeMsec = 0
    self.player_state = None
    self.requestor = None
    self.initialization_successful = False

Proposed Solution:

  1. Ensure that all state is cleared at the beginning of the scrape method by calling self.reset() there, and call it again if the scrape fails with an exception:
    except Exception as ex:
        print("scraping failed")
        print(f"Exception encountered: {str(ex)}")
        self.reset()
  2. Alternatively, re-architect the LiveChatScraper class to avoid global state or ensure that a new instance is required for each scrape.
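
A minimal sketch of option 1, assuming the fields listed in the reset method above are the only state that matters. `ResettingScraper` and `_fetch_messages` are hypothetical stand-ins used for illustration, not the library's actual code:

```python
class ResettingScraper:
    """Hypothetical stand-in for LiveChatScraper; not the library's code."""

    def __init__(self):
        # Creating the fields here makes them per-instance from the start.
        self.reset()

    def reset(self):
        # Same fields as the reset method quoted in this issue.
        self.contentSet = []
        self.currentOffsetTimeMsec = 0
        self.player_state = None
        self.requestor = None
        self.initialization_successful = False

    def _fetch_messages(self, video_id):
        # Placeholder for the real network calls; fails on an empty id.
        if not video_id:
            raise ValueError("missing video id")
        return [f"{video_id}: message {i}" for i in range(3)]

    def scrape(self, video_id):
        self.reset()  # start every scrape from a clean state
        try:
            self.contentSet.extend(self._fetch_messages(video_id))
            self.initialization_successful = True
        except Exception as ex:
            print("scraping failed")
            print(f"Exception encountered: {ex}")
            self.reset()  # do not leave partial state behind
        return list(self.contentSet)


scraper = ResettingScraper()
first = scraper.scrape("video-1")
second = scraper.scrape("video-2")  # contains nothing from video-1
failed = scraper.scrape("")         # failure path returns an empty list
```

With this shape the caller never has to remember to call reset between scrapes, which is the main complaint about the workaround.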

Additional Notes:

This issue was identified when scraping multiple videos in succession without manually resetting the state. A temporary fix involving a call to reset was implemented, but a more permanent solution in the library would be beneficial. The issue became more apparent when scraping concurrently.
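
For the concurrent case, one way to sidestep shared state is to give every task its own scraper object. The sketch below assumes state is stored per instance; `FakeScraper`, `scrape_one`, and `scrape_many` are hypothetical names used only for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


class FakeScraper:
    """Hypothetical stand-in for a scraper that keeps per-instance state."""

    def __init__(self):
        self.content_set = []

    def scrape(self, video_id):
        # Placeholder for the real network calls.
        self.content_set.extend(f"{video_id}: message {i}" for i in range(3))
        return self.content_set


def scrape_one(video_id):
    # A fresh instance per task, so no state is shared between threads.
    return video_id, FakeScraper().scrape(video_id)


def scrape_many(video_ids, max_workers=4):
    # pool.map yields results in input order; dict() pairs ids with messages.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(scrape_one, video_ids))


results = scrape_many(["video-1", "video-2", "video-3"])
```

This only helps if the class does not also keep state at class level; in that case separate objects would still see each other's data.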

ohn0 commented 1 year ago

hmmmm I'm going to look at how it's sharing the state.

I see it happening with 2 separate LiveChatScraper objects, gonna investigate.

ohn0 commented 1 year ago

ahhh I see the issue

tee hee forgot how Python treats variables in classes
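
For readers following along: a likely reading of this comment (an assumption, since the thread does not show the class body) is that mutable attributes were declared at class level, which makes every instance share the same object. A small sketch with hypothetical class names:

```python
class SharedStateScraper:
    # Class-level attribute: one list object shared by every instance.
    content_set = []

    def scrape(self, video_id):
        # append mutates the shared class-level list in place.
        self.content_set.append(video_id)
        return self.content_set


class IsolatedStateScraper:
    def __init__(self):
        # Instance-level attribute: each object gets its own list.
        self.content_set = []

    def scrape(self, video_id):
        self.content_set.append(video_id)
        return self.content_set


a, b = SharedStateScraper(), SharedStateScraper()
a.scrape("video-1")
b.scrape("video-2")
shared_result = list(b.content_set)    # holds both ids

c, d = IsolatedStateScraper(), IsolatedStateScraper()
c.scrape("video-1")
d.scrape("video-2")
isolated_result = list(d.content_set)  # holds only d's id
```

This would also explain why the reset workaround helps: assigning `self.contentSet = []` creates an instance attribute that shadows the class-level one.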

ohn0 commented 1 year ago

Alright I have a branch for this on movingSourceFilesToSrcForCleanup

repollo commented 1 year ago

Nice! Thanks!

Want me to close this or want to leave it open till PR?

ohn0 commented 1 year ago

sure you can close it. Thanks for the find!