nosmokingbandit / Watcher3

iMDB Sync Issues #157

Closed pequalsmp closed 6 years ago

pequalsmp commented 6 years ago

After noticing that a couple of movies were missing from my library, I checked the logs and found the following error, which is logged every time there's an attempt to sync the iMDB RSS feed:

TypeError: strptime() argument 1 must be str, not None
    last_sync = datetime.strptime(last_sync, self.date_format)
  File "/Watcher3/core/rss/imdb.py", line 53, in get_rss
    self.task()
  File "/Watcher3/core/cp_plugins/taskscheduler.py", line 254, in _task
Traceback (most recent call last):

pequalsmp commented 6 years ago

The key exists in the record with a stored value of None, so .get() returns None and the default is never used in the following:

last_sync = record.get(list_id, 'Sat, 01 Jan 2000 00:00:00 GMT')
last_sync = datetime.strptime(last_sync, self.date_format)
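
A minimal sketch of that behaviour and one possible guard, assuming record is a plain dict loaded from the database. The names DATE_FORMAT, DEFAULT_SYNC, and last_sync_for are hypothetical, and the format string is only inferred from the default value above, not taken from Watcher3's code:

```python
from datetime import datetime

# Assumed format, inferred from the default 'Sat, 01 Jan 2000 00:00:00 GMT'
DATE_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'
DEFAULT_SYNC = 'Sat, 01 Jan 2000 00:00:00 GMT'


def last_sync_for(record, list_id):
    """Return the last sync time, treating a stored None like a missing key."""
    # dict.get() only uses its default when the key is absent; `or` also
    # covers a key that is present but holds None (or an empty string).
    last_sync = record.get(list_id) or DEFAULT_SYNC
    return datetime.strptime(last_sync, DATE_FORMAT)


record = {'ls000000001': None}  # hypothetical list id with a null value
print(record.get('ls000000001', DEFAULT_SYNC))  # None, the default is skipped
print(last_sync_for(record, 'ls000000001'))     # 2000-01-01 00:00:00
```

This is not necessarily how the project should handle it, only an illustration of why the default never kicks in.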

I'm not sure how the value was set to null in the database in the first place, but manually updating the value worked fine.

Not sure if it's a valid issue, feel free to re-open if necessary.

pequalsmp commented 6 years ago

It appears that parse_build_date fails to parse the lastBuildTime sometimes (malformed XML?).

Is there a reason why the response is used as the time source? Wouldn't it be simpler to just use the computer's clock as the source for the last sync time?
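
A rough sketch of what I mean, not a patch against the actual parse_build_date. It assumes the feed carries the standard RSS 2.0 lastBuildDate element under channel and that the stored format matches the default date string seen earlier; build_date_or_now and DATE_FORMAT are hypothetical names:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed format, matching strings like 'Sat, 01 Jan 2000 00:00:00 GMT'
DATE_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'


def build_date_or_now(feed):
    """Return the feed's lastBuildDate text, or the current UTC time as a string."""
    try:
        root = ET.fromstring(feed)
        node = root.find('./channel/lastBuildDate')
        if node is not None and node.text:
            return node.text.strip()
    except ET.ParseError:
        pass  # malformed XML: fall through to the local clock
    # Fallback: never return None, so the next strptime() call cannot fail on it
    return datetime.now(timezone.utc).strftime(DATE_FORMAT)
```

Either way the stored value is always a string, so a bad response can no longer write null into the record.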

barbequesauce commented 6 years ago

Experiencing the same here, for what it’s worth. Pasting from the onscreen log if you’re wondering why it looks backwards...:

TypeError: must be str, not None
    last_sync = datetime.strptime(last_sync, self.date_format)
  File "/opt/Watcher3/core/rss/imdb.py", line 53, in get_rss
    self.task()
  File "/opt/Watcher3/core/cp_plugins/taskscheduler.py", line 254, in _task
Traceback (most recent call last):
WARNING 2017-12-02 19:47:19,381 CPTaskScheduler._task: Scheduled Task IMDB Sync Failed:

nosmokingbandit commented 6 years ago

Should be fixed in 80b56ede1d7e895f160de36fff8086f07879b759. A slight misunderstanding on my part about exactly how {}.get() works.

barbequesauce commented 6 years ago

Still experiencing errors related to IMDB... log posted from screen so reverse order.

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 29, column 1170
  File "<string>", line None
    parser.feed(text)
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 1325, in XML
    root = ET.fromstring(feed)
  File "/opt/Watcher3/core/rss/imdb.py", line 109, in parse_build_date
    record[list_id] = self.parse_build_date(response)
  File "/opt/Watcher3/core/rss/imdb.py", line 55, in get_rss
    self.task()
  File "/opt/Watcher3/core/cp_plugins/taskscheduler.py", line 254, in _task
Traceback (most recent call last):
WARNING [2017-12-17 18:30:14,904] CPTaskScheduler._task.256: Scheduled Task IMDB Sync Failed:

nosmokingbandit commented 6 years ago

IMDB disabled rss lists. I don't know if they intend to bring them back or not.

barbequesauce commented 6 years ago

Welp... to trakt we go, I guess. Any thought on allowing other lists besides defaults?

nosmokingbandit commented 6 years ago

Several people have asked about it and my answer is always the same. I don't have a Trakt account and I'm not going to pay for one. To add that functionality I need a copy of an rss feed so I can know how to parse it. If anyone with a Trakt account sends me a copy of their rss feed (the actual rss contents, not the url) I can add it relatively quickly.

watchernzb@gmail.com

pequalsmp commented 6 years ago

The issue is that iMDB now uses React and preloads the initial state, so you have to extract it from the HTML. It's doable with something like Scrapy and Splash, but this would add new dependencies.

In the meantime, a workaround (actually a hack, a really nasty hack) can be:

import re
import urllib.request

# Replace urXXXXXXXX with the iMDB user id
url = "http://www.imdb.com/user/urXXXXXXXX/watchlist"

# Decode the response so the regexes work on text rather than on repr(bytes)
content = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# Get the IMDbReactInitialState, which contains an array with the user's movies
initial_state = re.findall(r"IMDbReactInitialState\.push\((.*?)\);\s*\n", content, re.DOTALL)

for state in initial_state:
    # Look for imdb ids in the initial state
    for id_match in re.finditer(r"tt\d{7}", state):
        print(id_match.group())

This example will filter the ids from a user's Watchlist, which can be used later on to get more info. I'm not sure if iMDB prevents crawling or how long this might work, but it seems iMDB is hellbent on making sure you're using their paid API even for benign functionality like Watchlist.

barbequesauce commented 6 years ago

Sample watchlist sent in email.

nosmokingbandit commented 6 years ago

@enilfodne

I try to avoid scraping if at all possible. It is easy to break or cause all sorts of other weird problems. If IMDB decides they are killing rss forever I'll probably look into downloading the list csv and parsing that instead.

@barbequesauce

Got it, thanks!

barbequesauce commented 6 years ago

Thank you for jumping on this! Looks great.

nosmokingbandit commented 6 years ago

I'm closing this. As of today IMDB still has rss disabled. I may look at using the csv to sync in the future, but that is not a project for today.
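
For reference, a rough sketch of what the csv approach could look like. This is not implemented in Watcher3; ids_from_csv is a hypothetical helper, and because the exact columns of the export are not confirmed here, it scans every cell for something that looks like an IMDB id instead of relying on a column name. The sample data and its header names are made up for illustration:

```python
import csv
import io
import re

# IMDB ids are 'tt' followed by at least seven digits
IMDB_ID = re.compile(r'tt\d{7,}')


def ids_from_csv(csv_text):
    """Return unique IMDB ids found in any cell, in order of first appearance."""
    seen = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            for imdb_id in IMDB_ID.findall(cell):
                if imdb_id not in seen:
                    seen.append(imdb_id)
    return seen


# Made-up sample; the real export may use different headers
sample = (
    'Position,Const,Title\n'
    '1,tt0111161,The Shawshank Redemption\n'
    '2,tt0068646,The Godfather\n'
)
print(ids_from_csv(sample))  # ['tt0111161', 'tt0068646']
```

Scanning all cells is slower than reading one column, but it keeps working if the column layout changes.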