nianeyna / ao3downloader

Utility for downloading fanfiction in bulk from the Archive of Our Own
GNU General Public License v3.0

Error encountered getting links list / list index out of range #125

Closed · Doranwen closed 6 months ago

Doranwen commented 6 months ago

Tonight I attempted to run the script to get the metadata for all the fics in the Tolkien works tag past a specific date: https://archiveofourown.org/works?work_search%5Bsort_column%5D=revised_at&work_search%5Bother_tag_names%5D=&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=2023-12-17&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=&commit=Sort+and+Filter&tag_id=TOLKIEN+J*d*+R*d*+R*d*+-+Works+*a*+Related+Fandoms

It failed immediately with this error message: Error encountered while getting links list. List may not be complete. list index out of range

If it helps, this is what the log file produced:

    message: Error encountered while getting links list. List may not be complete.
    error: descriptor 'find' for 'str' objects doesn't apply to a 'NoneType' object
    success: false
    timestamp: 02/08/2024, 22:44:39
    stacktrace:
    Traceback (most recent call last):
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/ao3.py", line 61, in get_work_links
        self.get_work_links_recursive(links_list, link, visited_series, metadata)
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/ao3.py", line 90, in get_work_links_recursive
        urls = parse_soup.get_work_and_series_urls(thesoup, self.series)
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_soup.py", line 149, in get_work_and_series_urls
        series_urls = get_series_urls(soup, get_all)
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_soup.py", line 118, in get_series_urls
        return list(dict.fromkeys(list(
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_soup.py", line 120, in <lambda>
        filter(lambda a : is_series(a, get_all, bookmarks),
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_soup.py", line 126, in is_series
        series_number = parse_text.get_series_number(element.get('href'))
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_text.py", line 31, in get_series_number
        return get_digits_after('/series/', link)
      File "/home/doranwen/Programs/Scripts/ao3downloader/ao3downloader-main/ao3downloader/parse_text.py", line 43, in get_digits_after
        index = str.find(url, test)
    TypeError: descriptor 'find' for 'str' objects doesn't apply to a 'NoneType' object
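Reading the bottom of the stacktrace, the failure happens when str.find is called as an unbound method with None in place of a string. A minimal sketch reproducing the same TypeError (the variable names here just mirror the traceback, not the actual source):

    # The traceback shows element.get('href') being passed down to str.find.
    # BeautifulSoup's .get() returns None when the attribute is missing, so an
    # <a> tag with no href ends up as the "self" argument of str.find:
    url = None            # what element.get('href') yields for an href-less anchor
    test = '/series/'
    index = str.find(url, test)
    # TypeError: descriptor 'find' for 'str' objects doesn't apply to a 'NoneType' object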

If it's relevant, I have modified the script to wait 7 seconds between calls to AO3, as recommended in this thread, so the line numbers for repo.py may be slightly off. I've also modified the filenames per this thread, which may shift the line numbers in parse_soup.py as well.

Doranwen commented 6 months ago

This error is obviously linked to some anomaly with a particular work, because now when I try it, it gets all the metadata until it starts on page 16 and then ends abruptly with the message: Error encountered while getting links list. List may not be complete. (There are 129 pages total in that link currently.)

Edit: Because I somehow didn't mention it - all of my other fandoms behaved perfectly fine when I ran updates. Only Tolkien Works is having issues, for some reason.

I'm wondering if it has to do with the fake link in the summary of "at the museum, with you across the way" by poeticmemory. It appears in both lists, and it's odd.

nianeyna commented 6 months ago

Oh. I know what this is. It's an anchor with no href. I missed a null check there. I'm not sure when I'll get a chance to make a commit on this, but if you want to fix it yourself, it should work if you change line 120 of parse_soup.py to

    filter(lambda a : a.get('href') and is_series(a, get_all, bookmarks),
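If it helps to sanity-check the guard, here's a hypothetical standalone sketch. The HTML and the '/series/' substring check are made up for illustration; only the a.get('href') truthiness guard mirrors the actual fix:

    # Hypothetical demonstration of the null check, assuming BeautifulSoup is installed.
    from bs4 import BeautifulSoup

    html = '<a name="bookmark"></a><a href="/series/12345">a real series link</a>'
    soup = BeautifulSoup(html, 'html.parser')

    # Without the guard, a.get('href') is None for the first anchor, and any
    # string operation on it raises the TypeError from the log. Checking the
    # href first filters such anchors out before any parsing runs.
    series = [a for a in soup.find_all('a') if a.get('href') and '/series/' in a['href']]
    print(series)  # [<a href="/series/12345">a real series link</a>]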
Doranwen commented 6 months ago

Yep, that fixed it - it's working now! :D I can finish my updates, lol. Thanks!