nianeyna / ao3downloader

Utility for downloading fanfiction in bulk from the Archive of Our Own
GNU General Public License v3.0

dict contains fields not in fieldnames: 'error' #110

Closed Kyther closed 10 months ago

Kyther commented 1 year ago

I get this error occasionally when grabbing metadata from very large listings. In one case, I tried grabbing the metadata for the Tolkien & Related Works tag, no exclusions. Mind you, the tag has over 5000 pages, and AO3 doesn't paginate beyond 5000, so I set the number of pages to 5000 so the run would stop at that point (planning to go back later and grab anything dated before the last works the first run reached on page 5000). It gave this error after it finished.
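(For context: that message looks like the one Python's csv.DictWriter raises when a row dict contains a key that isn't in the declared fieldnames. A minimal repro of the message itself, just my guess at where it comes from rather than the tool's actual code:)

```python
import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author"])
writer.writeheader()
# A row carrying an unexpected 'error' key raises:
# ValueError: dict contains fields not in fieldnames: 'error'
writer.writerow({"title": "x", "author": "y", "error": "oops"})
```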

I thought nothing of it until I ran it on a search results page that had 65k results. When it finished, it gave the error as well. It said it went through all 3.2k pages, but the resulting csv was way too small. So I opened it to check, and it had only 2169 fics listed in it! What's going on there?

If it helps, this is the search string: https://archiveofourown.org/works/search?work_search%5Bquery%5D=podfic+NOT+%22podfic+welcome%22&work_search%5Btitle%5D=&work_search%5Bcreators%5D=&work_search%5Brevised_at%5D=&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=&work_search%5Bsingle_chapter%5D=0&work_search%5Bword_count%5D=&work_search%5Blanguage_id%5D=&work_search%5Bfandom_names%5D=&work_search%5Brating_ids%5D=&work_search%5Bcharacter_names%5D=&work_search%5Brelationship_names%5D=&work_search%5Bfreeform_names%5D=&work_search%5Bhits%5D=&work_search%5Bkudos_count%5D=&work_search%5Bcomments_count%5D=&work_search%5Bbookmarks_count%5D=&work_search%5Bsort_column%5D=revised_at&work_search%5Bsort_direction%5D=desc&commit=Search

EDIT: It also fails on the One Direction (Band) tag. The only commonality I can find so far is that these are all links with more than 3000 pages, but since I don't know why it's happening, I can't assume size is the reason. Any ideas?

EDIT #2: I can confirm that size alone isn't the issue; I've successfully downloaded metadata for 2500 pages and had it fail on a link with under 2000 pages. I took the search string above and added a time filter, making two versions: < 4 years and > 4 years. Running the > 4 years version gave the same error, with an under-3 MB result for over 1700 pages. I also tried grabbing the metadata for Supernatural from the beginning up through 2012 (under 1600 pages) and got the same error: 7250 fics instead of the 31k it should have grabbed.

I'm now starting to wonder just what this is, and getting a bit frustrated, as every large fandom I try fails, even when I get it down to just a few thousand pages. I've looked at the log file, but nothing shows up there; it just lists the pages as it downloads them (or attempts to download them). So I really have no idea what this is or how to fix it, and it has completely blocked me from grabbing metadata for the big fandoms I wanted.

Has anyone else gotten this? When did you get it? Does anyone have any ideas?

EDIT #3: I may have identified what is causing the error. Using csvtool to rip the list of links out of the csv, I checked how many links it actually saved for a run that should have had 11,858 fics and gave the error. The resulting txt file had 7250 fic links, so I browsed through the pages until I found where fic number 7250 was located, and that work has another work linked in its summary. I'm guessing that threw off the metadata collection.
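(If anyone wants to do the same check without csvtool, the equivalent in Python would be something like the following; the "link" column name is my guess, so adjust it to whatever header the csv actually uses:)

```python
import csv

# Pull one column out of the metadata csv into a plain text file of links.
with open("metadata.csv", newline="", encoding="utf-8") as f, \
     open("links.txt", "w", encoding="utf-8") as out:
    for row in csv.DictReader(f):
        out.write(row["link"] + "\n")  # hypothetical column name
```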

Below is the link to the exact page so you can see the fic in context:

https://archiveofourown.org/tags/Supernatural%20(TV%202005)/works?commit=Sort+and+Filter&page=363&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=&work_search%5Bdate_from%5D=2012-01-01&work_search%5Bdate_to%5D=2012-12-31&work_search%5Bexcluded_tag_names%5D=&work_search%5Blanguage_id%5D=&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Bsort_column%5D=revised_at&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=

As this list is bracketed on both sides by date constraints, which fic is number 7250 shouldn't shift, but in case it does, look for the work titled The Winchester Road. I'm reasonably sure that their link to the podfic in the summary is what's causing the problems.

The question is: how can this be fixed so it's not an issue for large fandom metadata grabs, which (being large) run a greater risk of someone having linked another work in their work's summary?
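One plausible fix (a sketch based on my guess at how the scraping works, not the actual ao3downloader code) would be to take the work link only from each blurb's heading instead of collecting every /works/ link on the page, so links embedded in user-written summaries get skipped. Something like this, assuming BeautifulSoup and AO3's standard listing markup:

```python
from bs4 import BeautifulSoup

AO3_BASE = "https://archiveofourown.org"

def get_work_links(html: str) -> list[str]:
    """Collect one work link per blurb, ignoring links inside summaries."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for blurb in soup.select("li.work.blurb"):  # one <li> per listed work
        # The title link lives in the blurb's h4.heading; links inside
        # the user-written summary blockquote are never touched.
        title_link = blurb.select_one("h4.heading a[href^='/works/']")
        if title_link:
            links.append(AO3_BASE + title_link["href"])
    return links
```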

EDIT #4: I asked a friend with Python experience, and they suggested the following patch:

ao3downloader_getlinks.txt

(It was named with a .patch extension, but GitHub doesn't support that for attachments, so I just changed it to .txt since it's a text file anyway.)

Patch actions/getlinks.py with that and it actually works without giving me the error, and all the fics were there!
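For anyone who can't apply the attachment, a defensive band-aid on the writing side (again a sketch, not what the friend's patch does) would be to construct the DictWriter with extrasaction="ignore", so unexpected keys like 'error' get dropped instead of raising:

```python
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
```

Note that this only suppresses the symptom; filtering out the bogus summary links, as the patch apparently does, is the proper fix.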