nianeyna / ao3downloader

Utility for downloading fanfiction in bulk from the Archive of Our Own
GNU General Public License v3.0

Link with a lot of exclusions downloads more pages than it should #105

Closed by Kyther 10 months ago

Kyther commented 1 year ago

I tried a link to a fandom where I had set a LOT of exclusions. Probably 40 or 50 fandoms, plus a couple of ships and about a dozen additional tags (4496 characters for the entire URL string). The browser seemed to load the page fine and reduced the total number of fics by the correct amount as I added each additional fandom. At the end it said 206 pages for all the fics (about 4.1k fics).

And then I ran the link through to grab metadata only (I extract the URLs from that and run them through separately) and it kept going, and going, and going. It went past 600 pages and wouldn't stop. I have no idea what it's downloading because it doesn't write to a file as it goes along, which would really help with troubleshooting like this. I finally hit Ctrl+C to kill it because I had no idea how long it would go on - and it clearly wasn't grabbing what I wanted.

Is there a hard limit to the number of characters in the link that it can process, or is something else going on here? (Or is there any way to have it write to the file as it goes so I can see what it's actually getting?)

verotheelf commented 1 year ago

I believe there's actually a character limitation for a command line input, not the program itself

nianeyna commented 1 year ago

You should be able to see the page URLs it crawled in the log; that will tell you if the input was truncated. I didn't realize there was a limit on command line input length, although it makes sense that there would be. That said, a quick google suggests 4500 characters should be under the limit, so I suspect there's something else going on here.

Kyther commented 1 year ago

I forgot all about the log. headdesk

OK, looking at it, it appears it did truncate - cut off the last 401 characters of the URL (though it did add the &page=2, &page=3, etc.). I have no idea why 4095 would be the magic cutoff, though! It seems a very odd number to cut off at.

Given that, it appears it was grabbing the metadata for every item on the archive minus the exclusions that made it past the cutoff - NOT limited to the fandom I wanted, LOL. I think it would've run that one for days. But I removed all the excluded fandoms and added back in a smaller number of them (which added in just a couple hundred fics I didn't want), and that fortunately worked fine.
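
A guard along these lines would flag an over-long link before any crawling starts. It is only a sketch, not ao3downloader's code: the function name and the constant are made up for illustration, and the 4095 figure is simply the cutoff seen in the log here (4496 - 401 = 4095). It also assumes the limit is counted in bytes rather than characters.

```python
# Hypothetical helper, not part of ao3downloader.
# 4095 is the cutoff observed in this thread's log (4496 - 401 characters).
OBSERVED_CUTOFF = 4095


def looks_truncated(pasted: str, cutoff: int = OBSERVED_CUTOFF) -> bool:
    """Return True if the pasted line is long enough that it may have been cut off."""
    # Assumption: the limit is counted in bytes, so measure the encoded length.
    return len(pasted.encode("utf-8")) >= cutoff


# Example use after reading the link from a prompt:
#     link = input("Paste the AO3 link: ")
#     if looks_truncated(link):
#         print("Warning: the link may have been cut off; check the log.")
```

A link that arrives at exactly the cutoff length is the telltale sign, since anything longer would have been cut down to that length before the program ever saw it.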

verotheelf commented 1 year ago

Oops, I'm not used to programs that read input at a prompt instead of taking command-line arguments. There is a much smaller character limit in this case, and you identified it accurately: in canonical mode the Linux terminal caps a line at 4096 characters including the newline, which leaves 4095 for the link. If I remember correctly, you use Linux? If so, there seems to be some information out there about how to change the default. Here's one guide: https://stackoverflow.com/questions/18015137/linux-terminal-input-reading-user-input-from-terminal-truncating-lines-at-4095

Mac user myself so that's a smidge beyond my understanding

Kyther commented 1 year ago

Ohhhh, suddenly it makes more sense. Yeah, Linux here - and under 4096 makes sense with that number being a power of 2, now the number doesn't look so random, lol.

Eh, I'll just make sure I don't hit that limit in the future - the workaround looked a bit unclear and it's unlikely I'll actually run into it all that much. I mean, this is the first time I have and I'm nearly done with all the fandoms, so. I just need to not exclude everything under the sun, lol. Easier to just grab a little more, even if I won't read it.

verotheelf commented 1 year ago

I found a simpler fix. Apparently you just need to put your terminal in noncanonical mode before running the program by entering stty -icanon. When you're done, change it back with stty icanon.
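
For anyone who would rather do the same thing from inside a Python script, here is a rough sketch using the standard termios module (Unix only). It is not how ao3downloader reads its input; clear_icanon, noncanonical, and read_long_line are hypothetical names chosen for illustration. The idea mirrors the stty commands: switch ICANON off, read the line, then restore the original settings.

```python
import copy
import sys
import termios
from contextlib import contextmanager


def clear_icanon(attrs):
    """Return a copy of a termios attribute list with canonical mode switched off."""
    new_attrs = copy.deepcopy(attrs)
    new_attrs[3] &= ~termios.ICANON   # index 3 holds the local-mode flags (lflag)
    new_attrs[6][termios.VMIN] = 1    # block until at least one byte is available
    new_attrs[6][termios.VTIME] = 0   # no read timeout
    return new_attrs


@contextmanager
def noncanonical(fd):
    """Temporarily put the terminal behind fd into noncanonical mode."""
    original = termios.tcgetattr(fd)
    try:
        termios.tcsetattr(fd, termios.TCSANOW, clear_icanon(original))
        yield
    finally:
        # Always restore the saved settings, even if reading fails.
        termios.tcsetattr(fd, termios.TCSADRAIN, original)


def read_long_line(prompt):
    """Read one line from stdin without the canonical-mode length cap."""
    if not sys.stdin.isatty():
        return input(prompt)  # piped input is not subject to the terminal's cap
    print(prompt, end="", flush=True)
    with noncanonical(sys.stdin.fileno()):
        return sys.stdin.readline().rstrip("\n")
```

One trade-off to keep in mind: with canonical mode off, the terminal no longer handles line editing, so backspace will not erase what was typed. That is fine for pasting a link but awkward for typing one by hand.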