Closed jwzimmer-zz closed 3 years ago
Command used: wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k tvtropes.org
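For readability, here is the same command with the flags broken out and commented (nothing changed, just reformatted; the REJECT/CMD variable names are only for this sketch):

```shell
# -r            : recursive download
# -Q2000m       : stop after a ~2000 MB total download quota
# -R<list>      : reject files whose names match these suffixes
# --limit-rate  : cap bandwidth at 100 KB/s to go easy on the server
REJECT=".gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico"
CMD="wget -r -Q2000m -R${REJECT} --limit-rate=100k tvtropes.org"
echo "$CMD"
```

Note that -R.php doesn't block the trope pages themselves: their URLs look like pmwiki/pmwiki.php/Main/Foo, so the final path component has no .php suffix.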
Process finally stopped:
--2021-01-08 20:54:50-- https://tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson
Reusing existing connection to tvtropes.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson’
tvtropes.org/pmwiki [ <=> ] 98.67K 103KB/s in 1.0s
2021-01-08 20:54:51 (103 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson’ saved [101038]
FINISHED --2021-01-08 20:54:51--
Total wall clock time: 1d 9h 28m 26s
Downloaded: 17302 files, 2.0G in 1d 8h 21m 33s (17.6 KB/s)
Download quota of 2.0G EXCEEDED!
This is a ton of files. And since it's ~17000, it definitely doesn't include every trope. So maybe I need to be more specific in which folders I download. I definitely want Main, since we know most of the tropes are there, but in order to get tropes that aren't there, which folders do I need?
E.g. https://github.com/jwzimmer/tv-tropening/blob/main/tvtropes.org/pmwiki/pmwiki.php/BigBad/Literature is about the BigBad trope, so we might want such folders even though they aren't in Main. On the other hand, we probably don't need AlliterativeName (https://github.com/jwzimmer/tv-tropening/blob/main/tvtropes.org/pmwiki/pmwiki.php/AlliterativeName/TropesAToE).
I don't think there's an obvious way to distinguish between folders we want and folders we don't care about... it looks like it has to be decided on a case by case basis.
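Even if the keep/skip decision is case by case, the candidate folders can at least be enumerated: the namespace is the first path segment after pmwiki.php/. A sketch that counts pages per namespace (the example paths below are hypothetical; on a real mirror you'd replace the printf with `find tvtropes.org/pmwiki/pmwiki.php -type f`):

```shell
# Count mirrored pages per namespace (the segment after pmwiki.php/).
# Example paths are stand-ins for a real wget download tree.
counts=$(printf '%s\n' \
    tvtropes.org/pmwiki/pmwiki.php/Main/BigBad \
    tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson \
    tvtropes.org/pmwiki/pmwiki.php/Literature/ABrothersPrice \
    tvtropes.org/pmwiki/pmwiki.php/BigBad/Literature \
  | sed 's|.*pmwiki\.php/||' | cut -d/ -f1 | sort | uniq -c | sort -rn)
echo "$counts"
```

Sorting by count would at least surface which namespaces dominate the mirror before deciding which ones to keep.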
To start with, I probably want at least: Literature, Main
For whatever reason, although I thought passing wget the path ending in Literature/ would let me download every file in that directory, it only fetched index.html. So I passed in Literature/ABrothersPrice instead, and that did seem to work: I ended up with about 1490 files. I tried again with another book title, wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k --no-parent tvtropes.org/pmwiki/pmwiki.php/Literature/ABootStompingAHumanFaceForever, which returned a vastly different number of files (about 11000). So I'm not sure what to make of that.
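One way to make sense of runs with such different file counts would be to diff the sorted file lists of the two download trees with comm, which shows exactly which pages one run picked up and the other missed. A self-contained sketch (run1/ and run2/ are dummy stand-ins for the real wget output directories):

```shell
# Compare which pages two scrape runs picked up, using comm on
# sorted file lists. run1/ and run2/ stand in for two wget trees.
mkdir -p run1/Literature run2/Literature
touch run1/Literature/ABrothersPrice run1/Literature/OnlyInRun1
touch run2/Literature/ABrothersPrice run2/Literature/OnlyInRun2
(cd run1 && find . -type f | sort) > run1.txt
(cd run2 && find . -type f | sort) > run2.txt
echo "only in run 1:"; comm -23 run1.txt run2.txt
echo "only in run 2:"; comm -13 run1.txt run2.txt
```

comm -23 suppresses lines common to both files and lines unique to the second file, so each invocation prints one side of the difference.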
Every time I run it, many of the files show significant changes. Is that due to genuine editing of the site between scrapes, or due to some kind of problem figuring out which files are the same? I would expect the filename to work for that, but...?
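Checksums would separate "same page, genuinely edited" from bookkeeping confusion: hash the file from each run and treat the page as changed only when the hashes differ. A minimal sketch with dummy files standing in for two copies of the same page:

```shell
# Detect content changes between two copies of the same page by
# comparing checksums (md5sum here; any stable hash would do).
printf 'trope text v1\n' > page_run1.html
printf 'trope text v2\n' > page_run2.html
h1=$(md5sum page_run1.html | cut -d' ' -f1)
h2=$(md5sum page_run2.html | cut -d' ' -f1)
if [ "$h1" = "$h2" ]; then echo unchanged; else echo changed; fi
```

One caveat: if the site injects dynamic bits into the HTML (ads, counters, timestamps), even unedited pages could differ byte-for-byte between scrapes, so some hash differences may not be real edits.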
Tried again to slowly, politely get Main contents:
2021-01-11 15:41:33 (111 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Main/RabidCop’ saved [107803]
FINISHED --2021-01-11 15:41:33--
Total wall clock time: 19h 40m 23s
Downloaded: 17105 files, 2.0G in 19h 1m 0s (29.9 KB/s)
Download quota of 2.0G EXCEEDED!
Using the command wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k --no-parent tvtropes.org/pmwiki/pmwiki.php/Main/
So... I am not really sure how to get all the tropes without using up a ton of space. How did we get all the tropes before?
From talking to Phil, it seems like trying to make a copy of everything in Main and everything in Literature will take up too much space. Let's try getting every work from Literature, but not the ?edit and ?source pages, and then we can get the tropes that come up in those works.
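For skipping the edit/source pages, recent wget versions have a --reject-regex flag that filters URLs by pattern, e.g. --reject-regex 'action=(edit|source)'. I can't demo wget itself offline, but the regex can be sanity-checked against sample URLs with grep (the URLs below are taken from the log output above plus made-up variants):

```shell
# Check the pattern that would be passed to wget as:
#   --reject-regex 'action=(edit|source)'
# (requires a wget version that supports --reject-regex).
urls="tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine
tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=source
tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=edit"
kept=$(printf '%s\n' "$urls" | grep -Ev 'action=(edit|source)')
echo "$kept"
```

Only the plain page URL survives the filter, which is the behavior we want from the scrape.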
Going through Literature again finished with 9764 files.
2021-01-13 17:20:30 (120 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Literature/TheChampions’ saved [59046]
FINISHED --2021-01-13 17:20:30--
Total wall clock time: 5h 25m 47s
Downloaded: 9764 files, 598M in 4h 37m 19s (36.8 KB/s)
And again... finished with 9393 files.
2021-01-15 10:03:04 (99.8 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=source’ saved [45216]
FINISHED --2021-01-15 10:03:04--
Total wall clock time: 18h 39m 59s
Downloaded: 9393 files, 577M in 16h 21m 30s (10.0 KB/s)
We may come back to this later, but for now we are focusing on the character space project.
In https://github.com/jwzimmer/tv-tropes, we downloaded lots of content from tvtropes. BUT we didn't know what we were doing, so we may have gone about it in an impolite way. We also want to include content this time that we didn't get last time (works, tropes not tagged as such, php pages).
Would this work to considerately download the relevant things?
wget -r -A.html,.php --limit-rate=10k tvtropes.org
From googling it, I believe --limit-rate will keep my wget requests from being onerous to the tvtropes website.
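--limit-rate does cap the bandwidth per connection, but it's worth sanity-checking the implied wall-clock time: at 10 KB/s, the ~2.0 GB the earlier runs hit would take over two days. Quick arithmetic:

```shell
# Rough time estimate for downloading 2.0 GiB at a 10 KiB/s rate cap.
bytes=$((2 * 1024 * 1024 * 1024))   # 2.0 GiB
rate=$((10 * 1024))                 # 10 KiB/s, i.e. --limit-rate=10k
seconds=$((bytes / rate))
hours=$((seconds / 3600))
echo "${hours} hours"
```

That comes out to roughly 58 hours, so keeping the earlier 100k cap and adding pauses between requests with wget's --wait / --random-wait flags might be a better politeness trade-off than throttling all the way down to 10k.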
Updates: