jwzimmer-zz / tv-tropening


download tvtropes content #2

Closed jwzimmer-zz closed 3 years ago

jwzimmer-zz commented 3 years ago

In https://github.com/jwzimmer/tv-tropes, we downloaded lots of content from tvtropes. But we didn't know what we were doing, so we may have gone about it in an impolite way. This time we also want to include content we skipped last time (works, tropes not tagged as such, php).

Would this command download the relevant things considerately? wget -r -A.html,.php --limit-rate=10k tvtropes.org

From googling it, I believe --limit-rate will keep my wget requests from being onerous to the tvtropes website.
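A note on the flags, with a small sketch: --limit-rate caps bandwidth, while the pause between requests is controlled separately by --wait and --random-wait. The script below only assembles and prints a candidate command for review. The start URL and the 2-second wait are illustrative assumptions, not values anyone in this thread tested.

```shell
#!/bin/sh
# Sketch only: builds a politer crawl command and prints it instead of running it.
# START_URL and the wait value are assumptions chosen for illustration.
START_URL="https://tvtropes.org/pmwiki/pmwiki.php/Main/"

# --wait pauses between requests, --random-wait varies that pause,
# and --limit-rate caps bandwidth (a separate concern from request frequency).
POLITE_CMD="wget --recursive --no-parent --wait=2 --random-wait --limit-rate=100k $START_URL"

echo "$POLITE_CMD"
# To start a real crawl, run the printed command by hand.
```

wget also honours robots.txt by default during recursive downloads, so the site's own exclusions apply unless that behaviour is switched off.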

Updates:

jwzimmer-zz commented 3 years ago

Command used: wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k tvtropes.org

Process finally stopped:

--2021-01-08 20:54:50--  https://tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson
Reusing existing connection to tvtropes.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson’

tvtropes.org/pmwiki     [     <=>            ]  98.67K   103KB/s    in 1.0s    

2021-01-08 20:54:51 (103 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Main/MacGuffinSuperPerson’ saved [101038]

FINISHED --2021-01-08 20:54:51--
Total wall clock time: 1d 9h 28m 26s
Downloaded: 17302 files, 2.0G in 1d 8h 21m 33s (17.6 KB/s)
Download quota of 2.0G EXCEEDED!

jwzimmer-zz commented 3 years ago

This is a ton of files. And since the crawl stopped at the 2.0G quota after ~17000 files, it definitely doesn't include every trope. So maybe I need to be more specific about which folders I download. I definitely want Main, since we know most of the tropes are there, but which other folders do I need in order to get the tropes that aren't there?

E.g. https://github.com/jwzimmer/tv-tropening/blob/main/tvtropes.org/pmwiki/pmwiki.php/BigBad/Literature is about the BigBad trope, so we might want such folders even though they aren't in Main. On the other hand, we probably don't need AlliterativeName (https://github.com/jwzimmer/tv-tropening/blob/main/tvtropes.org/pmwiki/pmwiki.php/AlliterativeName/TropesAToE).

I don't think there's an obvious way to distinguish between folders we want and folders we don't care about... it looks like it has to be decided on a case by case basis.

To start with, I probably want at least: Literature, Main
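One possible way to limit a crawl to chosen folders is wget's -I (--include-directories) option, which takes a comma-separated list of directories to follow. The sketch below only prints such a command. It assumes the folders appear as path components under /pmwiki/pmwiki.php, as in the saved file paths, and it has not been tried against the site.

```shell
#!/bin/sh
# Sketch only: prints a crawl command restricted to chosen folders; nothing is downloaded.
# The folder list is the starting guess from this thread, not a complete answer.
BASE="/pmwiki/pmwiki.php"
INCLUDE_DIRS="$BASE/Main,$BASE/Literature"

# -I (--include-directories) limits recursion to the listed directories.
NAMESPACE_CMD="wget -r -I $INCLUDE_DIRS --wait=2 --limit-rate=100k https://tvtropes.org$BASE/Main/"
echo "$NAMESPACE_CMD"
```

Adding a case-by-case folder such as BigBad would mean appending one more entry to INCLUDE_DIRS.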

For whatever reason, although I thought passing wget the path ending in Literature/ would download every file in that directory, it only fetched index.html. So I passed in Literature/ABrothersPrice instead, and that did seem to work: I ended up with about 1490 files. I tried again with another book title, as wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k --no-parent tvtropes.org/pmwiki/pmwiki.php/Literature/ABootStompingAHumanFaceForever , which returned a vastly different number of files (about 11000). I'm not sure what to make of that.

Every time I run it, many of the files have significant changes. Is that due to genuine editing of the site between scrapes, or to some kind of problem figuring out which files are the same? I would expect the filename to work for that, but...?
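One way to investigate that question is to keep two runs in separate directories and compare them file by file. The sketch below assumes a hypothetical run1/ and run2/ layout and creates tiny stand-in files so it runs without a real crawl. On real data, looking at the diff of one page should show whether the changes are article text or something else on the page.

```shell
#!/bin/sh
# Hypothetical layout: each crawl saved under its own directory (run1/, run2/).
# Stand-in files are created here so the sketch runs without a real crawl.
WORK=$(mktemp -d)
mkdir -p "$WORK/run1/Main" "$WORK/run2/Main"
printf 'same text\n' > "$WORK/run1/Main/PageA"
printf 'same text\n' > "$WORK/run2/Main/PageA"
printf 'old text\n'  > "$WORK/run1/Main/PageB"
printf 'new text\n'  > "$WORK/run2/Main/PageB"

# diff -rq lists the files that differ between the two trees without printing the changes.
CHANGED=$(diff -rq "$WORK/run1" "$WORK/run2" | wc -l | tr -d ' ')
echo "files that differ: $CHANGED"

# Show the actual change in one differing file (diff exits non-zero when files differ).
diff "$WORK/run1/Main/PageB" "$WORK/run2/Main/PageB" || true
rm -rf "$WORK"
```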

jwzimmer-zz commented 3 years ago

Tried again to slowly, politely get Main contents:

2021-01-11 15:41:33 (111 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Main/RabidCop’ saved [107803]

FINISHED --2021-01-11 15:41:33--
Total wall clock time: 19h 40m 23s
Downloaded: 17105 files, 2.0G in 19h 1m 0s (29.9 KB/s)
Download quota of 2.0G EXCEEDED!

Using the command wget -r -Q2000m -R.gif,.php,.img,.jpg,.png,.jpeg,.pdf,.bmp,.tiff,.eps,.tmp,.mp4,.mov,.wmv,.flv,.avi,.mkv,.avchd,.css,.ico --limit-rate=100k --no-parent tvtropes.org/pmwiki/pmwiki.php/Main/

So... I am not really sure what to do to try to get all the tropes without using up a ton of space. How did we get all the tropes before?
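A rough size estimate may help with planning here. Both capped runs ended with "Download quota of 2.0G EXCEEDED!", so the quota stopped them, not the end of Main. Dividing the -Q2000m quota by the 17105 files reported in the log above gives an approximate average size per saved file. This assumes wget counts "m" as 1024 × 1024 bytes.

```shell
#!/bin/sh
# Figures copied from the log above: -Q2000m quota, 17105 files saved.
# Assumes the quota suffix "m" means 1024 * 1024 bytes.
QUOTA_BYTES=$((2000 * 1024 * 1024))
FILES=17105

# awk handles the floating-point division; the result is rounded to whole KB.
AVG_KB=$(awk -v q="$QUOTA_BYTES" -v n="$FILES" 'BEGIN { printf "%.0f", q / n / 1024 }')
echo "average size per saved file: about ${AVG_KB} KB"
```

Multiplying that average by an estimated page count would give a ballpark for the disk space a full crawl needs.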

jwzimmer-zz commented 3 years ago

From talking to Phil, it seems like trying to make a copy of everything in Main and everything in Literature is going to take up too much space. Let's try getting every work from Literature, but not the ?edit and ?source pages, and then we can get the tropes that come up in those works?
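A minimal sketch of how the edit and source pages might be skipped, using wget's --reject-regex option, which is matched against the complete URL. The pattern assumes those pages look like ...?action=source, as in a saved filename later in this thread. The ?action=edit form is a guess by analogy. The script checks the pattern offline against sample URLs and then prints a candidate command without running it.

```shell
#!/bin/sh
# The pattern is an assumption based on URLs such as ".../Carnivorine?action=source".
REJECT_PATTERN='[?]action='

# Offline check: filter sample URLs with the same pattern (no network access needed).
# The ?action=edit URL is a hypothetical example.
KEPT=$(printf '%s\n' \
  'https://tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine' \
  'https://tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=source' \
  'https://tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=edit' \
  | grep -v -e "$REJECT_PATTERN")
echo "$KEPT"

# The same pattern could then be handed to wget (printed here, not executed).
REJECT_CMD="wget -r --no-parent --wait=2 --limit-rate=100k --reject-regex '$REJECT_PATTERN' https://tvtropes.org/pmwiki/pmwiki.php/Literature/"
echo "$REJECT_CMD"
```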

jwzimmer-zz commented 3 years ago

Going through Literature again finished with 9764 files.

2021-01-13 17:20:30 (120 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Literature/TheChampions’ saved [59046]

FINISHED --2021-01-13 17:20:30--
Total wall clock time: 5h 25m 47s
Downloaded: 9764 files, 598M in 4h 37m 19s (36.8 KB/s)

jwzimmer-zz commented 3 years ago

And again... finished with 9393 files.

2021-01-15 10:03:04 (99.8 KB/s) - ‘tvtropes.org/pmwiki/pmwiki.php/Literature/Carnivorine?action=source’ saved [45216]

FINISHED --2021-01-15 10:03:04--
Total wall clock time: 18h 39m 59s
Downloaded: 9393 files, 577M in 16h 21m 30s (10.0 KB/s)

jwzimmer-zz commented 3 years ago

We may come back to this later, but for now we are focusing on the character space project.