jwzimmer-zz / tv-tropes

UVM Stat 287 Final Project repo - network of tropes from TV Tropes wiki
MIT License

use index dicts to make trope list to compare to list from Main directory #9

Closed by jwzimmer-zz 3 years ago

jwzimmer-zz commented 3 years ago

we could make a set of all the trope titles from all the indices pages (using the dicts) and check that against the list of tropes from the Main folder on gh? the idea being that if i missed something in one of those when i downloaded the site, i got it in the other.

@nguyenhphilip has a list from the main directory we can compare to

there seem to be just a few tropes and indices that are listed on https://tvtropes.org/pmwiki/index_report.php but are not in https://github.com/jwzimmer/tv-tropes/tree/main/tvtropes.org/pmwiki/pmwiki.php/Main (phil has another list for those few pages too)... that could be because the page is just not a complete record of the contents of the folder and vice versa, or it could be because i missed some of the pages when i was downloading the site. so we should do our best to verify that it isn't the latter (because obviously we don't want to miss any tropes in our analysis).

what about: https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html
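A minimal sketch of that comparison, under a few assumptions that are not confirmed in this thread: each index dict is stored as a JSON object mapping an index name to a list of trope titles, the titles are spelled the same way as the file names in Main, and the function names and the `*.json` pattern are placeholders.

```python
import json
from pathlib import Path


def titles_from_index_dicts(dict_dir, pattern="*.json"):
    """Collect every trope title found in the index dict files.

    Assumes (hypothetically) that each file is a JSON object of the
    form {index_name: [trope_title, ...]}.
    """
    titles = set()
    for path in Path(dict_dir).glob(pattern):
        with open(path, encoding="utf-8") as fh:
            index_dict = json.load(fh)
        for trope_list in index_dict.values():
            titles.update(trope_list)
    return titles


def titles_from_main(main_dir):
    """Use the file names in the Main folder as page titles."""
    return {p.name for p in Path(main_dir).iterdir() if p.is_file()}


def compare(index_titles, main_titles):
    """Report which titles appear on only one side."""
    return {
        "only_in_indices": sorted(index_titles - main_titles),
        "only_in_main": sorted(main_titles - index_titles),
    }
```

If either "only in" list comes back long, that points to pages that may have been missed in the download, or to a naming mismatch between the two sources.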

jwzimmer-zz commented 3 years ago

also: make a list from https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html

jwzimmer-zz commented 3 years ago

(overarching goal: satisfy ourselves everything we care about from the website is now on GH, re https://github.com/jwzimmer/tv-tropes/issues/1)

nguyenhphilip commented 3 years ago

after filtering out any files with 'Index' or 'Trope' in their filename, there are still 4936 files to work with, although it's very likely that some of the files in here still aren't only individual trope pages. I've stored the names of the pages with 'Index' or 'Trope' in them as separate JSON files in case we need to use them. It's unclear to me how many files/data objects we need to do a network analysis, but it seems like even with ~4000 individual trope pages we can probably find a large number of interesting connections between them.
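Roughly, the filename filter described above could look like the sketch below. The function name and the example file names are made up for illustration. The match is a plain case-sensitive substring test, which may not be exactly what was run.

```python
def split_by_keywords(filenames, keywords=("Index", "Trope")):
    """Separate names containing any keyword from the rest.

    Returns (kept, set_aside): `kept` are candidate individual trope
    pages, `set_aside` are names that mention 'Index' or 'Trope'.
    """
    kept, set_aside = [], []
    for name in filenames:
        if any(keyword in name for keyword in keywords):
            set_aside.append(name)
        else:
            kept.append(name)
    return kept, set_aside


# Hypothetical file names, for illustration only.
kept, set_aside = split_by_keywords(["AlwaysMale", "DeathTropes", "IndexOfX"])
```

A substring match is blunt: it also sets aside any individual trope whose name happens to contain "Trope" or "Index", so the set-aside list is worth skimming by eye.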

As for parsing the HTML itself, it looks like the main content of individual tropes is structured within a <div> with attribute id = 'main_article'. Links within main_article are nested in paragraphs <p> and seem to link only to other individual tropes, indexes, or trope category pages.
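A sketch of pulling those links out using only the standard library's `html.parser`. The class and function names are placeholders, and it assumes the page structure is as described above (a `<div id="main_article">` whose `<p>` elements hold the links). It is not tested against the real downloaded pages.

```python
from html.parser import HTMLParser


class MainArticleLinks(HTMLParser):
    """Collect href values of <a> tags inside <p> inside div#main_article."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside <div id="main_article">
        self.in_p = False    # True while inside a <p> within that div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside main_article
                self.in_p = False     # a <div> implicitly ends an open <p>
            elif attrs.get("id") == "main_article":
                self.div_depth = 1
        elif self.div_depth > 0:
            if tag == "p":
                self.in_p = True
            elif tag == "a" and self.in_p and attrs.get("href"):
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_p = False
        elif tag == "p":
            self.in_p = False


def extract_links(html_text):
    """Return the list of hrefs found in the main article paragraphs."""
    parser = MainArticleLinks()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A dedicated HTML library would probably be more robust against malformed markup. The standard-library parser is used here only to keep the sketch dependency-free.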

How to filter non-individual trope pages out?

nguyenhphilip commented 3 years ago

I made a list of all the items inside any file starting with 'txt_dictfrom', as these are the indices we want to check the contents of Main against. These are saved as json files as per the last commit.

If we remove the files in Main that are not in this 'txt_dictfrom' list, we are left with 3689 items. These items look like tropes though, so we might actually not want to remove them at all, or at least we'll have to go through with our eyes and see what else needs to be filtered. Seems like this will be easier once we determine more clearly what we're looking for, which may depend on the structure of the individual tropes themselves and how they link to other tropes. This might be easy though once we have a list of individual tropes, from Main or some combination of the various files we haven't checked yet (though I think Main is our best bet) since we can just filter links out that don't link to any other item in our individual trope list.
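The link filtering mentioned at the end could be sketched like this, assuming we already have a mapping from each trope page to the hrefs found on it, and that the last path segment of an href matches the page name in our trope list. Both are assumptions, and all names are hypothetical.

```python
def trope_name_from_href(href):
    """Take the last path segment of a link, ignoring query and fragment."""
    path = href.split("#", 1)[0].split("?", 1)[0]
    return path.rstrip("/").rsplit("/", 1)[-1]


def build_edges(links_by_trope, trope_names):
    """Keep only links whose source and target are both in the trope list.

    links_by_trope: {source_page_name: [href, ...]}
    Returns a sorted list of (source, target) pairs without self-links.
    """
    trope_names = set(trope_names)
    edges = set()
    for source, hrefs in links_by_trope.items():
        if source not in trope_names:
            continue
        for href in hrefs:
            target = trope_name_from_href(href)
            if target in trope_names and target != source:
                edges.add((source, target))
    return sorted(edges)
```

With this approach, index pages and category pages drop out of the network automatically, as long as they are not in the trope list.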

nguyenhphilip commented 3 years ago

The 'txt_dictfrom' items that are not in Main ALSO look like tropes, so these are definitely files not in Main that we may be interested in. That makes me wonder though... where are they if not in Main?

nguyenhphilip commented 3 years ago

OH ! So it looks like the folder pmwiki.php, which holds Main, holds other folders with other tropes in it as well... this makes the search a bit harder though since there appears to be random stuff in here as well, not just individual tropes. Which is to say, with this giant list, we need a way to QC. Might be more feasible to focus on using Main since the things that make it into Main are probably the primary files, which are presumably better maintained/more active pages?

Will need to look into the actual contents of the other folders inside pmwiki.php. Still betting on Main as our main data source.
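For a first look at those other folders, something like the following would count the files in each subfolder of pmwiki.php. The function name is a placeholder and the path is whatever the local checkout uses.

```python
from collections import Counter
from pathlib import Path


def count_pages_by_namespace(pmwiki_dir):
    """Count the files directly inside each subfolder of pmwiki_dir."""
    counts = Counter()
    for folder in Path(pmwiki_dir).iterdir():
        if folder.is_dir():
            counts[folder.name] = sum(1 for p in folder.iterdir() if p.is_file())
    return counts
```

Calling `count_pages_by_namespace(...).most_common()` lists the largest folders first, which gives a rough sense of how much would be excluded by sticking to Main.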

jwzimmer-zz commented 3 years ago

@nguyenhphilip this is great, thanks! I think that plan totally makes sense. Let's look (manually) at what kind of trope pages are in the folders besides Main in pmwiki.php and verify that they'll be reasonable to exclude. I think if we have some way to categorize those pages it will almost certainly be okay to look just at what's in Main, but let's have some handle on what we're excluding, you know? I'll do that now.

jwzimmer-zz commented 3 years ago

I think we might just want to use the tropes listed in https://tvtropes.org/pmwiki/pagelist_having_pagetype_in_namespace.php?n=Main&t=trope.

This is their manually categorized list of what counts as a trope, which might be both a justifiable and concise way to decide which articles to consider and which to exclude.

nguyenhphilip commented 3 years ago

Using above trope list, we were able to extract 27485 files for individual tropes !! This looks like all of them. Super exciting because we can investigate lots of research Qs! Step 1 extract actual contents of trope pages and explore?

jwzimmer-zz commented 3 years ago

Resolved by @nguyenhphilip in https://github.com/jwzimmer/tv-tropes/commit/61388ba0fe20a0cdf2b26fc0b72d4758cdbc6de7 ! : )

jwzimmer-zz commented 3 years ago

Per discussion with Phil, reopening briefly for one last hurrah of trying to decide what counts as a trope -

we need to decide between: using strictly what they've labeled as tropes as the masterlist OR using that PLUS some pages we identify as tropes that are in Main but not the masterlist. But the issue with the second option is that it pulls in many non-trope pages which we'll have to filter out.

if there are not too many "obvious tropes" like Always Male, we should just append them to the masterlist. But if there are a lot of files, and it isn't obvious what is a trope, a trope of tropes, an index, etc., then we should stick strictly with their masterlist as the definition of "trope".

jwzimmer-zz commented 3 years ago

https://github.com/jwzimmer/tv-tropes/commit/a180ca8b343acd198c0c73054d7f638fdc447984 results: Pages in Main but not in masterlist 2124 Pages in masterlist but not in Main 23842

(am i only looking at the first part of Main or something? or are there really that many things in the masterlist compared to main? well, either way, 2000 is a lot - so let's go with masterlist as definition of "trope", although indexes and metatropes are still useful, but...)
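One way to sanity-check whether counts like these partly reflect naming differences rather than genuinely missing pages is to redo the set differences after normalizing the names. This is only a sketch: the normalization rule (drop a trailing `.html`, drop non-alphanumerics, case-fold) is a guess, not something the commit above is known to do.

```python
def normalize(name):
    """Drop a trailing .html, remove non-alphanumerics, and case-fold."""
    name = name.strip()
    if name.lower().endswith(".html"):
        name = name[:-5]
    return "".join(ch for ch in name if ch.isalnum()).casefold()


def diff_counts(main_pages, masterlist, key=None):
    """Count names on only one side, optionally comparing via key(name)."""
    key = key or (lambda s: s)
    main_keys = {key(n) for n in main_pages}
    master_keys = {key(n) for n in masterlist}
    return {
        "main_not_master": len(main_keys - master_keys),
        "master_not_main": len(master_keys - main_keys),
    }


# Hypothetical names, for illustration only.
main_pages = ["AlwaysMale.html", "Foo"]
masterlist = ["Always Male", "Bar"]
raw = diff_counts(main_pages, masterlist)
normalized = diff_counts(main_pages, masterlist, key=normalize)
```

If the normalized counts are much smaller than the raw ones, the gap is at least partly a spelling issue. If they barely change, the pages really are on one side only.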

for the sake of having a stable definition let's go with masterlist - Phil agrees