jwzimmer-zz / tv-tropes

UVM Stat 287 Final Project repo - network of tropes from TV Tropes wiki
MIT License
3 stars 3 forks

Start making something from within the individual trope articles #8

Closed by jwzimmer-zz 3 years ago

jwzimmer-zz commented 3 years ago

Next big step is to start parsing the individual trope articles for connections.

jwzimmer-zz commented 3 years ago

Use https://github.com/jwzimmer/tv-tropes/tree/main/trope_list

jwzimmer-zz commented 3 years ago

https://github.com/jwzimmer/tv-tropes/commit/830c4c4df1c6a7f6a2f1b8dd90a6e1ee6831708a

My script seems to be doing what we want, BUT when I tried to run it on every trope html file in trope_list it froze my computer.

jwzimmer-zz commented 3 years ago

Well, https://github.com/jwzimmer/tv-tropes/blob/main/individualtropepage.py seems to work, but as mentioned above, running it on all the tropes kills my computer. So I tried running it on just the Z tropes: ok, no problem. Then on just the A tropes: computer froze. So I think we have the linked tropes for all the Z trope articles, but only some of the As, and an unknown portion of the rest. A better or more efficient strategy may be required. @nguyenhphilip, any ideas?
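One possible strategy, sketched here under assumptions: if the freeze comes from holding every parsed page in memory at once (a guess, the thread does not confirm the cause), then handling one file at a time and writing each result to disk right away keeps memory use roughly constant. The names `process_folder`, `iter_trope_files`, and the `extract` callback are hypothetical and are not taken from individualtropepage.py.

```python
import json
from pathlib import Path


def iter_trope_files(folder):
    """Yield the files in `folder` one at a time, in sorted order."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            yield path


def process_folder(folder, out_path, extract):
    """Run `extract` on each file and append one JSON line per file.

    Only one page is held in memory at a time. `extract` is a
    hypothetical callback that takes the page text and returns a list
    of linked trope names. `out_path` should live outside `folder` so
    the output file is not picked up as an input.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in iter_trope_files(folder):
            text = path.read_text(encoding="utf-8", errors="replace")
            record = {"trope": path.name, "links": extract(text)}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The return value is the number of files processed, which can be compared against the number of files in the folder afterwards.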

Example dict: https://github.com/jwzimmer/tv-tropes/blob/main/linked_trope_dict_from_ADayInTheLimelight.json

jwzimmer-zz commented 3 years ago

Done! Verified that I stepped through all the files in the trope_list folder: len(it.alltropes) == it.count evaluates to True. Only 1 error... I will open a case for handling whatever happened there.

jwzimmer-zz commented 3 years ago

Done except for issue with one page: https://github.com/jwzimmer/tv-tropes/issues/11

Should also randomly spot-check some pages to make sure the dicts are correct. I can do that tmrw.
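A minimal sketch of how the random spot-check could be set up. The function name `pick_spot_checks` and its parameters are hypothetical; the optional seed is only there so two people can reproduce the same sample.

```python
import random


def pick_spot_checks(trope_names, k=10, seed=None):
    """Return up to `k` distinct trope names chosen at random.

    Sorting first makes the result depend only on the seed and the set
    of names, not on the order in which the names were collected.
    """
    names = sorted(trope_names)
    rng = random.Random(seed)
    return rng.sample(names, min(k, len(names)))
```

Each selected page can then be opened by hand and its links compared against the stored dict.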

nguyenhphilip commented 3 years ago

Oh nice, you got it to work! I saw the comment from 2 hrs ago and started running a script of my own... then I just saw that you figured it out haha. Looking at issue 11, it looks similar to an issue I had, which is that some <a> links don't have the 'href' attribute, meaning they don't link to any page. The way I filtered those out was by using if trope.has_attr('href'). Anyway, it might be good to cross-reference our lists.
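The has_attr('href') check above reads like BeautifulSoup's Tag.has_attr, though that is an assumption since the script itself is not shown. The same filtering idea can be sketched with only the standard library's html.parser; `HrefCollector` and `collect_hrefs` are hypothetical names.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, skipping anchors without one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.skipped = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; an <a> with no href
        # (or an href with no value) yields None here.
        href = dict(attrs).get("href")
        if href is None:
            self.skipped += 1
            return
        self.hrefs.append(href)


def collect_hrefs(html_text):
    """Return (list of hrefs, number of <a> tags skipped for lacking one)."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs, parser.skipped
```

Counting the skipped anchors, rather than dropping them silently, makes it easier to tell whether a page like the one in issue 11 failed for this reason.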

jwzimmer-zz commented 3 years ago

Yes! That sounds great! : )
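The cross-referencing could be sketched as a per-trope set comparison. This assumes each of us ends up with a dict mapping a trope name to a list of linked trope names; the actual layout of the two lists is not shown in this thread, and `cross_reference` is a hypothetical helper.

```python
def cross_reference(links_a, links_b):
    """Report tropes whose linked-trope sets differ between two dicts.

    Both arguments map trope name -> iterable of linked trope names.
    Tropes present in only one dict show up with all of their links on
    that side. Tropes whose links match exactly are left out.
    """
    report = {}
    for trope in sorted(set(links_a) | set(links_b)):
        a = set(links_a.get(trope, []))
        b = set(links_b.get(trope, []))
        if a != b:
            report[trope] = {
                "only_in_a": sorted(a - b),
                "only_in_b": sorted(b - a),
            }
    return report
```

An empty report would mean the two scripts agree on every trope.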

nguyenhphilip commented 3 years ago

Link to the giant list

jwzimmer-zz commented 3 years ago

Making one list instead of a million files. Good call. I should have done that!
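For the per-trope files already written, a merge into one file might look like the sketch below. It assumes the files follow the linked_trope_dict_from_<TropeName>.json naming seen in the example dict earlier in this thread and that each holds valid JSON; `merge_trope_dicts` is a hypothetical name, and the contents are stored as-is since their exact shape is not shown here.

```python
import json
from pathlib import Path

PREFIX = "linked_trope_dict_from_"


def merge_trope_dicts(folder, out_path):
    """Combine per-trope JSON files into one JSON object keyed by trope.

    The trope name is taken from the file name after PREFIX. Files that
    do not match the naming pattern are ignored.
    """
    merged = {}
    for path in sorted(Path(folder).glob(PREFIX + "*.json")):
        trope = path.stem[len(PREFIX):]
        with open(path, encoding="utf-8") as f:
            merged[trope] = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return len(merged)
```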

nguyenhphilip commented 3 years ago

Haha we'll see how it works out when we load things into a data frame! I think it should be fine since I've loaded ~300 MB files into Python before though : )

jwzimmer-zz commented 3 years ago

I think what we've done so far counts as "start making some kind of something".