spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

V9: Links page capitalization #356

Open ivan-kuzma-scx opened 4 years ago

ivan-kuzma-scx commented 4 years ago

Hello @spencermountain, it seems that after upgrading to the new package version, the page property of links has different first-character capitalization.

For example, take the topic Apple and its 208th sentence.

doc.sentences(208).json();

For example, the link pointing to Food browning has a capitalized page property, while the others don't. On the wiki page they all point to capitalized pages.

0: Object {text: "enzyme", type: "internal", page: "enzyme"}
1: Object {text: "polyphenol oxidase", type: "internal", page: "polyphenol oxidase"}
2: Object {text: "browning", type: "internal", page: "Food browning"}
3: Object {text: "catalyzing", type: "internal", page: "catalysis"}
4: Object {text: "oxidation", type: "internal", page: "redox"}
5: Object {text: "o-quinones", type: "internal", page: "o-quinone"}

Could you please take a look?

spencermountain commented 4 years ago

hey, sorry about that. This did change in 8.0.0 and was buried in here

If it helps, in Wikipedia all pages are titlecased. If I remember correctly, this was done to simplify some of the text-output code. Maybe .json() should titlecase them, maybe .page() should titlecase them? I'm not sure. If anyone has any strong feelings one way or the other, let me know.

ivan-kuzma-scx commented 4 years ago

Thanks for the quick response and the idea of titlecasing. I tried it, but found a mismatch in link.page when it contains more than one word.

For example, take the topic Milwaukee Bucks and its doc.sentences(133).json();

Object { text: "2001", type: "internal", page: "2001 NBA Playoffs" }

The actual page is 2001 NBA playoffs, with a lowercase first character in the word playoffs.

spencermountain commented 4 years ago

agh, yeah you're right. shoot. I never thought of that. This will probably have to be added to a major release.
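
For anyone hitting this in the meantime, here is a minimal sketch of the distinction discussed above. It assumes that only the first character of a Wikipedia page title is case-insensitive and that the rest of the title is case-sensitive, which is why word-by-word titlecasing produces the mismatch reported for 2001 NBA playoffs. The helpers `upperFirst` and `titleCaseWords` are hypothetical and not part of wtf_wikipedia; the link objects are hand-written to mirror the shape of the .json() output quoted in this thread.

```javascript
// Hypothetical helper, not part of wtf_wikipedia's API.
// Uppercases only the first character and leaves the rest of the title as-is.
function upperFirst(title) {
  if (typeof title !== 'string' || title.length === 0) {
    return title
  }
  return title.charAt(0).toUpperCase() + title.slice(1)
}

// Hypothetical helper, shown only for contrast: uppercases every word.
function titleCaseWords(title) {
  return title.split(' ').map(upperFirst).join(' ')
}

// Hand-written link objects shaped like the .json() output quoted above.
const links = [
  { text: 'enzyme', type: 'internal', page: 'enzyme' },
  { text: 'browning', type: 'internal', page: 'Food browning' },
  { text: '2001', type: 'internal', page: '2001 NBA playoffs' },
]

// Normalize only the first character of each page title.
const normalized = links.map((link) => ({ ...link, page: upperFirst(link.page) }))

console.log(normalized.map((link) => link.page))
// [ 'Enzyme', 'Food browning', '2001 NBA playoffs' ]

console.log(titleCaseWords('2001 NBA playoffs'))
// 2001 NBA Playoffs  (no longer matches the actual page title)
```

Applying `upperFirst` on the consumer side leaves multi-word titles intact, whereas `titleCaseWords` changes playoffs to Playoffs and no longer matches the real title.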