spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
770 stars, 129 forks

Parsing sentences (Wikivoyage) #575

Closed: wginsberg closed 2 months ago

wginsberg commented 3 months ago

First - thanks for this project!

I noticed that there is an issue parsing the sentences on the Wikivoyage page for Sault Sainte Marie (Ontario).

Using the following code:

wtf(text)?.sentences()?.[0]?.text()

I am getting the result

'''[https://saulttourism.com/ Sault Ste.

It looks like the period after "Ste" is being interpreted as the end of the sentence, but I do not have the same problem with other pages. For example, St. Clairsville (Ohio) comes back as expected:

St. Clairsville is a city of 5,100 people (2020) in western Belmont County, Ohio.

This is the only instance of the issue I have found so far, so I guess it is not a huge deal. Though I wonder if it would be a nice feature to have a set of strings to pass to sentences() that would prevent breaking?

I.e. calling

doc.sentences(["string including a. period"])

would guarantee that no sentence breaks inside that string. Just a thought.
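To make the suggestion concrete, here is a rough standalone sketch of the idea. It does not use wtf_wikipedia at all: `splitSentences` and its `protectedStrings` parameter are hypothetical names, and the boundary rule (a `.`, `!` or `?` followed by whitespace) is a simplifying assumption, not how the library actually splits sentences.

```javascript
// Hypothetical helper, not part of wtf_wikipedia: split text into sentences,
// but never break inside any of the caller-supplied `protectedStrings`.
function splitSentences(text, protectedStrings = []) {
  // Record the [start, end) character range of every protected occurrence.
  const ranges = []
  for (const str of protectedStrings) {
    if (!str) continue
    let from = text.indexOf(str)
    while (from !== -1) {
      ranges.push([from, from + str.length])
      from = text.indexOf(str, from + str.length)
    }
  }
  const isProtected = (i) => ranges.some(([start, end]) => i >= start && i < end)

  const sentences = []
  let sentenceStart = 0
  // Assumed boundary rule: ., ! or ? followed by whitespace or end of text.
  const boundary = /[.!?](?=\s|$)/g
  let match
  while ((match = boundary.exec(text)) !== null) {
    // Skip punctuation that falls inside a protected string.
    if (isProtected(match.index)) continue
    const sentence = text.slice(sentenceStart, match.index + 1).trim()
    if (sentence) sentences.push(sentence)
    sentenceStart = match.index + 1
  }
  // Keep any trailing text that has no closing punctuation.
  const rest = text.slice(sentenceStart).trim()
  if (rest) sentences.push(rest)
  return sentences
}

const sample = 'Sault Ste. Marie is nice. Visit soon.'
console.log(splitSentences(sample))
console.log(splitSentences(sample, ['Ste. Marie']))
```

With no protected strings the sample breaks after "Ste."; passing `['Ste. Marie']` keeps the first sentence whole.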

spencermountain commented 3 months ago

hey Will, ya, you're right. we have a list of hard-coded abbreviations, and no way to augment it on the fly right now, and that's a lousy solution. Happy to take a look at this. A proper solution may require a breaking change. A quicker one may just be adding 'ste' to the list. cheers
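For readers following along, a minimal sketch of how an abbreviation-list splitter behaves, and why "St." can survive while "Ste." does not. This is not the library's code: `DEFAULT_ABBREVIATIONS`, `splitWithAbbreviations`, and the `extra` parameter are made-up names, and the list contents are assumed for illustration only.

```javascript
// Illustrative list only; the library's real abbreviation list is not reproduced here.
const DEFAULT_ABBREVIATIONS = ['st', 'mr', 'mrs', 'dr']

// Hypothetical splitter: break on sentence punctuation, then re-join any
// chunk that ends with a known abbreviation. `extra` lets a caller augment the list.
function splitWithAbbreviations(text, extra = []) {
  const known = new Set([...DEFAULT_ABBREVIATIONS, ...extra].map((a) => a.toLowerCase()))
  // Naive first pass: break after ., ! or ? when followed by whitespace.
  const chunks = text.split(/(?<=[.!?])\s+/)
  const sentences = []
  let pending = ''
  for (const chunk of chunks) {
    // Note: re-joining with a single space does not preserve original whitespace.
    pending = pending ? pending + ' ' + chunk : chunk
    // Last word of the running sentence, without its trailing period.
    const found = pending.match(/\b([a-z]+)\.$/i)
    const endsWithAbbreviation = found !== null && known.has(found[1].toLowerCase())
    if (!endsWithAbbreviation) {
      sentences.push(pending)
      pending = ''
    }
  }
  if (pending) sentences.push(pending)
  return sentences
}

const sample = 'Sault Ste. Marie is a city. St. Clairsville is in Ohio.'
console.log(splitWithAbbreviations(sample))
console.log(splitWithAbbreviations(sample, ['ste']))
```

In this sketch the default list already covers "St." but not "Ste.", so only the first sentence breaks early; passing `['ste']` joins it back up. That is the kind of on-the-fly augmentation being discussed, under the assumptions above.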

spencermountain commented 2 months ago

hey Will, I've added 'ste' as a known abbreviation, and am putting augmenting this list as a table-stakes feature for v11. you should see the change in 10.3.2. cheers