janetzki / fact_extraction

Fact Extraction from Text

BeautifulSoup mixes up different articles #77

Closed janetzki closed 7 years ago

janetzki commented 7 years ago

Articles start again inside themselves: partway through the text, the beginning of the same article reappears. E.g. for https://en.wikipedia.org/wiki/Akira_Kurosawa:

references to the Nagasaki bombing came from the director rather than from the book. This person | name = Akira Kurosawa<br/>{{lang|ja|黒澤 明}} | image = Akirakurosawa-onthesetof7samurai-1953-page88.jpg | caption = Akira Kurosawa on the set of ''Seven Samurai'' in 1953

janetzki commented 7 years ago

It seems that the problem lies not in the dump but in incorrect parsing with BeautifulSoup.
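One way to check that the dump itself is intact is to read it with a streaming XML parser instead of BeautifulSoup and compare the article text. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`. It assumes a MediaWiki-style export layout (`page`, `title`, `text` elements); `iter_articles` and `SAMPLE_DUMP` are hypothetical names, and the sample is a tiny made-up stand-in, not the real dump.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(dump_file):
    """Yield (title, wikitext) pairs from a MediaWiki-style XML export.

    Assumption: each <page> contains one <title> and one <text> element.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace prefix, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page; real dumps are large


# Hypothetical two-article sample standing in for the real dump.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Alan Turing</title>
    <revision><text>Dangerous Knowledge (2008).&lt;ref&gt;{{cite web|title=Dangerous Knowledge}}&lt;/ref&gt;</text></revision>
  </page>
  <page><title>Alexander the Great</title>
    <revision><text>provided the Temple of the Nymphs</text></revision>
  </page>
</mediawiki>"""

articles = dict(iter_articles(io.BytesIO(SAMPLE_DUMP)))
for title, text in articles.items():
    print(title, "->", text)
```

With this approach each page's text is read within its own `page` element, and the escaped `&lt;ref&gt;` is decoded once as plain text rather than being re-parsed as markup.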

janetzki commented 7 years ago

It's even worse: BeautifulSoup mixes articles up, e.g., https://en.wikipedia.org/wiki/Alan_Turing and https://en.wikipedia.org/wiki/Alexander_the_Great:

Turing was one of four mathematicians examined in the BBC documentary entitled ''Dangerous Knowledge'' (2008).&lt;ref&gt;{{cite web|url=http://www.bbc.co.uk/bbcfour/documentaries/features/dangerous-knowledge.shtml|title=Dangerous Knowledge|publisher=BBC Four|date=11 June 2008|accessdate=25 September 2009}}&lt;/ref&gt;

becomes:

Turing was one of four mathematicians examined in the BBC documentary entitled ''Dangerous Knowledge'' (2008).<ref>{ l e ] ] a n d p r o v i d e d t h e T e m p l e o f t h e N y m p h s a t

janetzki commented 7 years ago

It isn't even deterministic, but it occurs at the same place again, this time mixed up with https://en.wikipedia.org/wiki/Arthur_Schopenhauer:

Turing was one of four mathematicians examined in the BBC documentary entitled ''Dangerous Knowledge'' (2008).&lt;ref&gt;{ o n t a i n e d s u p e r h u m a n c o n c e p t s . T h e U p a n i s h a d s w a s a g r e a t s o u r c e o f i n s p i r a t i o n t o S c h o p e n h a u e r . W r i t i n g a b o u t t h e m , h e s a i d :
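To confirm that the output varies between runs on identical input, the extraction step can be repeated and its outputs compared by hash. This is a rough sketch only: `distinct_outputs` and `stable_extract` are hypothetical names, and `stable_extract` is a placeholder for whatever function actually parses an article in this project.

```python
import hashlib


def fingerprint(text):
    """Return a short, comparable digest of an extraction result."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def distinct_outputs(extract, source, runs=5):
    """Call extract(source) `runs` times and collect the distinct digests.

    A single digest means the runs agreed; more than one means the
    extraction produced different text for the same input.
    """
    return {fingerprint(extract(source)) for _ in range(runs)}


def stable_extract(source):
    # Hypothetical stand-in for the real parsing step.
    return source.strip()


digests = distinct_outputs(stable_extract, "  Turing was one of four mathematicians  ")
is_deterministic = len(digests) == 1
print("deterministic" if is_deterministic else "output differs between runs")
```

Running the same check on the real parsing function for the Alan Turing article would show whether the corrupted text changes from run to run even though its position stays the same.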