jwngr / sdow

Six Degrees of Wikipedia
https://www.sixdegreesofwikipedia.com
MIT License
1.77k stars 91 forks

Contextually show where links can be found in the Wikipedia pages themselves #39

Open DyeffersonAz opened 6 years ago

DyeffersonAz commented 6 years ago

It would be helpful to show where on the page each link was found, because sometimes I can't find the link in the page.

jwngr commented 6 years ago

Thanks for the suggestion! I agree it would be a cool feature, but given the data source I'm using, it is not really easy to do. I don't ever actually see the full text of the Wikipedia page itself, just the Wikipedia database containing all the links. So I can't easily show you the context around where the link shows up in the actual page. Also, since the database is only updated monthly, it is possible the link is actually no longer on the page itself as it may have been edited since the latest database dump. Maybe I'll figure out a way to do this in the future, but for now, this is not feasible with my current architecture.

DyeffersonAz commented 6 years ago

Couldn't you fetch the HTML of the page?

jwngr commented 6 years ago

I definitely could try something like that and I honestly think that is the way this would need to be implemented. But it wouldn't be very efficient and the system currently doesn't ever look at the raw HTML.

DyeffersonAz commented 6 years ago

Also, it would be better than needing to dump the database many times; it'd be automatic.

jwngr commented 6 years ago

There is no way to run the actual search algorithm against live pages, as it would take way too long. Thousands to tens of thousands of pages need to be touched. What I was referring to was just pulling the context for a single page when you, for example, click on it in the graph view.

DyeffersonAz commented 6 years ago

Yep

Quifisto commented 4 years ago

Maybe you could look through the HTML after the search has completed. Then do some web scraping to look for the link on the page and return the title of the section or subsection it was found in.
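To make the idea concrete, here is a rough sketch of the "find the section a link lives in" step, using only Python's standard library. It is not how SDOW works today; the names `LinkSectionFinder` and `find_link_sections` and the sample HTML are made up for illustration. It assumes the page HTML has already been fetched, that article links look like `/wiki/Page_Title`, and that section titles sit in plain `<h2>` to `<h6>` tags. Real Wikipedia markup may differ, and redirects are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class LinkSectionFinder(HTMLParser):
    """Remember the most recent section heading and record it for each matching link."""

    # <h1> is skipped on purpose: it usually holds the page title, not a section.
    HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self, target_title):
        super().__init__()
        # Assumption: article hrefs use underscores in place of spaces.
        self.target_href = "/wiki/" + target_title.replace(" ", "_")
        self.current_heading = None  # text of the last completed heading
        self._in_heading = False
        self._heading_parts = []
        self.sections = []  # one entry per occurrence of the target link

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Drop any "#fragment" and decode percent-escapes before comparing.
            if unquote(href.split("#")[0]) == self.target_href:
                self.sections.append(self.current_heading or "(lead section)")

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self.current_heading = "".join(self._heading_parts).strip()

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)


def find_link_sections(html, target_title):
    """Return the heading of every section in `html` that links to `target_title`."""
    finder = LinkSectionFinder(target_title)
    finder.feed(html)
    return finder.sections


# Hypothetical stand-in for a fetched page.
SAMPLE_HTML = """
<h1>Python (programming language)</h1>
<p>Python was created by <a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>.</p>
<h2>History</h2>
<p>Development began in the late 1980s.</p>
<h2>Influences</h2>
<p>It borrowed ideas from <a href="/wiki/ABC_(programming_language)">ABC</a>
and was again shaped by <a href="/wiki/Guido_van_Rossum">its creator</a>.</p>
"""

print(find_link_sections(SAMPLE_HTML, "Guido van Rossum"))
# prints ['(lead section)', 'Influences']
```

Since this would only run for the handful of pages in a finished path, not for the thousands touched during the search, the cost of fetching and parsing should stay small. An empty result would also be a hint that the link has been edited out of the live page since the last dump.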

DyeffersonAz commented 4 years ago

> Maybe you could look through the HTML after the search has completed. Then do some web scraping to look for the link on the page and return the title of the section or subsection it was found in.

This is what I was suggesting: a way for SDOW to go to the live Wikipedia page, search for each link, and then return the parent header of that <a> element, for example. Unfortunately, I don't have the web development knowledge to help with this yet.

xavzz commented 7 months ago

That would be very nice, since I cannot find any of the links shown in the results on either of the pages requested.