ukwa / shine

Prototype SOLR-powered web archive exploration UI.
https://github.com/ukwa/shine/wiki
Apache License 2.0

Website Title filter #52

Open peterwebster opened 9 years ago

peterwebster commented 9 years ago

Hi @anjackson: could you confirm how the Website Title facet field is derived? Presumably everything within the host news.bbc.co.uk has Website Title = BBC News?

If so, I think it is not being applied at the minute: see

http://www.webarchive.org.uk/shine/search/advanced?query=%22rowan+williams%22&action=search&websiteTitle=BBC+News&sort=content_type_norm&order=asc

I thought that these were picking up the terms 'BBC' and 'News' from the page title, but that doesn't seem to explain all the results.

If I use an exact phrase, the results suggest that it is indeed searching the page title:

http://www.webarchive.org.uk/shine/search/advanced?query=%22rowan+williams%22&action=search&websiteTitle=%22BBC+News%22&sort=content_type_norm&order=asc
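
For anyone comparing the two searches, a small sketch that decodes the `websiteTitle` parameter from each URL may help. It uses only the Python standard library; the helper name `query_params` is made up for illustration and is not part of Shine.

```python
from urllib.parse import urlsplit, parse_qs

# The two search URLs quoted in this comment, copied verbatim.
TERMS_URL = (
    "http://www.webarchive.org.uk/shine/search/advanced"
    "?query=%22rowan+williams%22&action=search"
    "&websiteTitle=BBC+News&sort=content_type_norm&order=asc"
)
PHRASE_URL = (
    "http://www.webarchive.org.uk/shine/search/advanced"
    "?query=%22rowan+williams%22&action=search"
    "&websiteTitle=%22BBC+News%22&sort=content_type_norm&order=asc"
)


def query_params(url):
    """Hypothetical helper: return the decoded query parameters of a URL."""
    return parse_qs(urlsplit(url).query)


terms_params = query_params(TERMS_URL)
phrase_params = query_params(PHRASE_URL)

# Decoded value sent as the Website Title filter in each search.
terms_value = terms_params["websiteTitle"][0]
phrase_value = phrase_params["websiteTitle"][0]

# Names of any parameters whose values differ between the two URLs.
differing = sorted(
    key
    for key in set(terms_params) | set(phrase_params)
    if terms_params.get(key) != phrase_params.get(key)
)

print(repr(terms_value))
print(repr(phrase_value))
print(differing)
```

The only parameter that differs is `websiteTitle`: the first search sends two bare terms, the second sends the same words wrapped in double quotes.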

peterwebster commented 9 years ago

However, the results are not the same when I query the Page Title with the same term, so it isn't a simple copying error between fields.

peterwebster commented 9 years ago

I'll have another look at these test cases.

@kinmanli to confirm what is happening

peterwebster commented 9 years ago

OK: this search is (I think) for 'BBC' and 'News' from anywhere in the page title:

http://www.webarchive.org.uk/shine/search/advanced?query=%22rowan+williams%22&sort=content_type_norm&sort=content_type_norm&order=asc&websiteTitle=BBC%20News&tab=results&action=search&mode=

but I don't understand the 8th result: 'Cynical Chatter from the Underworld: BBC'
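
One possible explanation, which nothing in this thread confirms, is that the unquoted terms are combined with OR, so a title containing only 'BBC' would still match. The sketch below is plain Python rather than Solr, and all function names are hypothetical. It only illustrates how any-term, all-terms and exact-phrase matching would treat that title.

```python
import re


def tokens(text):
    """Lowercase word tokens; a stand-in for a real analyzer (assumption)."""
    return re.findall(r"\w+", text.lower())


def matches_any_term(title, query):
    """True if at least one query term occurs in the title (OR semantics)."""
    title_tokens = set(tokens(title))
    return any(term in title_tokens for term in tokens(query))


def matches_all_terms(title, query):
    """True if every query term occurs in the title (AND semantics)."""
    title_tokens = set(tokens(title))
    return all(term in title_tokens for term in tokens(query))


def matches_phrase(title, phrase):
    """True if the phrase tokens occur contiguously, in order, in the title."""
    title_tokens = tokens(title)
    phrase_tokens = tokens(phrase)
    width = len(phrase_tokens)
    return any(
        title_tokens[i:i + width] == phrase_tokens
        for i in range(len(title_tokens) - width + 1)
    )


# The puzzling 8th result quoted above.
TITLE = "Cynical Chatter from the Underworld: BBC"

print(matches_any_term(TITLE, "BBC News"))
print(matches_all_terms(TITLE, "BBC News"))
print(matches_phrase(TITLE, "BBC News"))
```

Under these assumptions, only the any-term variant matches this title, which would be consistent with the result appearing for the unquoted search.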

peterwebster commented 9 years ago

After looking at the exact phrase, it does seem clear that this is looking at the page title. So, we ought to think about renaming the field to 'Page Title' (not a perfect fit, but less misleading than 'Website Title'). What does @anjackson think?

anjackson commented 9 years ago

Agreed, rename to Page Title or Resource Title.