netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.
Apache License 2.0

Is PageRank enabled in SolrWayback? #425

Open cslr opened 1 year ago

cslr commented 1 year ago

Hello

Does the SolrWayback implementation have a PageRank algorithm enabled? Based on a small crawl and indexing run, it doesn't seem to have PageRank as I describe it here (Markov chains in mathematical terms):

https://medium.com/codex/better-voting-results-through-markov-chains-d8aa46c79870

I would be interested in implementing a (better) PageRank if none is enabled, i.e. as a Markov chain limiting distribution. I would need to study existing PageRank algorithms and develop a method to approximately calculate the Markov chain's limiting distribution.

thomasegense commented 1 year ago

We use Solr for search and ranking, and under the hood Solr uses term frequency and inverse document frequency (TF/IDF), which does the job perfectly for text searches.

You can look at solrconfig.xml in the solr folder and find the edismax request handler. The only options are basically boosts on fields.
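For illustration, here is a minimal sketch of such a field boost sent as edismax query parameters to the Solr API. The collection name `netarchivebuilder` and the field names are assumptions, not taken from this repo's solrconfig.xml:

```python
import requests

# Hypothetical edismax query with per-field boosts.
# Collection and field names are assumptions; check your own schema/solrconfig.xml.
SOLR = "http://localhost:8983/solr/netarchivebuilder/select"

params = {
    "q": "climate change",
    "defType": "edismax",
    "qf": "title^4 url^2 content",   # matches in title/url count more than body text
    "fl": "url,title,score",
    "rows": 10,
}
for doc in requests.get(SOLR, params=params).json()["response"]["docs"]:
    print(doc.get("score"), doc.get("url"))
```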

But Solr can also support more complex queries (JSON facets) and sorting. Sorting by in-going links is probably not possible, though, since documents do not have a counter for how many links point to that page. (And such a counter would have to be updated whenever a new document is indexed.)

So my guess is it would require multiple Solr requests and probably the use of JSON facets, so it would be very slow. Combining both text relevance and this ranking counter is the issue. If you have SolrWayback installed you have API access to the Solr admin page (localhost:8983/solr) where you can do everything, so it would be possible to try to implement this with some custom code that calls the API. But for a large corpus (300M+ documents) there will be a huge performance issue.
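To make the cost concrete, a hedged sketch of that query-time approach: one extra Solr request per candidate result, counting documents whose outgoing-links field contains the candidate URL. The `links` field and the collection name are assumptions about the warc-indexer schema:

```python
import requests

SOLR = "http://localhost:8983/solr/netarchivebuilder/select"

def inlink_count(url: str) -> int:
    """Count documents whose (assumed) 'links' field contains the given URL."""
    params = {"q": 'links:"%s"' % url, "rows": 0}
    return requests.get(SOLR, params=params).json()["response"]["numFound"]

# One extra request per candidate result is what makes this prohibitively slow
# at query time on a 300M+ document corpus.
for url in ["https://example.org/", "https://example.org/about"]:
    print(inlink_count(url), url)
```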

There are JSON facet calls in the 'NetarchiveSolrClient' class in the project if you want to see how the API is called.

The Gephi graph in the toolbox can make a visualization of a search result with page rank, but this requires you to export a CSV and use Gephi. Gephi can use the page ranking to increase the size of the nodes, but this is a long, multi-step process. Here is a visualization of a whole year (the query is 'crawl_year:2018' etc., but it can be any text): https://labs.statsbiblioteket.dk/linkgraph/2018/ All the tools you need are in Solr, but performance is the big issue. So having it as a search option in SolrWayback is very unlikely.

@tokee can have the final say, he is the expert in Solr.

thomasegense commented 1 year ago

@cslr A complication is that you probably also have to look at unique URLs. Some URLs we harvest every 10 minutes (online news). Those sites would generate an enormous PageRank for the pages they link to, only because we have harvested them thousands of times more often than, say, a WordPress page.
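One hedged way to dampen that effect would be to count unique linking pages rather than raw captures, e.g. with a `unique()` aggregation in a JSON facet. The `links` and `url_norm` field names are assumptions about the warc-indexer schema:

```python
import requests

SOLR = "http://localhost:8983/solr/netarchivebuilder/select"

def unique_inlink_sources(url: str) -> int:
    """Count distinct source pages linking to a URL, ignoring repeated harvests."""
    body = {
        "query": 'links:"%s"' % url,
        "limit": 0,
        "facet": {"sources": "unique(url_norm)"},  # de-duplicate repeated captures
    }
    return requests.post(SOLR, json=body).json()["facets"]["sources"]

print(unique_inlink_sources("https://example.org/"))
```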

tokee commented 1 year ago

First of all, the giant inherent problem of improving search for netarchives is that websites have been trying to cheat Google and other search engines at that game for more than a decade. That is not to say that the current TF/IDF approach cannot be improved, but whenever a more sophisticated approach is introduced, chances are that some spam sites have already optimized towards it and will be part of most search results. TF/IDF is vulnerable to keyword injection, but at least that only works for the particular subject being injected, not for "everything" as PageRank does.

TF/IDF is a first-level whole-corpus ranking: it does rely on knowledge of the terms in the whole corpus, but that knowledge is immediately available as the corpus is being built.
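For reference, the textbook form of that weight (Lucene's actual similarity adds length normalization and newer versions default to BM25, so treat this as a sketch rather than what Solr literally computes):

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\log\frac{N}{\mathrm{df}(t)}$$

Both $N$ (documents in the corpus) and $\mathrm{df}(t)$ (documents containing term $t$) are read straight off the index, which is why this first-level ranking needs no separate graph computation.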

PageRank is a second-level whole-corpus ranking: each node (page) has a weight that contributes to other nodes through the edges between them, and each weight is in turn derived from edges. To get the mathematically correct PageRank we would have to resolve and traverse the full link graph from every node. In practice it is a lot less work, and the Markov chain approach that @cslr points to seems to be one such variant. However, as @thomasegense describes, it would still be very expensive to do at query time on the current index structure.
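A minimal power-iteration sketch of that idea, purely illustrative (a toy in-memory graph, nothing like the scale of a real netarchive link graph):

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: {page: [pages it links to]}; pages may also appear only as targets."""
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling page: spread its weight uniformly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```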

To make PageRank work effectively for search, the weights must be pre-computed. Solr does support efficient in-place updating of numeric fields, so depending on the sophistication level of the PageRank it could be done as one or more sweeps over the index. A boosting query could then use the ranks to affect the scoring for search.
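A hedged sketch of that pre-compute-then-boost pattern, assuming a numeric docValues field named `page_rank` has been added to the schema (it is not there by default) and using standard Solr atomic updates plus an edismax multiplicative boost:

```python
import requests

SOLR = "http://localhost:8983/solr/netarchivebuilder"

def store_ranks(ranks):
    """ranks: {solr_doc_id: score}. Uses atomic 'set' updates; for true in-place
    updates the field must be a non-stored, non-indexed, docValues-only numeric."""
    docs = [{"id": doc_id, "page_rank": {"set": score}} for doc_id, score in ranks.items()]
    requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()

def boosted_search(query):
    """Multiply the text relevance score by a dampened function of the stored rank."""
    params = {
        "q": query,
        "defType": "edismax",
        "boost": "log(sum(1,page_rank))",
        "rows": 10,
    }
    return requests.get(f"{SOLR}/select", params=params).json()
```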

The Royal Danish Library "freezes" the overall index in chunks and depending on the specifics, it might not be possible to perform in-place updates (we'll have to check). For a fully live index it will work fine.

Having a netarchive-focused PageRank engine would potentially be a great contribution. "Potentially" because of the caveat with spam sites mentioned above. As the work is 95% on the index side, this would also benefit other users of the webarchive-discovery indexer.

thomasegense commented 1 year ago

The paper you linked still has the weakness that 'fake' links will have a huge effect even though they are penalized/averaged with the Markov chain approach.

The linkgraph I shared was made with Gephi, and node size was calculated from 'betweenness centrality'. See: https://neo4j.com/docs/graph-data-science/current/algorithms/betweenness-centrality/#:~:text=Betweenness%20centrality%20is%20a%20way,of%20nodes%20in%20a%20graph.
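Outside Gephi, the same measure can be sketched in a few lines with networkx on a toy graph (illustrative only; a real link graph would be exported from Solr as described above):

```python
import networkx as nx

# Tiny toy directed graph standing in for an exported link graph.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(nx.betweenness_centrality(G))
```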

Boosting this ranking is hard, and it cannot be manipulated by setting up a lot of isolated link portals.

Implementing the algorithm in the paper may not give the outcome you hope for.

anjackson commented 1 year ago

I'm not sure if this would work directly in Solr, or be part of webarchive-discovery, or (perhaps more likely) as an intervening stage, e.g. taking the JSON from the warc-indexer and processing it before sending it to Solr with a populated PageRank field. I imagine it will depend on the details of the ranking algorithm.
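As a very rough sketch of such an intervening stage (the one-JSON-document-per-line input format, the `url_norm` key and the `page_rank` field name are all assumptions about the warc-indexer output and the target schema):

```python
import json
import sys

# Pre-computed ranks produced elsewhere, e.g. by the power-iteration sketch above.
ranks = json.load(open("ranks.json"))   # {normalised_url: rank}

for line in sys.stdin:                  # one indexer JSON document per line (assumed format)
    doc = json.loads(line)
    doc["page_rank"] = ranks.get(doc.get("url_norm"), 0.0)
    print(json.dumps(doc))              # forward the enriched document on to Solr
```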

Along with those raised above, some additional issues that spring to mind:

  1. Web archives have a time axis. It's not clear if PageRank should be calculated over the whole time range of a collection, or broken down into narrower periods. Does it make sense to use the 2023 PageRank for a document from 1996? I think not, but then what should the temporal range be? Should longer-lived URLs contribute more than short-lived ones? How?
  2. Except for perhaps the Internet Archive, any archive is a very small portion of the global web. As such, the resulting PageRank from an archival collection would likely be very different from the result of ranking across the whole global web. Given that the global distribution of links across hosts is more like a power law than a flat line, it is difficult to predict what taking a slice of it will do.
  3. If we don't have the coverage to do meaningful URL-to-URL ranking, would some kind of aggregation be more helpful? e.g. could we count the number of links between hosts and use a kind of 'HostRank' to boost results? (A sketch of this idea follows after this list.)
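A hedged sketch of that host-level aggregation, assuming the warc-indexer fields `host` (the page's host) and `links_hosts` (hosts it links out to): a nested JSON facet builds a small host-to-host graph that the earlier power-iteration sketch could then rank.

```python
import requests

SOLR = "http://localhost:8983/solr/netarchivebuilder/select"

def host_link_graph(limit=1000):
    """Build {host: [hosts it links to]} from a nested JSON terms facet."""
    body = {
        "query": "*:*",
        "limit": 0,
        "facet": {
            "hosts": {
                "type": "terms", "field": "host", "limit": limit,
                "facet": {"targets": {"type": "terms", "field": "links_hosts", "limit": limit}},
            }
        },
    }
    buckets = requests.post(SOLR, json=body).json()["facets"]["hosts"]["buckets"]
    return {b["val"]: [t["val"] for t in b.get("targets", {}).get("buckets", [])]
            for b in buckets}

graph = host_link_graph()
# host_ranks = pagerank(graph)  # reuse the earlier power-iteration sketch as a 'HostRank'
```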

It would be good to have some kind of implementation that would allow these issues to be explored.

That said, one unfortunate limitation is that unlike some Information Retrieval fields, we don't have any kind of evaluation corpus of what 'good search results' look like. So I'm not sure how you'd tell if any particular choice of ranking factors is better than another. Depending on your background/context, it might make sense to apply for funding from the IIPC (https://netpreserve.org/projects/) and try to find IIPC members who would be willing to evaluate the results.