readthedocs / readthedocs.org

The source code that powers readthedocs.org
https://readthedocs.org/
MIT License

Refresh search index after pages have been (re)moved #2013

Closed prashanthpai closed 6 years ago

prashanthpai commented 8 years ago

Details

Gluster uses RTD to host its documentation. We noticed that search results point to old pages that have been removed or moved. How can the search index be rebuilt to reflect the actual pages in the repo?

Example search query: https://readthedocs.org//search/?q=DHT&check_keywords=yes&area=default&project=gluster&version=latest&type=file

The results of the above search query contain links that are dead because the pages have been removed.

Thanks.

agjohnson commented 8 years ago

There is some code that should be updating the indexes -- including deleting removed pages -- but there should be a nuclear option here as well, to rebuild the index.

prashanthpai commented 8 years ago

@agjohnson If I understand correctly, the part of the code that should update the index is broken (a bug), and the nuclear option to rebuild the index from scratch is an enhancement (a workaround for the bug) targeted for the future?

Search is very important to users. Documents change all the time, and the index should reflect that, at least eventually if rebuilding the index is an expensive backend operation.

Thanks :+1:

agjohnson commented 8 years ago

The index is updated as expected -- that is, all updated files get updated in the search index -- but we need to make some effort to detect deleted files in the repo and remove them from the index. This is a missing feature currently.
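The missing step described here amounts to a set difference between the paths already in the index and the paths produced by the latest build. A minimal sketch, assuming the index can be modeled as a dict keyed by page path; the name `sync_index` and the data shapes are hypothetical and are not taken from the Read the Docs code:

```python
def sync_index(index, built_pages):
    """Update `index` in place from `built_pages` and drop stale entries.

    Both arguments map page path -> page content (a hypothetical model).
    Returns the sorted list of paths removed because the latest build
    no longer contains them.
    """
    # Paths that are indexed but absent from the new build were deleted or moved.
    stale = sorted(set(index) - set(built_pages))
    for path in stale:
        del index[path]
    # Add new pages and overwrite changed ones.
    index.update(built_pages)
    return stale


index = {"dht.md": "old DHT text", "removed.md": "page deleted from the repo"}
removed = sync_index(index, {"dht.md": "new DHT text", "intro.md": "intro"})
print(removed)        # ['removed.md']
print(sorted(index))  # ['dht.md', 'intro.md']
```

This only works if the build reliably reports the full set of current pages, which is the part that may be hard to guarantee in practice.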

I say rebuild the index, but I meant wiping the index of the project + version build, and updating the index with the new build. This might be the most resilient way around this, since deletion detection might be hard to ensure.
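The wipe-and-reindex idea can be sketched as follows. This is an illustration under stated assumptions, not the Read the Docs implementation: the index is modeled as a dict keyed by `(project, version, path)`, and `rebuild_version_index` and the `stable` version are hypothetical.

```python
def rebuild_version_index(index, project, version, built_pages):
    """Drop every entry for (project, version), then index the new build.

    `index` maps (project, version, path) -> content (a hypothetical model).
    Entries belonging to other projects or versions are left untouched.
    """
    # Wipe: collect the keys first so the dict is not mutated while iterating.
    for key in [k for k in index if k[:2] == (project, version)]:
        del index[key]
    # Reindex: only pages present in the new build come back.
    for path, content in built_pages.items():
        index[(project, version, path)] = content
    return index


index = {
    ("gluster", "latest", "old-page.md"): "stale",
    ("gluster", "stable", "old-page.md"): "still valid for stable",
}
rebuild_version_index(index, "gluster", "latest", {"dht.md": "DHT docs"})
print(sorted(index))
# [('gluster', 'latest', 'dht.md'), ('gluster', 'stable', 'old-page.md')]
```

Because nothing from the previous build survives the wipe, no deletion detection is needed; the trade-off is that every page of that version is reindexed on each build.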

shaunix commented 8 years ago

There seems to be code to do this here:

https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/restapi/utils.py#L157

Along with a TODO that indicates it's untested. But my reading is that delete is set to False here:

https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/core/management/commands/reindex_elasticsearch.py#L50

It was set to False in this commit, without much explanation of why:

https://github.com/rtfd/readthedocs.org/commit/1d422dcf8446a4773d3a28a99409a9e71597c12b

Any tips on how I can help get this working correctly?

prashanthpai commented 8 years ago

Hi all, any update on this?

Search is very important functionality for all Gluster users. We've had users (most recently @monotek) repeatedly bring this up.

We're even contemplating converting all our docs from Markdown to .rst to get rid of MkDocs and use Sphinx, which I believe has search built in. But this conversion is a humongous task that will need manual inspection, despite the tools available for such a conversion.

RTD has been working well for the GlusterFS mini-project libgfapi-python, which uses .rst and Sphinx.

It would be really helpful to get an estimate of when the broken search can be fixed, or whether it will be fixed at all.

ericholscher commented 6 years ago

Just want to make sure that we address this issue with our implementation. I believe our prototype using Elasticsearch 6 (#4183) will fix this issue, but I want to confirm that it will, so I'm bringing it up here.

safwanrahman commented 6 years ago

I strongly bet that the search index entry gets removed automatically as soon as the file is removed. I will add a test to make sure it works perfectly.

ericholscher commented 6 years ago

This has been fixed in our new search code. It will be deployed in the next month or so, so closing this issue as it's been addressed.

safwanrahman commented 6 years ago

There was a bug in removing the index entry after a file is removed. Fixed it in #4277. Thanks @prashanthpai for filing the issue.

jcampbell commented 5 years ago

I continue to observe this behavior (search results including deleted pages) in our hosted docs. Is there something that might account for that and/or could I provide useful diagnostic information to help identify/resolve the issue?

stsewd commented 5 years ago

@jcampbell please see https://github.com/readthedocs/readthedocs.org/issues/6069

jcampbell commented 5 years ago

Thanks for flagging that, @stsewd. I have tried wiping and rebuilding, but to no effect (moved pages still show up twice in search results, with one link being broken); I will comment on that issue.