typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Latest collection was deleted after scraper completed #64

ruoqianfengshao closed this issue 3 months ago

ruoqianfengshao commented 3 months ago

Description

The newest collection was deleted after the scraper completed. The scraper log shows Nb hits: 172054.

While the scraper is running, I can see the collection, and its number of documents keeps increasing.

But after the scraper completed, the collection disappeared.

Steps to reproduce

My config is as follows:

{
    "index_name": "trantor",
    "stop_urls": [
        "https://trantor-docs.app.terminus.io/v1.x/sandbox",
        "https://trantor-docs.app.terminus.io/v0.14.x/sandbox",
        "https://trantor-docs.app.terminus.io/v0.17.x/sandbox",
        "http://trantor.static.terminus.io",
        "http://127.0.0.1:8080",
        "http://terminus-paas.oss-cn-hangzhou.aliyuncs.com",
        "http://terminus-trantor.oss-cn-hangzhou.aliyuncs.com",
        "http://mvel.documentnode.com",
        "http://www.w3.org",
        "http://trantor.terminus.io",
        "http://overwride-oss.oss-cn-hangzhou.aliyuncs.com",
        "https://trantor-community.app.terminus.io"
    ],
    "start_urls": [
        {
            "url": "https://trantor-docs.app.terminus.io/v1.x/doc"
        },
        {
            "url": "https://trantor-docs.app.terminus.io/v0.14.x/doc"
        },
        {
            "url": "https://trantor-docs.app.terminus.io/v0.17.x/doc"
        }
    ],
    "selectors": {
        "lvl0": ".ant-page-header-heading-title",
        "lvl1": ".ant-card-body h1",
        "lvl2": ".ant-card-body h2",
        "lvl3": ".ant-card-body h3",
        "lvl4": ".ant-card-body h4",
        "lvl5": ".ant-card-body h5",
        "lvl6": ".ant-card-body h6",
        "text": ".ant-card-body p, ant-card-body a, .ant-card-body li, .ant-card-body td, .ant-card-body code span, .antd-card-body pre code, .antd-card-body *"
    },
    "js_render": true,
    "js_wait": 5,
    "nb_hits": 2000000000000
}

Expected Behavior

The newest collection should still exist and be usable after the scraper completes.

Actual Behavior

The correct collection was deleted.

Metadata

Typesense Version: 0.25.2
Typesense scraper Version: 0.9.1

jasonbosco commented 3 months ago

This could happen if the alias in Typesense was somehow pointing to the new collection, before the scraper was done.

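At the end of a run, the scraper uses the alias to look up the name of the old collection, then points the alias to the new collection and deletes the old collection. If the alias was already pointing to the new collection, the collection it looks up and deletes is the new one.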
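
To make that sequence concrete, here is a minimal in-memory sketch of the alias swap described above. It is not the scraper's actual code and does not call Typesense: the function name finish_scrape, the set/dict model, and the collection names trantor_1 and trantor_2 are all hypothetical, chosen only for illustration.

```python
def finish_scrape(collections, aliases, alias_name, new_collection):
    """Model of the end-of-run steps: look up the old collection via the
    alias, repoint the alias, then delete whatever the alias pointed to."""
    old_collection = aliases.get(alias_name)   # name the alias pointed to before
    aliases[alias_name] = new_collection       # point the alias at the new collection
    if old_collection is not None:
        collections.discard(old_collection)    # delete the "old" collection
    return old_collection


# Normal run: the alias still points to the previous collection.
collections = {"trantor_1", "trantor_2"}       # hypothetical collection names
aliases = {"trantor": "trantor_1"}
finish_scrape(collections, aliases, "trantor", "trantor_2")
print(sorted(collections))                     # ['trantor_2']

# Problem case: the alias already points to the new collection.
collections = {"trantor_1", "trantor_2"}
aliases = {"trantor": "trantor_2"}
finish_scrape(collections, aliases, "trantor", "trantor_2")
print(sorted(collections))                     # ['trantor_1']
```

In the second case the freshly scraped collection is the one that gets removed, which matches the behavior reported in this issue.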

ruoqianfengshao commented 3 months ago

@jasonbosco Thanks for explaining. I deleted all aliases and ran the scraper again, and now it looks fine.
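
For anyone who wants to script the same cleanup, the following is a rough sketch of deleting all aliases. The helper name delete_all_aliases is hypothetical, and it assumes that client is a typesense.Client from the typesense Python package, whose aliases.retrieve() returns a dict with an "aliases" list; check this against the client version you have installed.

```python
def delete_all_aliases(client):
    """Delete every alias on the server and return the removed alias names.

    Assumption: `client` behaves like typesense.Client, i.e. it exposes
    client.aliases.retrieve() and client.aliases[name].delete().
    """
    removed = []
    # Fetch the full alias list first, then delete the aliases one by one.
    for alias in client.aliases.retrieve()["aliases"]:
        client.aliases[alias["name"]].delete()
        removed.append(alias["name"])
    return removed
```

Deleting an alias does not delete the collection it points to, but searches that go through the alias name will fail until the next scraper run recreates it.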