toshi-search / Toshi

A full-text search engine in Rust
MIT License

Bulk indexing giving intermittent 400s (due to 100ms timeout?) #788

Open aguynamedben opened 3 years ago

aguynamedben commented 3 years ago

Describe the bug I'm trying to use the _bulk endpoint. I read the tests in the code and understand that it wants line-by-line JSON as the request body. I got it all working, and I'm trying to index Wikipedia articles that look like this:

{"url":"https://en.wikipedia.org/wiki?curid=48688133","title":"Zhenjiang dialect","body":"\nZhenjiang dialect\n\nZhenjiang dialect is a form of Eastern Mandarin spoken in the town of Zhenjiang in Jiangsu Province. The town is situated on the south bank of the Yangtze river between Nanjing and Changzhou. It is thus at the intersection of China's Mandarin and Wu speaking regions. About 2.7 million Chinese live in the area where the Zhenjiang dialect is predominant.\nIn ancient times, Zhenjiang spoke Wu. Today, Wu is the language of nearby Changzhou, as well as Shanghai and Zhejiang Province. Mandarin speakers from the North have been immigrating to Zhenjiang since the fourth century, gradually changing the character of the local dialect. In modern times, the city speaks a dialect that is transitional between the Eastern Mandarin of Nanjing, located just west of the city, and the Taihu dialect of Wu spoken in Changzhou, which is just east of the city. Zhenjiang dialect is comprehensible to Nanjing residents, but not to Changzhou residents.\nThe issue of tones in the Zhenjiang dialect has been a topic scholarly study. Nanjing residents use the four tones of Mandarin, while Changzhou residents use seven or eight tones. According to a study by Qiu Chunan, Zhenjiang dialect has five citation tones: Tone1 (42) (a sharp fall from pitch 4 to pitch 2, or \"yinping\"), Tone2 (35) (a rising tone or \"yangping\"), Tone3 (32) (slight falling tone or \"shang\"), Tone4 (55) (high even or \"qu\"), and Tone5 (5) (checked tone or \"ru\"). Qiu's study used residents who had grown up in the Daxi Road area, where the standard form of the dialect is said to be spoken. The checked tone was a feature of Chinese spoken in the Middle Ages, but it is not part of Mandarin. Applying the theory of government phonology to the issue, Bao Zhiming noted that non-even tones become even when they appear before the high even, or 55, tone.\n\n","id":"65be5132-c479-474a-b8d6-3ed9bc1a19d9"}

I regularly (but not always) get 400s with no helpful response body. I don't see a panic or any logs in Toshi's stdout. From looking at the bulk_insert handler, the failure looks like it's probably in the index_documents call at the bottom. It seems that within index_documents there is some sort of 100ms timeout. I've seen the error happen after a slight hiccup in my script's output, so I'm wondering if a delay within Toshi is causing the 400s.

It seems that if I use a small batch size, e.g. 5 or 10 records, the timeout is less likely, but I'm trying to insert 5 million documents, so I'd like to use a batch size of 100 or 1,000 and flush at the end.
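The batching I have in mind looks roughly like this (a sketch reusing postBulkBatch from above; the batch size is arbitrary and the final flush/commit step is only a placeholder, since I haven't confirmed what the right call is):

// Sketch of the batching loop: chunk the full document set and post each
// chunk with postBulkBatch() from the sketch above.
async function indexAll(allDocs, batchSize = 1000) {
  for (let i = 0; i < allDocs.length; i += batchSize) {
    const batch = allDocs.slice(i, i + batchSize);
    await postBulkBatch(batch);
    console.log(`indexed ${Math.min(i + batchSize, allDocs.length)} of ${allDocs.length}`);
  }
  // TODO: flush/commit the index once at the end.
}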

Any ideas?

Thanks for sharing this project, it's really, really cool!

To Reproduce Steps to reproduce the behavior:

  1. Bulk insert records quickly.
  2. Watch for error.

Expected behavior 201 Created


aguynamedben commented 3 years ago

As a workaround, I'm trying the PUT /indexname endpoint with committing turned off, and each request is taking surprisingly long to complete.

My index schema (PUT /indexname/_create body)

[
  {
    "name": "id",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "tokenizer": "default"
      },
      "stored": true
    }
  },
  {
    "name": "title",
    "type": "text",
    "options": {
      "indexed": true,
      "stored": true
    }
  },
  {
    "name": "body",
    "type": "text",
    "options": {
      "indexed": true,
      "stored": true
    }
  },
  {
    "name": "url",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "tokenizer": "default"
      },
      "stored": true
    }
  }
]

Example records

(I removed body to see if that was slowing it down, but that didn't really make it go any faster.)

[screenshot: example records]

Surprising insertion times

[screenshot: per-request insertion times]

Snippet of JavaScript API calls

In this example, I've set parallelism to 1, so there's only one PUT /indexname in flight at a time, with commit: false.

// Assumes these packages are installed: axios for HTTP, p-map for
// limited-concurrency mapping, and present for high-resolution timing.
const axios = require('axios');
const pMap = require('p-map');
const present = require('present');

// Posts a single record to the index with commit: false, timing the request.
async function postRecord(record) {
  // Drop the large body field to see whether it affects request time.
  const bodyRemoved = { ...record };
  delete bodyRemoved.body;
  try {
    const t1 = present();
    const response = await axios.put(
      `http://localhost:8080/wikipedia`,
      JSON.stringify({
        options: { commit: false },
        document: bodyRemoved,
      }),
      { headers: { 'Content-Type': 'application/json' } },
    );
    const t2 = present();
    console.log(`done posting record ${record.id} in ${Math.round(t2 - t1)}ms`);
  } catch (error) {
    if (error.response) {
      // Log the status and response body (axios exposes it as `data`) before rethrowing.
      console.warn(error.response.status);
      console.warn(error.response.data);
      throw error;
    } else {
      throw error;
    }
  }
}

// Posts a batch of records one at a time (concurrency limited to 1).
async function postBatch(batch) {
  await pMap(batch, postRecord, { concurrency: 1 });
}
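
For completeness, the batches are driven roughly like this (simplified; the batch size here is arbitrary):

// Simplified driver: split the records into batches and post them sequentially.
async function run(records, batchSize = 100) {
  for (let i = 0; i < records.length; i += batchSize) {
    await postBatch(records.slice(i, i + batchSize));
  }
}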

CPU and Memory are bored during this slowness

[screenshots: CPU and memory usage during indexing]

Any ideas? I'm happy to help debug if you think it's a bug.

hntd187 commented 3 years ago

I appreciate the very detailed report. I'm almost certain this has to do with how I originally wrote the bulk insert: I struggled with how to bubble errors back up to the top and return them with the response without leaving a ton of open channels and threads floating around from a bad request. I'm almost certain an error is happening during the channel passing.

I noticed immediately that there are unescaped newlines in your JSON. Can you humor me and try escaping the input so I can rule that out?
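Something along these lines (untested) would strip the literal newlines before serialization, which should be enough to rule that out:

// Untested sketch: make sure no literal newline characters survive into the
// serialized document, so each record sits on a single line of the bulk body.
function stripNewlines(record) {
  return {
    ...record,
    body: typeof record.body === 'string' ? record.body.replace(/\n/g, ' ') : record.body,
  };
}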

As for the slow post times, those definitely seem wrong too. Is the index particularly large when you commit? Or is this an empty index?

aguynamedben commented 3 years ago

I don't have time to keep tinkering with this right now. I was using spare time on a Friday night just to check it out.

The index was empty, and it got locked up pretty soon after I started bulk inserting. Based on what you're saying, it seems like a bug with bulk insert.

I'll let you know if I get around to poking at this again.