[Open] aguynamedben opened this issue 3 years ago
As a workaround, I'm trying the PUT /indexname endpoint with committing turned off, and I'm seeing surprisingly long completion times for each request.
```
PUT /indexname/_create

[
  {
    "name": "id",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "tokenizer": "default"
      },
      "stored": true
    }
  },
  {
    "name": "title",
    "type": "text",
    "options": {
      "indexed": true,
      "stored": true
    }
  },
  {
    "name": "body",
    "type": "text",
    "options": {
      "indexed": true,
      "stored": true
    }
  },
  {
    "name": "url",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "tokenizer": "default"
      },
      "stored": true
    }
  }
]
```
(I removed body to see if that was slowing it down, but that didn't really make it go any faster)
In this example, I've set parallelism to 1, so there's only one PUT /indexname request in flight at a time, with `commit: false`.
```js
// Assumed imports for this snippet: axios for HTTP, present for
// high-resolution timing, and p-map for bounded-concurrency mapping.
const axios = require('axios');
const present = require('present');
const pMap = require('p-map');

async function postRecord(record) {
  // Drop the large `body` field to test whether payload size is the bottleneck.
  const bodyRemoved = { ...record };
  delete bodyRemoved.body;
  try {
    const t1 = present();
    await axios.put(
      `http://localhost:8080/wikipedia`,
      JSON.stringify({
        options: { commit: false },
        document: bodyRemoved,
      }),
      { headers: { 'Content-Type': 'application/json' } },
    );
    const t2 = present();
    console.log(`done posting record ${record.id} in ${Math.round(t2 - t1)}ms`);
  } catch (error) {
    if (error.response) {
      // axios exposes the response payload as `data`, not `body`.
      console.warn(error.response.status);
      console.warn(error.response.data);
    }
    throw error;
  }
}

async function postBatch(batch) {
  await pMap(batch, postRecord, { concurrency: 1 });
}
```
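For illustration, a minimal way to invoke `postBatch` (the sample record is hypothetical, shaped after the schema above):

```js
// Hypothetical sample record matching the id/title/body/url schema.
const sample = [
  { id: '1', title: 'Anarchism', body: 'article text', url: 'https://en.wikipedia.org/wiki/Anarchism' },
];

postBatch(sample).catch((err) => console.error(err));
```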
Any ideas? I'm happy to help debug if you think it's a bug.
I appreciate the very detailed report. I'm almost certain this has to do with the bulk insert: when I originally wrote it, I struggled with how to bubble errors back up to the top and return them for the request without leaving a ton of open channels and threads floating around from a bad request. I'm almost certain an error is happening during the channel passing.
I noticed immediately that there are unescaped newlines in your JSON. Can you humor me and try escaping the input so I can rule that out?
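For what it's worth, one way to guarantee escaped newlines is to build each line with `JSON.stringify`, which escapes control characters inside string values, and only then join with literal newline separators; a minimal sketch:

```js
// JSON.stringify escapes any raw newlines inside string values as "\n",
// so each serialized document is guaranteed to occupy exactly one line.
function toNdjson(records) {
  return records.map((record) => JSON.stringify(record)).join('\n');
}
```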
As for the slow post times, those definitely seem wrong too. Is the index particularly large when you commit? Or is this an empty index?
I don't have time to keep tinkering with this right now. I was using spare time on a Friday night just to check it out.
The index was empty, and it got locked up pretty soon after I started bulk inserting. Based on what you're saying, it seems like a bug with bulk insert.
I'll let you know if I get around to poking at this again.
Describe the bug
I'm trying to use the _bulk endpoint. I read the tests in the code and understand that it wants line-by-line JSON as the request body. I got it all working, and I'm trying to index Wikipedia articles that look like this:
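(A record shaped after the schema shown earlier in this thread would look roughly like the following; the values here are illustrative placeholders, not the original example.)

```json
{"id": "12", "title": "Anarchism", "body": "Anarchism is a political philosophy...", "url": "https://en.wikipedia.org/wiki/Anarchism"}
```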
I regularly (but not always) get 400s with no helpful response body. I don't see a panic or any logs in Toshi's stdout. From looking at the bulk_insert handler, the failure probably traces to the index_documents call at the bottom. It seems that within index_documents there is some sort of 100ms timeout. I've seen the error happen after a slight hiccup in my script's output, so I'm wondering if a delay within Toshi is causing the 400s.
It seems that with a small batch size, e.g. 5 or 10 records, the timeout is less likely, but I'm trying to insert 5M documents, so I'd like to use a batch size of 100 or 1,000 and flush at the end.
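A sketch of that batching, reusing the `toNdjson` helper from above and assuming the bulk route is a POST to /wikipedia/_bulk (the exact path and headers are assumptions, not confirmed against Toshi's routes):

```js
// Hypothetical batching sketch: send `records` as newline-delimited JSON
// in fixed-size batches. The /wikipedia/_bulk path is an assumption based
// on the _bulk endpoint named above.
async function bulkInsert(records, batchSize = 1000) {
  for (let i = 0; i < records.length; i += batchSize) {
    const payload = toNdjson(records.slice(i, i + batchSize));
    await axios.post('http://localhost:8080/wikipedia/_bulk', payload, {
      headers: { 'Content-Type': 'application/json' },
    });
  }
}
```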
Any ideas?
Thanks for sharing this project; it's really, really cool!
Expected behavior
201 Created