jprante / elasticsearch-knapsack

Knapsack plugin is an import/export tool for Elasticsearch
Apache License 2.0

RoutingMissingException for import of child documents #58

Open asanderson opened 9 years ago

asanderson commented 9 years ago

Importing child documents results in an org.elasticsearch.action.RoutingMissingException, since the parent id is required for POSTing child documents. Perhaps you could use some directory convention on export whereby all child records are written to the directory of their parent or something like that.
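For background on why the exception occurs: Elasticsearch routes a document to a shard by its routing value, and a child document must live on the same shard as its parent, so the index request for a child has to carry the parent id (via the `parent` URL parameter, which also serves as the routing value). A minimal sketch of the two request shapes; the index/type/id values are hypothetical:

```python
# Sketch: why indexing a child document needs the parent id.
# Elasticsearch routes a document to a shard by its routing value
# (the _id by default); a child must land on its parent's shard,
# so the parent id is passed via the "parent" URL parameter.

def index_url(index, doc_type, doc_id, parent_id=None):
    """Build the REST path for an index request (names illustrative)."""
    url = "/%s/%s/%s" % (index, doc_type, doc_id)
    if parent_id is not None:
        url += "?parent=%s" % parent_id
    return url

# A parent needs no extra routing:
#   index_url("myindex", "parenttype", "1111111111")
# A child indexed without parent_id is what triggers
# RoutingMissingException on import:
#   index_url("myindex", "childtype", "9999999999", parent_id="1111111111")
```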

jprante commented 9 years ago

Import/export of parent/child is a bit complex. First, the parents have to be exported and imported, then the children. A mixed import/export will lead to the missing routings.

asanderson commented 9 years ago

Actually, you can import the child docs before the parent docs (or without them entirely) and ES will not throw an exception.

Regardless, our ingest processes use a directory convention whereby we create the parent dir using the parent id and then create both the parent and child docs in that dir, so that we can index the parent first and then its children. You could do something similar.


agonen commented 8 years ago

I'm also getting this problem, so can I assume this issue is still open?

jprante commented 8 years ago

I do not fully understand the issue. Maybe because I do not use parent/child.

The import/export of parent/child is complex and requires two exports and imports.

  1. Export of parents
  2. Import of parents
  3. Export of children
  4. Import of children

I cannot see where I can use a "directory convention" or how this would differ from the current export/import. Maybe I'm missing something completely.

asanderson commented 8 years ago

The solution is simple. It comes down to a standard file naming convention for child records that indicates the parent's _id, which is required for the child ingest request. E.g. we use the convention that child records are stored in a subdirectory of the parent's directory, so the parent _id can be determined from the name of that directory. However, you could also use some other child file naming convention, like prefixing the child file name with the parent _id.

marbleman commented 8 years ago

Having exactly the same issue atm.

@jprante: I'll give your suggestion of multiple steps a shot, but my concern is: I cannot see the _parent field in my archived data, and I am afraid that without _parent referencing the _id of our parent, we cannot index any child document.

asanderson commented 8 years ago

e.g. parent doc _id = 1111111111 with child doc _id = 9999999999, and parent doc _id = 2222222222 with no child docs.

So, you could export them to something like the following directory structure:

myindex/
     1111111111/
           9999999999.json
     1111111111.json
     2222222222.json

So, when you go to index the child records, you can get the _id of the parent from the child's file path.

Alternatively, if you don't want to use subdirectories, you could use a file naming convention with a delimiter between the parent and child ids:

myindex/
     1111111111.json
     1111111111_9999999999.json
     2222222222.json
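Both conventions make the parent _id recoverable from the child file's path alone. A stdlib-only sketch of that lookup (the `myindex` root and the `_` delimiter are taken from the examples above; everything else is illustrative):

```python
import os

def parent_id_from_path(path, root="myindex"):
    """Recover the parent _id from a child file path under `root`,
    for either convention above: a parent-named subdirectory, or a
    '<parent>_<child>.json' file name. Returns None for parent docs."""
    rel = os.path.relpath(path, root)
    parts = rel.split(os.sep)
    stem = parts[-1][:-len(".json")]
    if len(parts) == 2:
        return parts[0]               # subdirectory convention
    if "_" in stem:
        return stem.split("_", 1)[0]  # delimiter convention
    return None                       # a parent document
```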

marbleman commented 8 years ago

@asanderson: in our case the children are located in different types (one type containing the parent and multiple types containing children)

However, your idea could still work if we can ensure that any types without a parent mapping get restored first.

@jprante: your suggestion of using multiple steps seems not to work out, because the _parent field is located at the same level as _id and is not part of the archived child type (at least not in mine).

asanderson commented 8 years ago

I'm not sure what you mean. Are you saying, you cannot determine the _parent field for a child document during an export?

asanderson commented 8 years ago

Well, actually my example above is an over-simplification. We do support multiple types, e.g.

myindex1/
     mytype1/
          1111111111/
                9999999999.json
          1111111111.json
          2222222222.json

And, we index all the parents first and then all the children; i.e. breadth-first traversal.
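The breadth-first traversal can be sketched as a plain directory walk that yields all files at one depth before descending, so parents are always indexed before their children (stdlib only; the directory layout is the one from the example above):

```python
import os

def ingest_order(root):
    """Yield .json file paths so that every parent document comes
    before any of its children: a breadth-first walk that emits all
    files at one level before descending into subdirectories."""
    queue = [root]
    while queue:
        directory = queue.pop(0)
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if os.path.isdir(full):
                queue.append(full)   # children are visited later
            elif name.endswith(".json"):
                yield full
```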

marbleman commented 8 years ago

Seems I am missing the point... How do you do that? From what I can see, knapsack currently ignores the _parent field.

asanderson commented 8 years ago

@marbleman Well, that's sort of the point of this issue; i.e. fully support export/import of parent & child documents.

marbleman commented 8 years ago

Ah, ok... it sounded to me as if you had already solved this issue.

To sum up: a structure like

myindex1/
     mytype1/
          1111111111/
                9999999999.json
          1111111111.json
          2222222222.json

where 1111111111 is the value of _parent of document 9999999999 will do the trick during export. During import, types without a _routing and/or _parent definition in their mapping must be imported first. For any child type, _parent can then be retrieved from the folder name.

Does not sound too complex, but I am not a Java developer... To counterbalance, I could donate a box of good wine ;)
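The import-order rule described above (types without a _parent and/or _routing mapping first) can be computed directly from the index mappings. A hedged sketch, assuming `mappings` has the general shape returned by the get-mapping API (type name mapped to its mapping dict):

```python
def import_order(mappings):
    """Order type names so that types without a _parent/_routing
    declaration come first (parents before children). The shape of
    `mappings` (type name -> mapping dict) is an assumption here."""
    def has_parent(mapping):
        return "_parent" in mapping or "_routing" in mapping
    # sorted() is stable, so ties keep their original order;
    # False (no _parent) sorts before True (_parent declared).
    return sorted(mappings, key=lambda t: has_parent(mappings[t]))
```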

asanderson commented 8 years ago

Just to clarify, this is the scheme our bulk ingest uses for our system; i.e. our various data sources are extracted/transformed/loaded (ETL) into .json files in this directory structure for our bulk ingest process to read them and send them to Elasticsearch. However, it would be great to have knapsack implement a similar scheme to support parent/child docs.

jprante commented 8 years ago

Seems I have to document better how to export/import meta fields like _parent besides _source. It's related to the syntax of an ES query.
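For reference, with the ES 1.x search syntax this amounts to asking the export for the meta fields explicitly in the query body, along these lines (the exact knapsack endpoint and accepted body are an assumption here and should be checked against the plugin README for your version):

```json
{
  "query": { "match_all": {} },
  "fields": [ "_source", "_parent", "_routing" ]
}
```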

jperkelens commented 8 years ago

Hi all, is this supported in 2.1.1.0? AFAIK I'm following the necessary steps to import child documents (e.g. the generated export file has _source, _parent, and _routing entries), but the import is failing with the following:

index [myindex], type [mytype], id [3059310], message [[myindex] RoutingMissingException[routing is required for [myindex]/[mytype]/[3059310]]]

marbleman commented 8 years ago

Don't miss the steps to EXPORT the child documents correctly! If you don't add the _parent field to the export with the described query, you cannot import it. If you inspect the archive, you should see a _parent folder next to the _source folder containing the data for each document.

jperkelens commented 8 years ago

What I'm seeing is a _parent, _source, and _routing file, rather than a folder, inside each document folder. Again, I'm using v2.1.1.0, in case it matters. When I run an ls under the index and type, the results are the following:

index/type/id1/_source
index/type/id1/_parent
index/type/id1/_routing
index/type/id2/_source
index/type/id2/_parent
index/type/id2/_routing
etc...

and if I cat the files, the _source file has the document JSON, and the _parent and _routing files simply have the parent id in them.
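Given that per-document layout, an import step can rebuild routing-aware requests by reading the meta files that sit next to each _source. A stdlib-only sketch (the action-dict shape is illustrative, not knapsack's internal format):

```python
import os

def load_doc(doc_dir):
    """Read one exported document directory (containing _source and,
    for children, _parent/_routing files) into an action dict."""
    def read(name):
        path = os.path.join(doc_dir, name)
        if os.path.exists(path):
            with open(path) as f:
                return f.read().strip()
        return None

    index, doc_type, doc_id = doc_dir.rstrip("/").split(os.sep)[-3:]
    action = {"_index": index, "_type": doc_type, "_id": doc_id,
              "_source": read("_source")}
    parent = read("_parent")
    if parent is not None:
        action["_parent"] = parent
        # Routing falls back to the parent id when no explicit
        # _routing value was exported.
        action["_routing"] = read("_routing") or parent
    return action
```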

marbleman commented 8 years ago

That seems to be correct. Make sure you import the parent type first.

jperkelens commented 8 years ago

I have imported the parent type and have no problem indexing child documents manually; however, the import continues to fail.