ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Using pipelines #226

Closed Jensxy closed 5 years ago

Jensxy commented 5 years ago

I have created a pipeline within R. Now I want to use the pipeline to index documents with additional fields. My pipeline looks like this:

body <- '{
  "description" : "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}'
pipeline_create(id = "attachment", body = body)

My problem is that I want to index documents with attachments (emails).

So my two questions are:

  1. How do I use the pipeline to index documents with additional fields like sender, receiver, etc.?
  2. Once I have created a pipeline for an array of attachments, how do I use it to index a document that contains such an array?

Version: elastic: elastic_0.8.4.9410

{
  "name" : "74Fu38x",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "lKC9cNz8TEqDGMWUUVzweA",
  "version" : {
    "number" : "6.2.2",
    "build_hash" : "10b1edd",
    "build_date" : "2018-02-16T19:01:30.685723Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

sckott commented 5 years ago

thanks for the issue @Jensxy

will have a look and get back to you soon.

sckott commented 5 years ago

  1. Have you seen these docs? https://www.elastic.co/guide/en/elasticsearch/plugins/master/using-ingest-attachment.html It seems like you need the properties field to define further fields to index.
  2. Have you seen these docs? https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment-with-arrays.html It seems like you need to use the foreach processor.

Jensxy commented 5 years ago

Yes, I have seen these docs, but how do I apply these things within R using the elastic package? That is my problem.

sckott commented 5 years ago

I think you have to define those things in the body of the request, passed to the body parameter.

e.g.

{
  "description": "do a thing",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data",
            "properties": ["title", "name", "author"]
          }
        }
      }
    }
  ]
}

That's not tested, just guessing at what you'd need. AFAICT there aren't any changes needed in this package; rather, you can do this through your request body.
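
For reference, a minimal sketch of how a body like the one above might be prepared in R before handing it to pipeline_create(). It only checks locally, via jsonlite, that the string is well-formed JSON. The pipeline id "attachment_array" is a hypothetical name, and the pipeline_create() call is left commented out because it needs a running cluster with the ingest-attachment plugin installed; its call form is copied from the original post and may differ in other versions of the package.

```r
library(jsonlite)

# Pipeline definition: apply the attachment processor to each element
# of the "attachments" array (field names follow the example above).
pipeline_body <- '{
  "description": "Extract attachment information from an array of attachments",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}'

# Check locally that the string is well-formed JSON before sending it.
stopifnot(isTRUE(jsonlite::validate(pipeline_body)))

# Needs a live Elasticsearch connection, so it is not run here.
# "attachment_array" is a hypothetical pipeline id.
# pipeline_create(id = "attachment_array", body = pipeline_body)
```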

Jensxy commented 5 years ago

Okay, I will try it. Thank you very much.

sckott commented 5 years ago

@Jensxy let me know if you get it to work

Jensxy commented 5 years ago

I've just created a foreach pipeline and then I used

es_PUT(file.path(url = make_url(es_get_auth()),
                     index, "doc_1?pipeline=attachment"),
           body = body, config = es_cfg)

And my body looks like this

{"name": "test_name",
 "place": "test_place",
 "attachments" : [
    {"filename": "test_filename1",
     "data" : "test_data1"},
    {"filename": "test_filename2",
     "data" : "test_data2"}]}

Then everything works fine :)
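
A rough sketch of how a document body like the one above could be built from R objects rather than written by hand. The pipeline description in the first post says the attachment data is Base64-encoded, so real payloads would need encoding; the strings "test_data1" and "test_data2" above are placeholders. The helper encode_attachment() and the sample texts are hypothetical, and jsonlite is assumed to be installed.

```r
library(jsonlite)

# Hypothetical helper: pair a file name with its Base64-encoded content.
encode_attachment <- function(filename, text) {
  list(
    filename = filename,
    data = jsonlite::base64_enc(charToRaw(text))
  )
}

# Extra fields (name, place) sit next to the attachments array.
doc <- list(
  name = "test_name",
  place = "test_place",
  attachments = list(
    encode_attachment("test_filename1", "first attachment"),
    encode_attachment("test_filename2", "second attachment")
  )
)

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays.
doc_body <- jsonlite::toJSON(doc, auto_unbox = TRUE)
```

The resulting doc_body could then be sent in the same way as the body in the request above, with the pipeline named in the ?pipeline= query parameter.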

sckott commented 5 years ago

great! i'll see if I should add anything to docs to show an example

sckott commented 5 years ago

@Jensxy see the new function pipeline_attachment() and its examples