Closed vanderstaaij closed 10 years ago
If I understood correctly, your search for "ambiguous" in the imapriverdata index succeeded (which means attachment data is searched correctly), but the hits.hits result document contains base64-encoded data? Or does your search for "ambiguous" not work at all?
Can you try to omit "type" : "nested" for the attachments mapping to see if this is the cause? Please also have a look here: https://github.com/salyh/elasticsearch-river-imap/issues/10#issuecomment-50125929
If neither of the above works, please post (maybe at http://pastebin.com) the output of
http://elasticsearchserver:9200/test/_mapping?pretty=true
and
http://elasticsearchserver:9200/imapriverdata/_mapping?pretty=true
because the only problem I can think of (currently) is an incorrect mapping.
Yes, the search for "ambiguous" in the imapriverdata index succeeds and it shows base64 data instead of the text.
As per your suggestion I removed the "nested" type for attachments, so my river configuration now looks like this:
{
"type":"imap",
"mail.store.protocol":"imap",
"mail.imap.host":"imap.gmail.com",
"mail.imap.port":993,
"mail.imap.ssl.enable":true,
"mail.imap.connectionpoolsize":"3",
"mail.debug":"false",
"mail.imap.timeout":10000,
"user":"***PRIVATE***",
"password":"***PRIVATE***",
"schedule":null,
"interval":"60s",
"threads":5,
"folderpattern":"^INBOX$",
"bulk_size":100,
"max_bulk_requests":"30",
"bulk_flush_interval":"5s",
"mail_index_name":"imapriverdata",
"mail_type_name":"mail",
"with_striptags_from_textcontent":true,
"with_attachments":true,
"with_text_content":true,
"with_flag_sync":false,
"index_settings" : {
"index": {
"analysis": {
"analyzer": {
"email_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "uax_url_email"
},
"fulltext_analyzer_icu": {
"type": "custom",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"type_as_payload"
],
"tokenizer": "icu_tokenizer"
}
}
},
"index": {
"number_of_replicas": "1",
"number_of_shards": "5"
}
}
},
"type_mapping" : {
"mail": {
"dynamic" : "strict",
"properties": {
"attachmentCount": {
"type": "long"
},
"attachments": {
"properties": {
"content": {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"name" : {"store" : "yes"},
"title" : {"store" : "yes"},
"date" : {"store" : "yes"},
"author" : {"analyzer" : "fulltext_analyzer_icu"},
"keywords" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"},
"language" : {"store" : "yes"}
}
},
"contentType": {
"type": "string"
},
"filename": {
"type": "string"
},
"name": {
"type": "string"
},
"size": {
"type": "long"
}
}
},
"contentType": {
"type": "string"
},
"flaghashcode": {
"type": "integer"
},
"flags": {
"type": "string"
},
"folderFullName": {
"type": "string",
"index": "not_analyzed"
},
"folderUri": {
"type": "string"
},
"from": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"headers": {
"type" : "nested",
"properties": {
"name": {
"type": "string"
},
"value": {
"type": "string"
}
}
},
"mailboxType": {
"type": "string"
},
"receivedDate": {
"type": "date",
"format": "basic_date_time"
},
"sentDate": {
"type": "date",
"format": "basic_date_time"
},
"size": {
"type": "long"
},
"subject": {
"type": "string"
},
"textContent": {
"type": "string"
},
"to": {
"type" : "nested",
"properties": {
"email": {
"type": "string",
"index_analyzer": "email_analyzer"
},
"personal": {
"type": "string"
}
}
},
"uid": {
"type": "long"
}
}
}
}
}
I've dropped the imapriverdata index, which got rebuilt by the imap-river, and it downloaded the mail including the attachment. Next I am able to search (I had to change the search query as well, from a nested query to a normal one):
http://localhost:9200/imapriverdata/mail/_search:
{
"fields": [ "attachments.content" ],
"query" : {
"match" : {
"attachments.content" : "ambiguous"
}
}
}
The result is the same as with the issue I described earlier, i.e. the base64 content is shown and not the text contents of the file.
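As a rough sketch of the change from a nested query to a normal one, the two request bodies can be built side by side. The helper name build_attachment_query and its parameters are hypothetical, and the nested form is only the generic Elasticsearch nested query shape, not necessarily the exact query used earlier in this thread.

```python
import json


def build_attachment_query(term, nested=False,
                           path="attachments",
                           field="attachments.content"):
    """Build a search body for a term in the attachment content field.

    Hypothetical helper: 'path' and 'field' default to the names used in
    this thread's mapping.
    """
    match = {"match": {field: term}}
    if nested:
        # Needed only when "attachments" is mapped with "type": "nested".
        query = {"nested": {"path": path, "query": match}}
    else:
        # Plain object mapping: a normal match query is enough.
        query = match
    return {"fields": [field], "query": query}


if __name__ == "__main__":
    print(json.dumps(build_attachment_query("ambiguous"), indent=2))
    print(json.dumps(build_attachment_query("ambiguous", nested=True), indent=2))
```

The non-nested output matches the payload posted above; which form is required depends on whether "type" : "nested" is present in the attachments mapping.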
Here is the output of http://localhost:9200/imapriverdata/_mapping:
{
"imapriverdata": {
"mappings": {
"imapriverstate": {
"properties": {
"exists": {
"type": "boolean"
},
"folderUrl": {
"type": "string"
},
"lastCount": {
"type": "long"
},
"lastIndexed": {
"type": "long"
},
"lastSchedule": {
"type": "long"
},
"lastTook": {
"type": "long"
},
"lastUid": {
"type": "long"
},
"uidValidity": {
"type": "long"
}
}
},
"mail": {
"dynamic": "strict",
"properties": {
"attachmentCount": {
"type": "long"
},
"attachments": {
"properties": {
"content": {
"type": "attachment",
"path": "full",
"fields": {
"content": {
"type": "string"
},
"author": {
"type": "string",
"analyzer": "fulltext_analyzer_icu"
},
"title": {
"type": "string",
"store": true
},
"name": {
"type": "string",
"store": true
},
"date": {
"type": "date",
"store": true,
"format": "dateOptionalTime"
},
"keywords": {
"type": "string",
"store": true
},
"content_type": {
"type": "string",
"store": true
},
"content_length": {
"type": "integer",
"store": true
},
"language": {
"type": "string",
"store": true
}
}
},
"contentType": {
"type": "string"
},
"filename": {
"type": "string"
},
"name": {
"type": "string"
},
"size": {
"type": "long"
}
}
},
"contentType": {
"type": "string"
},
"flaghashcode": {
"type": "integer"
},
"flags": {
"type": "string"
},
"folderFullName": {
"type": "string",
"index": "not_analyzed"
},
"folderUri": {
"type": "string"
},
"from": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"headers": {
"type": "nested",
"properties": {
"name": {
"type": "string"
},
"value": {
"type": "string"
}
}
},
"mailboxType": {
"type": "string"
},
"receivedDate": {
"type": "date",
"format": "basic_date_time"
},
"sentDate": {
"type": "date",
"format": "basic_date_time"
},
"size": {
"type": "long"
},
"subject": {
"type": "string"
},
"textContent": {
"type": "string"
},
"to": {
"type": "nested",
"properties": {
"email": {
"type": "string",
"index_analyzer": "email_analyzer"
},
"personal": {
"type": "string"
}
}
},
"uid": {
"type": "long"
}
}
}
}
}
}
This is the mapping output of my test:
{
"test": {
"mappings": {
"mapper": {
"dynamic": "strict",
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
},
"author": {
"type": "string",
"analyzer": "fulltext_analyzer_icu"
},
"title": {
"type": "string",
"store": true
},
"name": {
"type": "string",
"store": true
},
"date": {
"type": "date",
"store": true,
"format": "dateOptionalTime"
},
"keywords": {
"type": "string",
"store": true
},
"content_type": {
"type": "string",
"store": true
},
"content_length": {
"type": "integer",
"store": true
},
"language": {
"type": "string",
"store": true
}
}
}
}
}
}
}
}
Apologies - pressed the wrong button to comment ;)
Can you try to change
"file" : { "type" : "string", "store" : true, "term_vector" : "with_positions_offsets" },
to
"content" : { "type" : "string", "store" : true, "term_vector" : "with_positions_offsets" },
in
"type_mapping" : { "mail": { "dynamic" : "strict", "properties": { "attachmentCount": { "type": "long" }, "attachments": { "properties": { "content": { "type" : "attachment", "path" : "full", "fields" : { "file" : { "type" : "string", "store" : true, "term_vector" : "with_positions_offsets" }, "name" : {"store" : "yes"}, "title" : {"store" : "yes"}, "date" : {"store" : "yes"}, "author" : {"analyzer" : "fulltext_analyzer_icu"}, "keywords" : {"store" : "yes"}, "content_type" : {"store" : "yes"}, "content_length" : {"store" : "yes"}, "language" : {"store" : "yes"} } },
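To make the suggested rename easier to check, here is a minimal sketch that walks a mapping's "properties" and reports, for every field of type "attachment", whether it has a sub-field with the same name as the field itself. It assumes, as the suggestion above implies, that the sub-field carrying the extracted text should be named after the attachment field ("file" in the test index, "content" in the river index). The function name is hypothetical.

```python
def attachment_subfield_report(properties, prefix=""):
    """Return (path, has_same_named_subfield, subfield_names) per attachment field."""
    report = []
    for name, spec in properties.items():
        if not isinstance(spec, dict):
            # Skip non-mapping entries such as "_indexed_chars": -1.
            continue
        path = prefix + name
        if spec.get("type") == "attachment":
            subfields = sorted(spec.get("fields", {}))
            report.append((path, name in subfields, subfields))
        if "properties" in spec:
            # Recurse into object fields such as "attachments".
            report.extend(
                attachment_subfield_report(spec["properties"], path + ".")
            )
    return report


if __name__ == "__main__":
    # Trimmed-down versions of the two mappings discussed in this thread.
    river = {"attachments": {"properties": {"content": {
        "type": "attachment", "fields": {"file": {}, "name": {}}}}}}
    test = {"file": {"type": "attachment", "fields": {"file": {}, "title": {}}}}
    print(attachment_subfield_report(river))
    print(attachment_subfield_report(test))
```

With the river mapping as originally posted, the report flags attachments.content as lacking a same-named sub-field, while the test mapping passes.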
Done that. I deleted the imapriverdata index and re-downloaded the mail, at which point the imapriverdata index got recreated. After this change I no longer get any results when searching the content field for the word "ambiguous" as described earlier in this issue.
When searching for all entries in the index, I do see this (including base64 content again):
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "imapriverdata",
"_type": "mail",
"_id": "4::imap://***PRIVATE***%40gmail.com@imap.gmail.com/INBOX",
"_score": 1,
"_source": {
"attachmentCount": 1,
"attachments": [
{
"content": "UEsDBBQABgAIAAAAIQDtO5Wb8wEAAFoKAAATAAgCW ...(shortened)... AAAAAAAAAAAAAER3AAB3b3JkL3dlYlNldHRpbmdzLnhtbFBLAQItABQABgAIAAAAIQAQKFQ4agEAAJICAAARAAAAAAAAAAAAAAAAAER5AABkb2NQcm9wcy9jb3JlLnhtbFBLBQYAAAAAHwAfAB8IAADlewAAAAA=",
"contentType": "APPLICATION/VND.OPENXMLFORMATS-OFFICEDOCUMENT.WORDPROCESSINGML.DOCUMENT; \r\n\tname=test.DOCX",
"size": 33818,
"name": "test.DOCX",
"filename": "test.DOCX"
}
],
"bcc": null,
"cc": null,
"contentType": "multipart/MIXED; boundary=f46d043892c58e03ba0501986c23",
"flaghashcode": 0,
"flags": [],
"folderFullName": "INBOX",
"folderUri": "imap://***PRIVATE***%40gmail.com@imap.gmail.com/INBOX",
"from": {
"email": "***PRIVATE***@gmail.com",
"personal": "***PRIVATE***"
},
"headers": [
{
"name": "Subject",
"value": "Test file"
},
{
"name": "Return-Path",
"value": "<***PRIVATE***@gmail.com>"
},
{
"name": "To",
"value": "***PRIVATE***@gmail.com"
},
{
"name": "MIME-Version",
"value": "1.0"
},
{
"name": "Message-ID",
"value": "<CAJbSFbaaRnaZm2jH+ph7Oi2cN9=5=rWaOAU5RNBwdidoz8-6DA@mail.gmail.com>"
},
{
"name": "Authentication-Results",
"value": "mr.google.com; spf=pass (google.com: domain of\r\n ***PRIVATE***@gmail.com designates 10.180.92.73 as permitted sender)\r\n smtp.mail=***PRIVATE***@gmail.com; dkim=pass header.i=@gmail.com"
},
{
"name": "Delivered-To",
"value": "***PRIVATE***@gmail.com"
},
{
"name": "DKIM-Signature",
"value": "v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;\r\n h=mime-version:date:message-id:subject:from:to:content-type;\r\n bh=eWDEX8mCmFKppDgjUGRdYKBEf5QGp2VY2Llo+LagvOc=;\r\n b=mbD94FIzWEW3Mg64FA2lBr7LAwZbN/hNFBW/CyT7J3I7X92ItfmK/4GZhNngfkmbak\r\n E9Fdvy9GJYikgfr4gf0eVUkQ6IvkFfO3H5KP30NE6WxM4IVU+UzsM6z0hQ79Baru2ikc\r\n Duat1mmz3GrsKOWlGRVKqIdtHGCisueGsWbvrmUgsxCq5794mitL8rqIEgtecJA6nLtQ\r\n 1u8FIceR+KXba7AO3erYJLJgzGIXGnThrcnVbUN2ZotfThxjmGUaxShmKg92WK2Rlc4V\r\n Qejziug71lK8atVymVpKiTMSzP/bNurmpUirRQGuu9cjGxvXuPoduoLzOYpEW0hOZuyr RuMQ=="
},
{
"name": "Date",
"value": "Wed, 27 Aug 2014 10:44:11 +0200"
},
{
"name": "Received-SPF",
"value": "pass (google.com: domain of ***PRIVATE***@gmail.com\r\n designates 10.180.92.73 as permitted sender) client-ip=10.180.92.73"
},
{
"name": "Received",
"value": "by 10.96.92.227 with SMTP id cp3csp449875qdb; Wed, 27 Aug 2014\r\n 01:44:11 -0700 (PDT)"
},
{
"name": "Content-Type",
"value": "multipart/mixed; boundary=f46d043892c58e03ba0501986c23"
},
{
"name": "X-Received",
"value": "from mr.google.com ([10.180.92.73]) by 10.180.92.73 with SMTP id\r\n ck9mr27229983wib.54.1409129051462 (num_hops = 1); Wed, 27 Aug 2014 01:44:11\r\n -0700 (PDT)"
},
{
"name": "From",
"value": "***PRIVATE*** <***PRIVATE***@gmail.com>"
}
],
"mailboxType": "IMAP",
"popId": null,
"receivedDate": 1409129051000,
"sentDate": 1409129051000,
"size": 48808,
"subject": "Test file",
"textContent": "\r\n",
"to": [
{
"email": "***PRIVATE***@gmail.com",
"personal": null
}
],
"uid": 4
}
}
]
}
}
Looking at issue #10, the mappings for attachments are different from the attachment mappings described in the mapper-attachments plugin documentation. To me this was confusing when configuring the mappings. Perhaps a one-to-one correspondence with the mapper-attachments mappings would help in configuring this properly.
I have also migrated a test environment to ElasticSearch 1.3.2, latest imap-river and mapper-attachments, to test if it happens there as well, but there the compatible mapper-attachments plugin has an issue with extracting text from Word documents, which is being fixed in the beginning of next week.
I'm seeing the same issue vanderstaaij is describing in the last comment. I'm not getting any hits when searching on terms that are in the attachments, and when searching all entries, I'm seeing the base64-encoded data. My setup is a bit simpler I think, mostly using the default settings. I'm also using ElasticSearch 1.3.2 with the latest imap-river and mapper-attachments.
Below the settings for my index:
GET /_river/avropindexer/_search
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "_river",
"_type": "avropindexer",
"_id": "_meta",
"_score": 1,
"_source": {
"type": "imap",
"mail.store.protocol": "imap",
"mail.imap.host": "sedc1exch02",
"mail.imap.port": 993,
"mail.imap.ssl.enable": true,
"mail.imap.connectionpoolsize": "3",
"mail.debug": "false",
"mail.imap.timeout": 10000,
"user": "**********",
"password": "**********",
"schedule": null,
"interval": "60s",
"threads": 10,
"folderpattern": "INBOX",
"bulk_size": 100,
"max_bulk_requests": "30",
"bulk_flush_interval": "5s",
"mail_index_name": "imapriverdata",
"mail_type_name": "mail",
"with_striptags_from_textcontent": true,
"with_attachments": true,
"with_text_content": true,
"with_flag_sync": false,
"index_settings": null,
"type_mapping": {
"mail": {
"properties": {
"attachmentCount": {
"type": "long"
},
"attachments": {
"properties": {
"_indexed_chars": -1,
"content": {
"type": "attachment"
},
"contentType": {
"type": "string"
},
"fileName": {
"type": "string"
},
"size": {
"type": "integer"
}
}
},
"bcc": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"cc": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"contentType": {
"type": "string"
},
"flaghashcode": {
"type": "integer"
},
"flags": {
"type": "string"
},
"folderFullName": {
"type": "string",
"index": "not_analyzed"
},
"folderUri": {
"type": "string"
},
"from": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"headers": {
"properties": {
"name": {
"type": "string"
},
"value": {
"type": "string"
}
}
},
"mailboxType": {
"type": "string"
},
"receivedDate": {
"type": "date",
"format": "basic_date_time"
},
"sentDate": {
"type": "date",
"format": "basic_date_time"
},
"size": {
"type": "long"
},
"subject": {
"type": "string"
},
"textContent": {
"type": "string"
},
"to": {
"properties": {
"email": {
"type": "string"
},
"personal": {
"type": "string"
}
}
},
"uid": {
"type": "long"
}
}
}
}
}
},
{
"_index": "_river",
"_type": "avropindexer",
"_id": "_status",
"_score": 1,
"_source": {
"node": {
"id": "5h8baOdYSFaZyx1nnUVGww",
"name": "D'Ken",
"transport_address": "inet[172.17.5.157/172.17.5.157:9300]"
}
}
}
]
}
}
Please note that the latest mapper-attachments version (2.3.1) has issues as well. See: https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues/82
I'm seeing the same issues for PDFs as well.
I have added several test cases based on your configs. They work well: https://github.com/salyh/elasticsearch-river-imap/commit/d4e71e23a26e4edf8472df3246ad5884141869a5
Can you try to build reproducible test cases for your troubles and share them?
I'm sorry, but I'm not a Java developer and I wouldn't have a clue how to write and run a test case. I can read a little source code, but I'm not knowledgeable in Java.
As far as I can tell from the test cases you wrote, they check whether a query for a certain string in the content field returns a value. I already know this works in my situation, but I want to see the text content instead of base64. Do the tests also check whether the content returned from ElasticSearch truly contains a readable string value (the text of the attachment) instead of base64? I'm unable to check that.
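One way to check a returned value without writing a Java test is to inspect it directly. The sketch below is only a heuristic, not part of river-imap or mapper-attachments: it decodes the first few characters as base64 and looks for the file signatures of ZIP-based documents (such as .docx, whose base64 form starts with "UEsDB" as in the hit shown earlier) and PDFs. The function names are hypothetical.

```python
import base64
import binascii

# File signatures: ZIP container (used by .docx) and PDF.
KNOWN_MAGIC = (b"PK\x03\x04", b"%PDF")


def decode_prefix(value, chars=8):
    """Decode the first 'chars' characters as strict base64, or return None."""
    try:
        return base64.b64decode(value[:chars], validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (for example, ordinary text containing spaces).
        return None


def looks_like_raw_attachment(value):
    """True if the value appears to be a base64-encoded ZIP or PDF file."""
    prefix = decode_prefix(value)
    return prefix is not None and prefix.startswith(KNOWN_MAGIC)


if __name__ == "__main__":
    print(looks_like_raw_attachment("UEsDBBQABgAIAAAAIQDtO5Wb8wEAAFoKAAATAAgCW"))
    print(looks_like_raw_attachment("This sentence is ambiguous."))
```

Applied to the attachments.content value of a search hit, a True result suggests the field still holds the original encoded file rather than extracted text. Other file types would need their own signatures added to KNOWN_MAGIC.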
Please look here: http://tinyurl.com/nbujv7h The attachment mapper seems to store both the base64-encoded and the extracted content.
Testcases check here for extracted values: https://github.com/salyh/elasticsearch-river-imap/commit/d4e71e23a26e4edf8472df3246ad5884141869a5#commitcomment-7619337
This mapping https://github.com/salyh/elasticsearch-river-imap/commit/d4e71e23a26e4edf8472df3246ad5884141869a5#commitcomment-7619353 works with a query like:
esSetup.client().prepareSearch("imapriverdata").addFields("attachments.content").setTypes("mail").setQuery(QueryBuilders.matchQuery("attachments.content", "wrapping")).execute().actionGet();
//search(new SearchRequest("imapriverdata").types("mail")).actionGet();
{ "fields": [ "attachments.content" ], "query": { "match": { "attachments.content": "wrapping" } } }
It does not seem to be a river issue, so maybe you want to ask your question here, because I am not an attachment mapper expert: https://groups.google.com/forum/?fromgroups#!forum/elasticsearch and/or https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues
I have tested with the mappings and documents from your test cases in ES-1.1.0 and plugin versions as mentioned above. I get the same results i.e. base64 returned.
Since mapper-attachments was updated at the beginning of this week (to 2.3.1) and a bug with indexing .doc files was fixed there, I decided to test on ES-1.3.2 with the latest versions of all plugins. Again I used all mappings and the river-imap configuration as per your test cases. In this latest version, your test cases work. I have also tested with my own test cases and can confirm that they now also work fine.
I will migrate my environment to ES-1.3.2 and latest plugins.
Thank you for your efforts on getting this sorted out. I think the good part is that the attachment mappings for your plugin are now a bit better documented. For me the main confusion was the difference between the mapper-attachments documentation (which refers to "file" for an attachment) and the river-imap documentation for attachment mappings (which refers to "content").
thanks too ...
Works for me too, using the config from your testcase. I must have had some error in my initial config. Thanks for your help!
Hi,
I'm having trouble accessing the Tika-extracted text content of attachments when downloaded with river-imap.
I have installed mapper-attachments and also the langdetect and icu_tokenizer plugins. I'm still on ElasticSearch 1.1.0 now and thus using river-imap 0.7b20 and mapper-attachments 2.0.0.
To test whether the mapper-attachments plugin is working and I can extract text, I have created an index called test and a mapping for a type called mapper. I have based this test on what is described here: https://gist.github.com/dadoonet/5310075
POST http://localhost:9200/test:
POST http://localhost:9200/test/mapper/_mapping:
I can index a document:
PUT http://localhost:9200/test/mapper/66d2006d-d269-5de7-5aa8-53f5b7b730e8:
The indexed document looks like this:
GET http://localhost:9200/test/mapper/66d2006d-d269-5de7-5aa8-53f5b7b730e8:
I can perform the following search: POST http://localhost:9200/test/mapper/_search with the following payload (searching for the term "ambiguous"):
This is the result which includes the file field and the extracted content:
To me this confirms the mapper-attachment plugin is working and I can extract my text.
So, my wish is to extract the text from attachments stored by river-imap as well. Therefore I made the following configuration of river-imap, containing type mappings and index settings:
As you can see I have configured content to be of type attachment, but with extended configuration for the mapper-attachments plugin.
I also made some changes to the mapping in order to do nested query searches on attachments, headers and to, which seem to be accepted.
When started, river-imap nicely syncs all messages from the IMAP server to the imapriverdata index. It creates the index nicely with the settings and mappings as specified in the above configuration.
When I perform a search I want to see the extracted text, just like when using the mapper-attachment in the test I described above. I perform this search:
POST http://localhost:9200/imapriverdata/mail/_search with the following payload:
This results in the following:
As you can see, attachments.content shows the base64 encoded data, as opposed to my test above with just the mapper plugin. I'd say the attachments.content field is configured exactly the same as in the test.
Any suggestion why this is happening?