jprante / elasticsearch-jdbc

JDBC importer for Elasticsearch
Apache License 2.0

Question about elasticsearch attachments and jdbc river #286

Open · istvano opened 10 years ago

istvano commented 10 years ago

Hi Guys,

I am using elasticsearch to index my db table in mysql. I am using the river to fetch data from mysql and send it to ES.

The additional complexity is the following:

I have a filepath column in MySQL. This is a reference to a binary file (Word document, PDF) in the filesystem which is actually part of the data, and I need to send it to ES as an attachment as well, since I need the content of the file indexed too.

Could someone help me to figure out what are the right strategies to do this?

Should I implement a custom strategy? Should I move all my files to the DB instead of the filesystem? Is the ES JDBC river able to send the file to ES? Would this approach slow down indexing considerably?

I would appreciate any suggestions on which way I should go.

Thanks in advance, Istvan

jprante commented 10 years ago

Have you tested if you can just select the paths from DB, build a JSON string array out of it, and use fs river? Do you have special requirements or do you just parse the files with Tika once?
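
For example, the result of that select could simply be collected into a plain JSON string array of file paths for the FS river to work with; the paths below are just placeholders:

```json
["/data/files/report-2014.pdf", "/data/files/contract-0815.doc"]
```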

istvano commented 10 years ago

Hi,

Thanks for getting back to me so fast !

I have a database table with the following columns:

`id`, `submission_date`, `release_date`, `country_id`, `classification`, `category`, `file_ref`

I need all this info in ES. I will use ES as a full search engine, searching for data even in my own metadata (also filtering). Once the attachment is in ES, all I want to do is parse it with Tika, nothing more.

I am sorry for the ignorant question, but I do not see how I can use the FS river only. I need the DB table data + the file in ES at the same time.

Is this possible ?

Thanks for any pointers.

Istvan

jprante commented 10 years ago

You can use JDBC river for the DB table you described, then you can offload the file names from DB and run FS river over it. Then you can have linked results from the two rivers in ES, either by one index with two types, or two indices.
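
For illustration only, the two river definitions (the `_meta` documents you would PUT under the `_river` index) could look roughly like this; the parameter names are taken from memory of the JDBC river and FS river docs and may differ between plugin versions:

```json
{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://localhost:3306/mydb",
    "user": "user",
    "password": "secret",
    "sql": "select * from mytable",
    "index": "docs",
    "type": "jdbc"
  }
}
```

and, pointing the FS river at the directory that holds the files referenced by file_ref:

```json
{
  "type": "fs",
  "fs": {
    "url": "/path/to/files",
    "update_rate": 3600000
  },
  "index": {
    "index": "docs",
    "type": "fs"
  }
}
```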

istvano commented 10 years ago

Hi,

That sounds good. How can I do it? How can I make sure the FS river runs after the JDBC river? How can I associate the FS river file with the document indexed by JDBC?

Thanks.

istvano commented 10 years ago

Is there an example somewhere on the wiki which shows how to offload the filenames to the FS river?

istvano commented 10 years ago

I see what you mean. Would it not be an issue in this case that relevance would suffer? Also, in terms of ordering, how would I merge the two types together when searching?

Thanks again for your time !

jprante commented 10 years ago

You can run both rivers concurrently.

I assume file_ref is the file name. If you index this, you can retrieve file information. Just design your index mapping accordingly.

An SQL example to get the filenames for the FS river is very simple: run `select file_ref from table` from your DB command line tool.

I do not understand why relevance would suffer, can you be more specific? What do you mean by ordering?

istvano commented 10 years ago

Hi,

First of all, thanks for all the help! I appreciate it, especially since it is Sunday! :)

I have some assumptions I thought I would validate.

1. The JDBC river will send the DB table to indexA, containing the DB fields; the ES document id will be the DB row id:

   `select * from table`

2. The FS river will index the files referenced by the DB table:

   `select file_ref from table`

Both the FS river and the JDBC river would have to update the same record, identified by the same ID in ES. How will the FS river know which documents to update?

Thanks

jprante commented 10 years ago

JDBC river is a spare time project.

There is no need to update the same record. You can use two mappings for the two rivers in one index.

Assume an index docs.

1. The JDBC river can use the index docs and a type of jdbc. In that type mapping, a field file_ref is indexed containing the file name. Ensure in the mapping this field is not analyzed (see the mapping sketch below).
2. The FS river can use the index docs and a type of fs. In that type mapping, a field file.filename is indexed containing the file name. Ensure in the mapping this field is not analyzed.
3. You can search in docs, docs/jdbc, or docs/fs. If you want to follow a link from docs/jdbc, build a query {"query": {"term": {"file.filename": "<value from file_ref>"}}} to search on docs/fs. Or, if you want to follow a link from docs/fs to docs/jdbc, build a query {"query": {"term": {"file_ref": "<value from file.filename>"}}}.
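
A minimal sketch of the mappings section when creating the docs index, covering only the two link fields (all other fields are omitted, and the nesting of file.filename assumes the FS river stores file metadata under a file object):

```json
{
  "mappings": {
    "jdbc": {
      "properties": {
        "file_ref": { "type": "string", "index": "not_analyzed" }
      }
    },
    "fs": {
      "properties": {
        "file": {
          "properties": {
            "filename": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```

Keeping both fields not analyzed means the term queries above match the stored file name exactly.
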
istvano commented 10 years ago

Hi,

This spare time project is great, and I hope this thread might be useful for others too.

That is clear! Thank you! I am pretty new to elasticsearch as you can tell.

If I went down this route, would I not get 2 documents back with different relevance in the search result when searching?

In this case I would have to implement logic in my API to make sure the duplicates are "merged". Is there an easy way around that?

Kind Regards

jprante commented 10 years ago

I do not know what your queries look like. Relevance is always related to query terms. Without more info I tend to say: do not worry.

If you search on docs, you may find two docs with the same file name, and the application logic can decide how to use the information from the two docs. This might not meet your requirements.

If you search on docs/jdbc or docs/fs, there is at most one document per file name in a result set, and the other information can be appended by a second query on demand, by following the link given by the file name. This model follows the data model of your two sources, and you can see exactly whether the state of the indexed data is consistent with the state of the source.

If you want a single hit model, it is more complex to merge and to maintain updates, since you have two sources. In that case you would have to write a custom program to traverse the file names, read the file for Tika, use the file name as unique doc ID, select a single distinct row from DB with the file meta data, and build a JSON doc to index into ES.

Rivers do not work like this; they are not meant for collecting attributes for entities from various sources. Instead, they "stream" data into Elasticsearch "as is". Implementing entity construction in the JDBC river would have to repeat much of the FS river, and the result would be quite complex compared to using the JDBC and FS rivers as they are, side by side.

You also have to consider updates and the consistency of your two data sources. If your DB has file metadata but there is no file, you have to decide whether this is a valid document for indexing or not. With two rivers, you do not have to make this decision. Only in the custom application query logic do you have to decide what to do if it is not possible to follow a link.

istvano commented 10 years ago

Hi,

Wow, thanks for the detailed information.

Unfortunately my table contains data that needs to be analysed (text based): a title and several other fields that will need to be used for filtering as well. Hence I think I need to keep the data in one index, as I have fields like document category and document type which are not available in the file at all.

Is there no way to intercept the river inserting the records into ES (I guess it uses the bulk API) on a per-record basis, load the file from the filesystem, and add it as an extra field?

Thanks

jprante commented 10 years ago

As said, you can use two rivers for one index.

For processing files from the filesystem, you would have to reimplement the FS river inside the JDBC river.

ilkermutlu commented 9 years ago

@istvano, did you come up with a solution to this? I need to implement something very similar.

istvano commented 9 years ago

Hi, I ended up using this -> https://github.com/searchisko/elasticsearch-river-remote

I have built a tiny API which is called by this remote river and handles everything: deletes, updates, incremental and full.

I think the JDBC river is great, but this is a special case which is probably easier to deal with outside of it. I am not sure if this helps you, but this is what I came up with.

Kind Regards,

jprante commented 9 years ago

@istvano do you use JDBC with https://github.com/searchisko/elasticsearch-river-remote and if so, how do you recognize deletes?

istvano commented 9 years ago

Hi, I am using the remote API at the moment, as I can load the files from the file system. Basically the solution is that in the mapping I can set up a field and a value; if that field contains the configured value for a given row, the river treats it as a delete instead of an update.

Support for data deletes

River supports correct update of search indices for two basic types of data deletes in remote system.

If deleted data simply disappear from the remote system API responses, then they are deleted from the search index at the end of the next full update. It is not possible to catch this type of delete during incremental updates.

If deleted data are only marked by some flag and correctly timestamped so that they are returned by your system in the next incremental update request, then you can use the remote_field_deleted and remote_field_deleted_value river config params to point the river to this flag and delete data from the search index even during an incremental update. The configured delete flag is reflected during a full update also. This feature is available from version 1.6.2 of the river.

Note: You have to correctly set analyzers for some fields in the mapping to allow correct deletes from the search index, see the previous chapter!
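
For reference, a sketch of how those two params might sit in the remote river definition; the surrounding structure is only assumed from the usual river layout, and the field name and value are placeholders:

```json
{
  "type": "remote",
  "remote": {
    "remote_field_deleted": "status",
    "remote_field_deleted_value": "deleted"
  }
}
```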

jprante commented 9 years ago

@istvano yes, I know how RESTful APIs can recognize deletes because there are document identifiers (the URI), but I guess that the solution you found is not using JDBC result sets where there are no identifiers for documents?

I just ask because I would like to improve the JDBC plugin with a document identifier mechanism, maybe with the help of an SQL result set to URI mapper.