paulcwarren / spring-content

Cloud-Native Storage and Enterprise Content Services (ECMS) for Spring
https://paulcwarren.github.io/spring-content/
Apache License 2.0
260 stars 65 forks

Searchable.search doesn't return keyword's position information #224

Open lmtoo opened 4 years ago

lmtoo commented 4 years ago

Searchable.search doesn't return the keyword's position information, such as pageNumber or text position.

paulcwarren commented 4 years ago

Hmmm, yeah, interesting issue. The search integration aims to return entities whose associated content matches the search terms.

That said, I can see how this would be useful. Perhaps the searchContent endpoint should return a resultset that links to the entity and also supplies additional information about the match including pageNumber, text position, relevancy and so on.
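As a rough illustration of what such a result set entry could carry, here is a minimal Kotlin sketch. The type and field names (`ContentMatch`, `entityLink`, `textPosition` and so on) are assumptions made up for this example; they are not part of Spring Content.

```kotlin
// Hypothetical result type; none of these names come from Spring Content.
data class ContentMatch(
    val entityLink: String,  // link to the entity whose content matched
    val pageNumber: Int?,    // null when the search backend cannot supply it
    val textPosition: Int?,  // offset of the keyword within the extracted text
    val relevancy: Double    // score reported by the search backend
)

// Orders matches so that the most relevant ones come first.
fun byRelevancy(matches: List<ContentMatch>): List<ContentMatch> =
    matches.sortedByDescending { it.relevancy }
```

Making `pageNumber` and `textPosition` nullable is one way to let the same type serve backends that cannot report them.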

paulcwarren commented 3 years ago

@lmtoo you are using elasticsearch, correct?

lmtoo commented 3 years ago

Hi @paulcwarren, I removed spring-content's elasticsearch module and implemented a similar feature myself:

  1. Use a TextExtractor to extract the document's words

  2. Index the document's words with elasticsearch

  3. Use spring-batch to run this job

The TextExtractor looks like this:

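```kotlin
// Assuming Resource is Spring's org.springframework.core.io.Resource
import org.springframework.core.io.Resource

interface TextExtractor {
    // Identifies what this extractor consumes (presumably a content type)
    fun consumes(): String

    // Returns one element per page, each holding that page's words
    fun extract(resource: Resource): List<String>
}
```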

The extract method returns the document's words page by page: each element in the list holds the words of one page.

Each page's words map to a DocumentPage instance, which has a contentId, a pageNumber and a pageContent.
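The mapping from extractor output to DocumentPage instances could be sketched roughly as follows. Only the three field names come from the description above; the field types, the helper name `toDocumentPages` and the 1-based page numbering are assumptions made for this example.

```kotlin
// Hypothetical shape of DocumentPage, based on the three fields named above.
data class DocumentPage(
    val contentId: String,
    val pageNumber: Int,
    val pageContent: String
)

// Turns the extractor output (one string per page) into DocumentPage instances.
// Page numbers are assumed to start at 1.
fun toDocumentPages(contentId: String, pages: List<String>): List<DocumentPage> =
    pages.mapIndexed { index, words -> DocumentPage(contentId, index + 1, words) }

fun main() {
    val pages = toDocumentPages("doc-1", listOf("first page words", "second page words"))
    pages.forEach { println("${it.contentId} p${it.pageNumber}: ${it.pageContent}") }
}
```

Indexing one DocumentPage per page means a search hit already carries its page number, so the search engine does not have to provide it.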

paulcwarren commented 3 years ago

I see. So you have a custom solution for the page number part of it. That makes sense because, to the best of my knowledge, neither elasticsearch nor solr can provide page number information. The closest features they offer are term vectors (for position) and highlighting (for marked-up abstracts). Even then, I don't think solrj (the client API we use) supports term vectors. I also have little to no experience of how accurate the position information from extracted text is once it is applied back to the original document content.

That said, I am definitely happy to extend spring content fulltext modules to support both term vectors and highlighting and then we can see if there is a customization for supporting page numbers but I can't think how to do that cleanly atm. Whilst there I will have a go at tackling your previous issue #223 too.

paulcwarren commented 3 years ago

So, here is where we are at with this one. Spring Content Solr, Elasticsearch and REST all now support custom search types, allowing you to define your own result type to be returned from a searchContent query. This supports custom attributes and highlighting.

I would like to understand your solution better, though, to see how we progress from here.

If I understand your solution correctly, you have one DocumentPage for each page of a document, and each page's content is associated with its DocumentPage instance. It is unclear to me whether you still use searchContent to search that content, or whether you run some other search directly against the word index.