ExtractingRequestHandler support

GoogleCodeExporter commented 8 years ago

Support ExtractingRequestHandler aka tika aka SolrCELL

http://wiki.apache.org/solr/ExtractingRequestHandler

Original issue reported on code.google.com by mauricio...@gmail.com on 5 Oct 2009 at 8:39

GoogleCodeExporter commented 8 years ago

http://groups.google.com/group/solrnet/msg/e05ac9b473e0de2d

On indexing multiple files per request: http://www.mail-archive.com/solr-user@lucene.apache.org/msg33954.html

Original comment by mauricio...@gmail.com on 26 Mar 2010 at 1:58

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Added basic support for the Solr ExtractingRequestHandler extension. Since I 
have no experience with .NET, the code will need some extra work from someone 
else.

Usage:

// Dispose the stream even if AddFile throws.
using (Stream f = File.OpenRead("c:\\example.pdf"))
{
    solr.AddFile(f, new Dictionary<string, string> {{ "literal.id", "id1234" }});
}
solr.Commit();

Original comment by mrandres...@gmail.com on 29 Jun 2010 at 3:45

Attachments:

ExtractingRequestHandler.diff

GoogleCodeExporter commented 8 years ago

Thanks! I'll review it when I get some time.

Original comment by mauricio...@gmail.com on 29 Jun 2010 at 4:05

GoogleCodeExporter commented 8 years ago

Ok, I reviewed the patch, it's a good start, but here are the issues I found:

 * No tests
 * Only works with FileStreams (should work with any Stream)
 * Uses a buffer the size of the file - big files would use lots of memory
 * Depends on Windows association of file extension to find out content-type: I'm not sure how reliable this is. For example, does a bare-bones Windows installation know about application/pdf? Is setting the correct content-type required? Getting the content-type of a generic Stream could be difficult.
 * Some code duplication between Post() and PostBinary() - some refactoring needed there.
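
To illustrate the points about accepting any Stream and avoiding a file-sized buffer, here is a minimal sketch of a chunked copy. The class and method names (StreamCopy, CopyInChunks) are made up for this example and are not part of the patch or of SolrNet. In a real POST the output would presumably be the stream returned by HttpWebRequest.GetRequestStream().

```csharp
using System;
using System.IO;

public static class StreamCopy
{
    // Hypothetical helper, not SolrNet code.
    // Copies input to output in fixed-size chunks, so memory use is bounded
    // by bufferSize rather than by the length of the input.
    public static long CopyInChunks(Stream input, Stream output, int bufferSize = 8192)
    {
        if (input == null) throw new ArgumentNullException("input");
        if (output == null) throw new ArgumentNullException("output");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        var buffer = new byte[bufferSize];
        long total = 0;
        int read;
        // Read returns 0 only at end of stream; it may return fewer bytes
        // than requested, so only the bytes actually read are written.
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Because it only relies on Stream.Read and Stream.Write, the same helper works for a FileStream, a MemoryStream, or a network stream.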

I applied the patch in a new branch: 
http://github.com/mausch/SolrNet/tree/ExtractingRequestHandler

Original comment by mauricio...@gmail.com on 3 Jul 2010 at 6:38

GoogleCodeExporter commented 8 years ago

Original comment by mauricio...@gmail.com on 3 Jul 2010 at 6:38

Changed state: Started

GoogleCodeExporter commented 8 years ago

It seems that Solr does *not* support multiple files in a single request: 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33997.html
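
Given that limitation, a client would send one extract request per file and commit once at the end. A rough sketch follows. The addFile and commit callbacks are hypothetical stand-ins for whatever the client exposes (for example the AddFile and Commit calls from the patch above), and OneFilePerRequest is not a SolrNet type.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OneFilePerRequest
{
    // Hypothetical helper, not SolrNet code.
    // idToPath maps each document id to the path of the file to index.
    public static int IndexAll(
        IEnumerable<KeyValuePair<string, string>> idToPath,
        Action<Stream, string> addFile,
        Action commit)
    {
        int sent = 0;
        foreach (var entry in idToPath)
        {
            // One extract request per file, each with its own stream.
            using (Stream s = File.OpenRead(entry.Value))
            {
                addFile(s, entry.Key);
            }
            sent++;
        }
        // Commit once at the end instead of after every file.
        if (sent > 0) commit();
        return sent;
    }
}
```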

Original comment by mauricio...@gmail.com on 11 Dec 2010 at 2:44

GoogleCodeExporter commented 8 years ago

I'm currently implementing the ExtractingRequestHandler in my SolrNet fork. I 
hope to sort out the issues raised by Mauricio about the earlier patch. If 
anyone has an idea of how they would like it to work, please let me know. I 
will try to get some unit testing done, but I'm not used to writing tests so I 
may need some help.

Original comment by nazmul...@gmail.com on 8 Feb 2011 at 11:14

GoogleCodeExporter commented 8 years ago

Feel free to post any questions in the google group.

Original comment by mauricio...@gmail.com on 9 Feb 2011 at 1:11

GoogleCodeExporter commented 8 years ago

So what's the status of the ExtractingRequestHandler in SolrNet?

Original comment by jeroen.g...@gmail.com on 18 Feb 2011 at 9:23

GoogleCodeExporter commented 8 years ago

Status update: 
http://groups.google.com/group/solrnet/browse_thread/thread/8babf22c83e59aa1

Original comment by mauricio...@gmail.com on 18 Feb 2011 at 6:37

GoogleCodeExporter commented 8 years ago

Merged with master in 80beaac9cf608ed37b67741c1be2deffcfea9551

Added an integration test in 9c7523dc9a767694d2d3b181c9a85e67807cc9ad, but it 
could use some more integration tests.

Original comment by mauricio...@gmail.com on 23 Feb 2011 at 5:53

GoogleCodeExporter commented 8 years ago

Is the ID really required in ExtractParameters? The ID value could also be 
provided through fmap.

Original comment by mauricio...@gmail.com on 30 Apr 2011 at 7:57

GoogleCodeExporter commented 8 years ago

Answering #13: 
http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example 
says: "the literal.id=doc1 param provides the necessary unique id for the 
document being indexed"

Original comment by mauricio...@gmail.com on 8 Dec 2011 at 1:33

GoogleCodeExporter commented 8 years ago

The handler requires an ID field, which is a good idea to have anyway, but it 
is hard-coded to lowercase "id" in the ExtractCommand. I have a fix for this 
in my fork. I haven't done much testing around this yet, though...
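
For illustration only, here is a sketch of how the unique-key field name could be made configurable when building the extract parameters. This is not the fix from the fork. ExtractParams and Build are hypothetical names, and the sketch assumes the id is passed as literal.<field>, as in the wiki quote above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExtractParams
{
    // Hypothetical helper, not SolrNet code.
    // Builds the query string for an extract request. idField is the name of
    // the schema's unique key; it defaults to "id" but is not hard-coded.
    public static string Build(string idValue, string idField = "id",
        IDictionary<string, string> extra = null)
    {
        if (string.IsNullOrEmpty(idValue))
            throw new ArgumentException("id value is required", "idValue");
        if (string.IsNullOrEmpty(idField))
            throw new ArgumentException("id field name is required", "idField");

        var pairs = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("literal." + idField, idValue)
        };
        // Any further parameters (for example fmap.* mappings) are appended as given.
        if (extra != null) pairs.AddRange(extra);

        // URL-encode both names and values before joining them.
        return string.Join("&", pairs.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
    }
}
```

With the default field name, ExtractParams.Build("doc1") yields literal.id=doc1, while a schema whose key is named DocumentId would pass that name as the second argument.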

Original comment by gmpig...@gmail.com on 4 Apr 2012 at 10:15

GoogleCodeExporter commented 8 years ago

Moved to https://github.com/mausch/SolrNet/issues/87

Original comment by mauricio...@gmail.com on 8 Sep 2013 at 4:22

srikanthv123 / solrnet

ExtractingRequestHandler support #79