lbdremy / solr-node-client

A solr client for node.js.
http://lbdremy.github.com/solr-node-client
MIT License

Support the indexing of unstructured (extracted) content from files (in solr known content-types) #90

Open marc-portier opened 10 years ago

marc-portier commented 10 years ago

See http://wiki.apache.org/solr/ExtractingRequestHandler

/update/extract can receive a multipart/form-data uploaded file; Solr will extract the file's contents and index them under the 'text' field.
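For reference, a minimal sketch of what such an extract request looks like over plain HTTP, before any client support (host, port, content type, params and file path are made-up placeholders, not anything prescribed by the wiki page):

var http = require('http');
var fs = require('fs');

// POST a file's raw bytes to the extract handler, with a literal.id
// so the extracted content ends up in a known document.
var req = http.request({
  host: 'localhost',
  port: 8983,
  path: '/solr/update/extract?literal.id=doc1&commit=true',
  method: 'POST',
  headers: { 'Content-Type': 'application/pdf' }
}, function(res){
  console.log('Solr answered with status ' + res.statusCode);
});

fs.createReadStream('some/path/to/some/file.pdf').pipe(req);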

Some interesting side-notes:

lbdremy commented 10 years ago

Hi @marc-portier,

I suppose this is something you need now for your project, am I right?

lbdremy commented 10 years ago

I had a look at the code and I think we could make this more flexible by just returning a writable/readable stream like Client#createAddStream does; this stream would be the object returned by request(options).

The signature of the method would be:

/**
 * Create a writable/readable `Stream` to extract stream's content into the Solr index
 * (argk... insert something better above)
 * @param {Object|String|Query} [query] -
 *
 * @return {Stream}
 * @api public
 */

Client.prototype.createExtractStream = function(query){

};
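A minimal sketch of what the body might look like, assuming the `request` module (https://github.com/request/request) as the HTTP layer and assuming the client keeps its connection settings in `this.options` — neither of which is necessarily how solr-client is actually wired:

var request = require('request'); // assumption: HTTP layer

Client.prototype.createExtractStream = function(query){
  // request() returns a stream that is writable (the request body we
  // upload) and readable (the Solr response), which is what we want here.
  var url = 'http://' + this.options.host + ':' + this.options.port +
    '/solr/update/extract?' + query;
  return request.post(url);
};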

The usage would be something like this:

var solr = require('solr-client');
var FormData = require('form-data');
var fs = require('fs');

var client = solr.createClient();
var form = new FormData();
form.append('my_file', fs.createReadStream('some/path/to/some/file'));

var query = 'literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true';
form
  .pipe(client.createExtractStream(query))
  .on('error', function(err){
    console.log(err);
  })
  .on('end', function(){
    console.log('extraction completed');
  });

The doc http://wiki.apache.org/solr/ExtractingRequestHandler says raw POST is supported too, so we could imagine users piping anything into Client#createExtractStream: a file stream, an HTTP stream, a DB stream, etc.
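For instance (a sketch against the proposed API, reusing the client from the snippet above; the URL and params are made up):

var http = require('http');

// Stream a remote document straight through to Solr.
http.get('http://example.com/some-document.pdf', function(res){
  res
    .pipe(client.createExtractStream('literal.id=doc2&commit=true'))
    .on('error', function(err){
      console.log(err);
    })
    .on('end', function(){
      console.log('extraction completed');
    });
});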

What do you think?

marc-portier commented 10 years ago

Makes sense. I think your proposal wins on two points:

  1. as you say: flexibility for end-users of the client lib in a node context
  2. clearly also: less surprise/confusion in the client API, and a closer resemblance to the createAddStream usage

However:

So, all in all, usage that would look more like

client.createExtractStream(fields, options).pipe(fs.createReadStream('/whatever/path/file.ext'))

could maybe combine the best of both worlds? (This assumes we can register the error and end event handlers on the form-data instance inside the createExtractStream implementation, thus leaving the client user with the highest flexibility and the least typing/boilerplate code to provide?)

Anyway, I should probably have a look at createAddStream and how it works before I make more statements.

lbdremy commented 10 years ago

> I'm not sure about the "query" argument though. The current split into "fields" and "options" makes sense to me in order to hide the required 'literal' prefix, but if you see another way to achieve the same, you'll easily convince me :)

OK, I understand, so like this:

/**
 * Create a writable/readable `Stream` to extract stream's content into the Solr index
 * (argk... insert something better above)
 * @param {Array<String>} fields - 
 * @param {Object} [options] -
 *
 * @return {Stream}
 * @api public
 */

Client.prototype.createExtractStream = function(fields, options){

};

> I'd also prefer it if the form-data dependency and complexity were solved inside solr-node-client and not exposed to the users of the lib

For this one, the thing is:

> Sending documents to Solr: the ExtractingRequestHandler can process any document sent as a ContentStream:
>
> - Raw POST
> - Multi-part file upload (each file is processed as a distinct document)
> - "stream.body", "stream.url" and "stream.file" request params

At the moment the "stream.body", "stream.url" and "stream.file" request params option is available via Client#addRemoteResource. And I think we can cover the two other options with Client#createExtractStream(fields, options) by doing as I suggested (hiding the form-data logic inside solr-client wouldn't allow that, and would limit the number of "parts" you can add to the multipart form to 1, making the multipart form useless).

This makes me think that we actually need to provide a way to set the HTTP headers, i.e. another parameter in the method signature of Client#createExtractStream. Three arguments, I'm not quite a fan... Maybe we add another class ExtractQuery which can be instantiated via Client#createExtractQuery().

So it would look something like this:

var solr = require('solr-client');
var FormData = require('form-data');
var fs = require('fs');

var client = solr.createClient();
var form = new FormData();
form.append('my_file1', fs.createReadStream('some/path/to/some/file1'));
form.append('my_file2', fs.createReadStream('some/path/to/some/file2'));

var query = client.createExtractQuery().fields(['id','name','something_else']);
form
  .pipe(client.createExtractStream(query, form.getHeaders()))
  .on('error', function(err){
    console.log(err);
  })
  .on('end', function(){
    console.log('extraction completed');
  });

Finally the signature would be:

/**
 * Create a writable/readable `Stream` to extract stream's content into the Solr index
 * (argk... insert something better above)
 * @param {Object|String|ExtractQuery} [query] -
 * @param {Object} [headers] - a hash of all HTTP headers to set
 *
 * @return {Stream}
 * @api public
 */

Client.prototype.createExtractStream = function(query, headers){

};

Sorry for the long post, I was figuring out the problem while I was writing.

marc-portier commented 10 years ago

@lbdremy I'm not sure I see the case for extracting more than one file at a time, but I clearly might be missing something obvious.

As I see it:

And to be clear: the limitation as I see it is not the JavaScript signature, but the Solr request: there is only one request that will carry the literal.* params, and they will be assigned to the (one) document that is created by the solr-extract-handler from the content and metadata retrieved from the (one?) uploaded file. (IMHO Solr on the server side will thus use Tika to extract content, links, and metadata like title, author and content-type/mimetype from the uploaded file, and create a local Solr document with those fields plus any literal fields that were added in the request. But again: I might be missing something.) -- So there is surely a difference with the addStream, which is really there to stream multiple documents to the Solr server, right?

Hence why I see the form-data thingy as internal to the client lib: just a mechanism to comply with the Solr way of sending stuff to the extract handler. The solr-client user just has one content stream, a set of fields, and maybe some additional options.

Apart from that: I'm not against bundling the options and fields into an "extractQuery".

And from a client perspective I would expect to receive an extractStream into which to pipe() the contents of the content stream (the file I want to upload) to be extracted by Solr, not the query and form headers??

But also: I have to admit I'm not yet fluent enough with nodejs streams to propose changed code that would express my thoughts more clearly. So bear with me.

marc-portier commented 10 years ago

Reading up on streams -- the obvious first lesson is the inversion of the pipe logic: duh, silly me!

readable.pipe(destination)

and not the other way around (as I was clearly thinking earlier)

So my hoped-for one-liner should at least look like this (reversed):

fs.createReadStream('/whatever/path/file.ext').pipe(client.createExtractStream(fields, options));

Still have to figure out how we can tunnel the data events from the file read stream into the extract write stream... Hoping there is a ready-made util for that.
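For the record, that tunnelling is essentially what pipe() itself already does under the hood; a hand-rolled sketch of the same (fs, client, fields and options as in the one-liner above):

var fs = require('fs');

// Sketch: forward a readable stream's events into a writable stream by
// hand -- which is roughly what readable.pipe(writable) does for us.
function tunnel(readable, writable){
  readable.on('data', function(chunk){
    writable.write(chunk);
  });
  readable.on('end', function(){
    writable.end();
  });
  readable.on('error', function(err){
    writable.emit('error', err);
  });
}

tunnel(fs.createReadStream('/whatever/path/file.ext'),
  client.createExtractStream(fields, options));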

lbdremy commented 10 years ago

> Finally, sending only one file does not mean form-data usage becomes useless? In my understanding: even if there is only one part/file to upload, we're still using the multipart encoding mechanism from the form-data lib for the required multipart-mime and base64 encoding?

You're absolutely right, "useless" is really the wrong/too strong a word to describe the limitation to one file, and like you said this limitation is actually enforced by Solr.

Also, I made a mistake: fields is actually not an Array of fields but an Object, with each key representing a field name and the value being the field's value; the name is kind of misleading.
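With fields as an object, the mapping onto the extract handler's literal.* parameters could then be as simple as this sketch (illustration only, not the lib's actual code):

// Sketch: turn a fields object into literal.* request parameters,
// e.g. {id: 'doc1', author: 'Jane'} -> 'literal.id=doc1&literal.author=Jane'
function fieldsToLiteralParams(fields){
  var params = [];
  for(var name in fields){
    params.push('literal.' + encodeURIComponent(name) + '=' +
      encodeURIComponent(fields[name]));
  }
  return params.join('&');
}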

> So my hoped-for one-liner should at least look like this (reversed):
>
> fs.createReadStream('/whatever/path/file.ext').pipe(client.createExtractStream(fields, options));

Yeah, the problem with that is we need to pipe this stream into the form-data stream, but the form-data stream is only readable, not writable, so no luck for us. I had a look at the underlying stream used in the form-data module; it is called combined-stream https://github.com/felixge/node-combined-stream/blob/master/lib/combined_stream.js, and in the constructor it is clearly stated that the stream is not writable.

It is kind of too bad that we can only add stuff to this stream via the append method and not the standard pipe method. I think something like this could work around it:

var util = require('util');
var Stream = require('stream').Stream;

function CombinedStream() {
  var self = this;

  this.writable = true; // the stock constructor sets this to false
  this.readable = true;
  this.dataSize = 0;
  this.maxDataSize = 2 * 1024 * 1024;
  this.pauseStreams = true;

  // When a readable stream is piped into us, cancel the regular pipe
  // plumbing and register the source via append() instead.
  this.on('pipe', function(source) {
    source.unpipe(self);
    self.append('unknown_field', source);
  });

  this._released = false;
  this._streams = [];
  this._currentStream = null;
}
util.inherits(CombinedStream, Stream);

Obviously here we don't pass the field name, so we'd need to propose another way to provide it.
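With that 'pipe' handler in place, a plain pipe() into the combined stream would work, since the source just gets re-routed through append() (a sketch against the patched constructor above; the field name is still the hard-coded placeholder):

var fs = require('fs');

// The 'pipe' event handler unpipes the source and calls
// combined.append('unknown_field', source) instead.
var combined = new CombinedStream();
fs.createReadStream('/whatever/path/file.ext').pipe(combined);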

marc-portier commented 10 years ago

Hm, that somewhat gave me the naive idea that I could try something like this: https://gist.github.com/marc-portier/92f595ce92aaefb5731b

(keeping the old signatures around for now in refactoring mode and reusing the current tests I have)

But that doesn't work quite as expected, and mostly confirms my feeling that I'm not up to speed with the complete Streams API for this kind of hacking.

Anyway, I'm not sure how to proceed now.

More thoughts welcome.

marc-portier commented 10 years ago

Good news: I've found my bug - the tests are running again on top of the refactored solution! Will try to clean up and commit to my branch this evening.

marc-portier commented 10 years ago

Done. Thx for the feedback and hint towards resolution.

marc-portier commented 10 years ago

Have been thinking some more on this. The current state is in fact working, but the callback handling seems somewhat awkward in a normal streaming context...

Currently the signature of createExtractStream expects a callback(err, obj) to be passed as the last argument... leading to this kind of usage:

    var extractStream = client.createExtractStream(doc, options, function(err, data){
        sassert.ok(err, data);
        done();
    });
    fs.createReadStream(doc.path).pipe(extractStream);

However, the streaming API suggests communicating errors, data and finalization through events:

    var extractStream = client.createExtractStream(doc,options);
    var err = null, data = null;
    extractStream
        .on('error', function(error){
            err = error;
        })
        .on('data' , function(obj){
            data = obj;
        })
        .on('end',   function(){
            sassert.ok(err,data);
            done();
        });
    fs.createReadStream(doc.path).pipe(extractStream);

@lbdremy wdyt? Looks like the more logical thing to do, no?

lbdremy commented 10 years ago

Yeah, I think so too.

I don't think the suggested approach is really the "stream way", because it kind of limits the usage of the readable part of the stream (my assumption here is that data in your example represents the whole Solr response deserialized, i.e. a JS object representing the whole JSON sent by Solr). I think you should instantiate a third stream which takes care of putting the incoming "data" together, and on "end" emits the whole deserialized response, to match your particular case (which I agree is the most common use case):

 fs.createReadStream('some/path')
   .pipe(client.createExtractStream(doc, options))
   .pipe(someLibrary.reduceIntoObject())
   .on('end', function(data){
     console.log(data); // Solr response deserialized
   });

There are libraries like https://github.com/parshap/node-stream-reduce and https://github.com/dominictarr/event-stream to do this kind of job, or even https://github.com/dominictarr/JSONStream, which could help you achieve something a little bit different but still interesting if you only care about a few properties of the response.
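For what it's worth, a hand-rolled equivalent of the reduceIntoObject() step above can be quite small (a sketch matching the 'end' usage in the snippet; the linked modules do the same more robustly):

var Stream = require('stream');

// Sketch: a minimal classic writable stream that buffers every chunk
// and emits the deserialized Solr response once the upstream ends.
function reduceIntoObject(){
  var stream = new Stream();
  var chunks = [];
  stream.writable = true;
  stream.write = function(chunk){
    chunks.push(chunk);
    return true;
  };
  stream.end = function(){
    stream.emit('end', JSON.parse(chunks.join('')));
  };
  return stream;
}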

marc-portier commented 10 years ago

Hm... I'm afraid you've lost me again on the 3rd stream and the usage of the 'some-library'. Anyway, I had already cranked out some implementation this afternoon (without the use of a library or a 3rd stream) - I've checked that in (ba65550) just now; maybe you can have a look and comment on where additional changes should go?

In the meantime I'm reading up on some of the libraries you suggested.

marc-portier commented 10 years ago

After reading the stream-reduce suggestions you sent, I have the feeling that the 3rd stream you suggest is in fact the one we create now inside the client (i.e. the ExtractHelperStream at https://github.com/lbdremy/solr-node-client/blob/ba655502df9ca8e5eba6471428ef48090267a13f/lib/solr.js#L203).

Counting them:

  1. the file read stream
  2. the http connection to the Solr server
  3. this extract helper (actually in the form-data package there is another one)

So I'm getting confident that the current solution is pretty close. Actually, the main thing I'm unsure about is whether we should repeat 'end' events from the source to the target stream, or else document that people should listen to the correct stream.

dmgreene commented 9 years ago

Has there been any update on this? I'd love to use this feature in my code, since I'm trying to index Word documents in Solr.

marc-portier commented 9 years ago

I had a (for me) working solution in this branch: https://github.com/marc-portier/solr-node-client/tree/v0.3.x-patch90

You can test-drive it, and help discuss and/or fine-tune how it could be incorporated into the main branch.

Have to say my focus has shifted a bit since then - I hope to get back into it later. Until then you have my blessing to tear apart my code :)

dmgreene commented 9 years ago

Thank you so much @marc-portier for your quick response. I hope someone picks up your code and brings it back into the baseline. Personally, I'm still struggling with making a direct HTTP call to update/extract; I can't seem to get Solr to find the Tika jars. :( But anyway, that's neither here nor there. Thank you for directing me to your code. Once I get around the Tika jars issue, I'll check out your new feature.