marklogic-community / Corona

Community REST API for MarkLogic

RFE: Support binary documents #26

Closed hunterhacker closed 12 years ago

hunterhacker commented 12 years ago

It would be good to support storing binary documents.

It would also be good to optionally extract metadata and text content from them and make it available for query.

ryangrimm commented 12 years ago

We now have support for storing and retrieving binary documents. Corona can also retrieve binary documents that are managed outside of Corona. Still to be supported: searching binary documents, and automatic extraction of content and metadata from binaries.

ryangrimm commented 12 years ago

Binary documents are now searchable as well.

ryangrimm commented 12 years ago

When extracting the content of a binary file, what should be done when that content is extremely large? For example, say someone inserts a PDF file that's 10,000 pages of pure text. It's likely that the server won't be configured with enough resources to handle an XML document that large. Should we truncate the file?

hunterhacker commented 12 years ago

What does the standard CPF pipeline do? My guess is it assumes it can handle anything, and lets the underlying memory error propagate up if it's wrong.

Which resources are you concerned about exactly, Ryan? If memory, truncation will be hard because the extracted content has to be in memory before it can be truncated.

ryangrimm commented 12 years ago

Good point. Specifically I was thinking about the in-memory stand size. But if we'd like to just let things fly and allow resource limits to crop up, I'm okay with that. If it proves to be too annoying we can come up with options to help alleviate the problem.
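To illustrate the truncation point discussed above: a cap only saves memory if the extracted text arrives in pieces, so that reading can stop early. The sketch below is a generic Python illustration, not Corona or MarkLogic code; `truncate_stream` and its chunked input are hypothetical names, and it assumes an extractor that can yield text incrementally, which the thread does not confirm.

```python
from typing import Iterable, Iterator


def truncate_stream(chunks: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield text from `chunks` until `max_chars` characters have been emitted.

    Hypothetical helper: it stops pulling from the source as soon as the
    limit is reached, so the rest of the text is never materialized.
    """
    if max_chars <= 0:
        return
    remaining = max_chars
    for chunk in chunks:
        piece = chunk[:remaining]   # cut the last chunk to fit the limit
        remaining -= len(piece)
        yield piece
        if remaining <= 0:
            return                  # stop before requesting another chunk


if __name__ == "__main__":
    pages = (f"page {n} " for n in range(10_000))
    print("".join(truncate_stream(pages, 20)))
```

If the extractor instead returns the whole document as a single value, the full content is already in memory by the time any truncation could run, which is the difficulty raised above.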

ryangrimm commented 12 years ago

Our binary support now includes inserting, fetching, and updating binary documents. Users can also include their own document (XML or JSON) representation of the binary document. If available, metadata and content will be extracted from the binary; this extraction can be turned off. Users can search the extracted content via the wordInBinary structured query constraint.
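As a rough sketch of how a client might use the wordInBinary constraint, the Python below builds a search URL carrying a JSON structured query. Only the constraint name comes from this thread. The JSON shape `{"wordInBinary": ...}`, the `/search` path, the `structuredQuery` parameter name, and the host and port are all assumptions made for illustration and should be checked against the Corona documentation.

```python
import json
from urllib.parse import urlencode


def build_binary_search_url(base_url: str, phrase: str) -> str:
    """Build a search URL for text extracted from binary documents.

    Assumed, not confirmed by the thread: the query shape, the /search
    endpoint, and the structuredQuery parameter name.
    """
    query = {"wordInBinary": phrase}
    params = urlencode({"structuredQuery": json.dumps(query)})
    return f"{base_url}/search?{params}"


if __name__ == "__main__":
    # Hypothetical host and port for a local Corona instance.
    print(build_binary_search_url("http://localhost:8000", "annual report"))
```

The function only constructs the URL; sending the request is left to whichever HTTP client is already in use.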