plombard / SubversionRiver

A River for ElasticSearch to index subversion repositories.

Feature Request: Dealing with "HUGE" single SVN Revisions #3

Open ThomasMannIT opened 11 years ago

ThomasMannIT commented 11 years ago

Hello, your SVN River is working great.

While testing it I ran into the following situation:

In an SVN repository to be indexed, there is a revision in which a hero committed 6 GB of data (not binary, but plaintext SQL dumps -.-).

Reducing the bulk_size option down to 1 doesn't help, since it is still a single revision to be indexed, and indexing 6 GB of data leads to out-of-heap exceptions even on an 8 GB machine with 7 GB of heap space.

For the moment I have avoided the problem by letting the river index up to revision x-1 and then defining a start_revision of x+1.

But this workaround doesn't feel right.

Maybe some new "river options" could help:

Like:

a) max_bulk_size_in_mb
b) File-extension filters
c) Folder filters
d) Revision filters
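To make the suggestion concrete, the river creation request might accept options along these lines. All option names below (apart from the existing bulk_size) are hypothetical proposals, not an existing API:

```json
{
  "type": "svn",
  "svn": {
    "repos": "http://svn.example.com/repo",
    "bulk_size": 200,
    "max_bulk_size_in_mb": 100,
    "extension_filters": [".sql", ".dump"],
    "folder_filters": ["/dumps"],
    "revision_filters": [4242]
  }
}
```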

plombard commented 11 years ago

Hi, and thanks for the feedback.

It should be fairly simple to implement some filters and a max_size option; I'll get to it as soon as I can.

On the other hand, I'm not fond of filtering out entire revisions or folders. They should be present so that the history of the repository can still be browsed. Since the resulting index is far from sufficient to browse the repositories easily, a lot of functionality is already left to the front-end (if you want something like ViewSVN), so I think keeping a trace of every revision/change is mandatory. The content, however, isn't: we could replace the 6 GB of text with just a warning message.
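A minimal sketch of that idea (the class, method, and placeholder text below are all hypothetical, not part of the river's current code): the revision entry is kept, but file content exceeding a configured limit is replaced by a short warning string before it ever reaches a bulk request.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: keep every revision in the index, but replace
// oversized file content with a warning message so a 6 GB SQL dump
// never reaches the indexing request.
public class ContentCapper {
    static final String PLACEHOLDER =
            "[content omitted: exceeds configured max content size]";

    // maxBytes would come from a river setting such as the proposed
    // max_bulk_size_in_mb; the setting name is illustrative only.
    static String capContent(String content, long maxBytes) {
        long size = content.getBytes(StandardCharsets.UTF_8).length;
        return size > maxBytes ? PLACEHOLDER : content;
    }

    public static void main(String[] args) {
        System.out.println(capContent("CREATE TABLE t (id INT);", 1024));
        System.out.println(capContent("x".repeat(2048), 1024));
    }
}
```

This way the document for the offending file still exists and is searchable by path and revision metadata; only its body is sacrificed.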

And you are absolutely right, heap consumption is a concern, since I foolishly load the entire revision (content included) into memory. Maybe I'll try to index the file content separately from the metadata; I don't know yet.
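One way to keep heap usage bounded, sketched here under the assumption that the SVN layer can expose file content as a stream rather than a single byte array (the class and method names are hypothetical): process the content in fixed-size chunks, so memory use stays constant regardless of file size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: consume file content in fixed-size chunks so heap
// usage stays constant, instead of loading the whole revision at once.
public class ChunkedReader {
    static long process(InputStream in, int chunkSize) throws IOException {
        byte[] buf = new byte[chunkSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            // Each chunk could be appended to an indexing request here,
            // or the file abandoned once 'total' passes a size limit.
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream fake = new ByteArrayInputStream(new byte[5000]);
        System.out.println(process(fake, 1024)); // prints 5000
    }
}
```

With a streaming approach, the size check for the warning-message fallback also becomes cheap: stop reading as soon as the running total crosses the limit.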