plombard / SubversionRiver

A River for ElasticSearch to index subversion repositories.

SVNException when branch has gaps in revision history #5

Open d42ohpaz opened 11 years ago

d42ohpaz commented 11 years ago

Not sure why our repository is set up the way that it is (it predates my employment), but when trying to index the branch, it fails due to an unhandled SVNException. Would it be possible to catch and log the exception, but allow the process to continue so it can find valid revisions to index? As it stands currently, because of this, it cannot index anything from the particular branch.
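To make the request concrete, here is a minimal sketch of the "catch, log, and continue" behaviour. It does not use the river's actual classes: `TolerantCrawl`, `RevisionFetcher`, and `crawl` are hypothetical names, and the fetcher stands in for whatever call retrieves revisions for a range. The window arithmetic (from `start` to `start + bulk_size`) is only inferred from the log below, where a `bulk_size` of 400 produced the range [1] to [401].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class TolerantCrawl {
    private static final Logger LOG = Logger.getLogger(TolerantCrawl.class.getName());

    /** Hypothetical stand-in for the river's revision fetch; may throw when the path is absent. */
    interface RevisionFetcher {
        List<Long> fetch(long from, long to) throws Exception;
    }

    /**
     * Walks the revision range in windows of bulkSize. A window whose fetch
     * fails is logged and skipped instead of aborting the whole run.
     */
    static List<Long> crawl(long start, long head, long bulkSize, RevisionFetcher fetcher) {
        if (bulkSize < 0) {
            throw new IllegalArgumentException("bulkSize must not be negative");
        }
        List<Long> indexed = new ArrayList<>();
        long from = start;
        while (from <= head) {
            long to = Math.min(from + bulkSize, head);
            try {
                indexed.addAll(fetcher.fetch(from, to));
            } catch (Exception e) {
                // Log and move on so later windows still get a chance.
                LOG.log(Level.WARNING, "Skipping revisions [" + from + "] to [" + to + "]", e);
            }
            from = to + 1;
        }
        return indexed;
    }
}
```

One caveat of this approach: a skipped window is never retried, so any revisions of the branch inside a failed window would be lost.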

My particular topology is to index trunk separately from each individual branch, so that my users can choose which part of the repository they wish to search.

The debug messages generated by the river are below. I did substitute the variable information due to company policies, but I do not believe that it detracts from the report. Please advise if I can provide other information to aid in this issue.

[2013-11-21 09:47:49,259][INFO ][river.subversion         ] [subversion] [svn][{river-name}] Indexing subversion repository : {repos}/{path}
[2013-11-21 09:47:49,265][DEBUG][river.subversion         ] [subversion] [svn][{river-name}] Get Indexed Revision Index [{river-name}] Type [indexed_revision] Id [_indexed_revision_36f8485ca6ba16de05638be719a62b0f] Fields [{}]
[2013-11-21 09:47:49,265][INFO ][river.subversion         ] [subversion] [svn][{river-name}] Indexed Revision Value [0]
[2013-11-21 09:47:49,308][DEBUG][river.subversion         ] Repository Root: {svn-url}{svn-path}
[2013-11-21 09:47:49,308][DEBUG][river.subversion         ] Repository UUID: 939eac4c-494f-49cc-954a-2f9dce6e9c68
[2013-11-21 09:47:49,322][DEBUG][river.subversion         ] Repository HEAD Revision: 1450
[2013-11-21 09:47:49,420][DEBUG][river.subversion         ] [subversion] [svn][{river-name}] Checking last revision of repository : {repos}/{path} --> [1450]
[2013-11-21 09:47:49,420][DEBUG][river.subversion         ] [subversion] [svn][{river-name}] Indexing repository {repos}/{path} from revision [1] to [401] incremental [false]
[2013-11-21 09:47:49,421][INFO ][river.subversion         ] Retrieving revisions of {repos}{path} from [1] to [401]
[2013-11-21 09:47:49,475][WARN ][river.subversion         ] [subversion] [svn][{river-name}] Subversion river exception
org.tmatesoft.svn.core.SVNException: svn: E160013: '{svn-path}/!svn/bc/401{path}' path not found: 404 Not Found ({svn-url})
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:64)
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:51)
    at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.logImpl(DAVRepository.java:993)
    at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:1035)
    at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:940)
    at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:864)
    at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:1412)
    at org.elasticsearch.river.subversion.SubversionCrawler.getRevisions(SubversionCrawler.java:139)
    at org.elasticsearch.river.subversion.SubversionRiver$Indexer.run(SubversionRiver.java:252)
    at java.lang.Thread.run(Thread.java:744)
[2013-11-21 09:47:49,476][DEBUG][river.subversion         ] [subversion] [svn][{river-name}] Subversion river is going to sleep for 900000 ms

The river was created using the following cURL call:

curl -XPUT 'localhost:9200/_river/{river-name}/_meta' -d '{"type": "svn", "svn": {"repos": "{repos}", "path":  "{path}", "login": "{login}", "password": "{password}", "start_revision": "1", "update_rate": "900000", "bulk_size": "400"}}';

plombard commented 11 years ago

It's indeed a bug. But rather than catching the exception and allowing it to continue, I'd rather test the path validity before running the svn log operation, and then run the operation on a revision where the path exists. I'll see what I can come up with. In the meantime, a potential workaround for you could be to increase the bulk_size parameter to a much bigger size, possibly the entire range of revisions in your repository, to ensure that the river will try to retrieve revisions from a range where the branch is guaranteed to exist at some point. The log operation will only pick the revisions where paths of the branch were changed, so with a bit of luck, only a few, compared to the entire repository. Of course, it will only work if each branch that you want to index is not big enough to exhaust the heap of the elasticsearch jvm.
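The "test the path first" idea could look roughly like the sketch below. This is not the fix that was eventually written, only an illustration under assumptions: `RevisionWindows`, `indexableWindows`, and the `pathExistsAt` predicate are hypothetical names. With SVNKit, the predicate could plausibly be backed by `SVNRepository.checkPath(path, revision)` returning something other than `SVNNodeKind.NONE`, but that wiring is left out so the example runs on the standard library alone. The stack trace above suggests the log request is pegged at the end revision of the range (`!svn/bc/401`), so the sketch checks the path at the end of each window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

class RevisionWindows {
    /**
     * Splits [startRevision, headRevision] into windows of bulkSize and keeps
     * only those whose end revision contains the path. Each window is returned
     * as a two-element array {from, to}.
     */
    static List<long[]> indexableWindows(long startRevision, long headRevision,
                                         long bulkSize, LongPredicate pathExistsAt) {
        if (bulkSize < 0) {
            throw new IllegalArgumentException("bulkSize must not be negative");
        }
        List<long[]> windows = new ArrayList<>();
        long from = startRevision;
        while (from <= headRevision) {
            long to = Math.min(from + bulkSize, headRevision);
            // Only run the log operation on a window whose end revision has the path.
            if (pathExistsAt.test(to)) {
                windows.add(new long[] {from, to});
            }
            from = to + 1;
        }
        return windows;
    }

    public static void main(String[] args) {
        // Assumption for the demo: the branch exists from revision 900 onward.
        for (long[] w : indexableWindows(1, 1450, 400, rev -> rev >= 900)) {
            System.out.println("[" + w[0] + "] to [" + w[1] + "]");
        }
    }
}
```

With the reporter's settings (start revision 1, HEAD 1450, `bulk_size` 400) and a branch assumed to appear at revision 900, the windows ending at 401 and 802 are skipped and the ones ending at 1203 and 1450 are kept. A branch that was created and deleted inside a single window would still be missed, so a real fix would need a finer-grained check.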

d42ohpaz commented 11 years ago

Thank you for the suggestion. That did the trick for indexing my branch. I definitely agree that your suggestion sounds like a better idea than mine and I look forward to when you have the opportunity to implement it.