halleck1 / bzreader

Automatically exported from code.google.com/p/bzreader

Can't index danish wiki. #19

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Installed BzReader
2. Tried to index wikiquoteEN (works fine)
3. Tried to index
http://download.wikimedia.org/dawiki/20100107/dawiki-20100107-pages-articles.xml.bz2
- Fails during indexing. I also tried older Danish dumps, but they
don't work either.

What is the expected output? What do you see instead?
It starts indexing, but Windows 7 then says the program has stopped working
and shuts it down.
No error message.

What version of the product are you using? On what operating system?
The newest, 1.0.12.

Please provide any additional information below.
I'm on Windows 7 Ultimate, English version.
I have tried running the program as an administrator. No change.
Works fine with the English version of Wikipedia.
Could it be because of the Danish letters (Ææ, Øø and Åå)?

Original issue reported on code.google.com by SBenjam...@gmail.com on 11 Jan 2010 at 4:55

GoogleCodeExporter commented 9 years ago
Hello,

I suspect this is due to the Danish characters. Unfortunately, I don't really
have free time at the moment, so I hope someone else can fix this one.

Regards,
Vlad

Original comment by halle...@gmail.com on 12 Jan 2010 at 8:58

GoogleCodeExporter commented 9 years ago
SBenjaminP, is this still an issue?  If it is, I'll try to index the Danish
Wikipedia myself and see what happens.

Original comment by asaf.bartov on 7 Apr 2010 at 6:45

GoogleCodeExporter commented 9 years ago
Okay, the issue reproduces on Windows XP too.

It's a problem in Snowball.NET, the stemmer used by Lucene.NET.

Here's the exception, for the record:

System.SystemException was unhandled
  Message="System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
Parameter name: length
   at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length, Boolean fAlwaysCopy)
   at System.Text.StringBuilder.ToString(Int32 startIndex, Int32 length)
   at SF.Snowball.SnowballProgram.slice_to(StringBuilder s) in C:\Asaf\wikimedia\bzreader\Snowball.NET\SF\Snowball\SnowballProgram.cs:line 466
   at SF.Snowball.Ext.DanishStemmer.r_undouble() in C:\Asaf\wikimedia\bzreader\Snowball.NET\SF\Snowball\Ext\DanishStemmer.cs:line 353
   at SF.Snowball.Ext.DanishStemmer.Stem() in C:\Asaf\wikimedia\bzreader\Snowball.NET\SF\Snowball\Ext\DanishStemmer.cs:line 441
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle._InvokeMethodFast(Object target, Object[] arguments, SignatureStruct& sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(Object target, Object[] arguments, Signature sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next() in C:\Asaf\wikimedia\bzreader\Snowball.NET\Lucene.Net\Analysis\Snowball\SnowballFilter.cs:line 72"
  Source="Snowball.Net"
  StackTrace:
       at Lucene.Net.Analysis.Snowball.SnowballFilter.Next() in C:\Asaf\wikimedia\bzreader\Snowball.NET\Lucene.Net\Analysis\Snowball\SnowballFilter.cs:line 76
       at Lucene.Net.Index.DocumentWriter.InvertDocument(Document doc) in C:\Asaf\wikimedia\bzreader\Lucene.Net\Index\DocumentWriter.cs:line 181
       at Lucene.Net.Index.DocumentWriter.AddDocument(String segment, Document doc) in C:\Asaf\wikimedia\bzreader\Lucene.Net\Index\DocumentWriter.cs:line 106
       at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer) in C:\Asaf\wikimedia\bzreader\Lucene.Net\Index\IndexWriter.cs:line 616
       at Lucene.Net.Index.IndexWriter.AddDocument(Document doc) in C:\Asaf\wikimedia\bzreader\Lucene.Net\Index\IndexWriter.cs:line 603
       at BzReader.Indexer.TokenizeAndAdd(Object state) in C:\Asaf\wikimedia\bzreader\BzReader\Indexer.cs:line 584
       at System.Threading._ThreadPoolWaitCallback.WaitCallback_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading._ThreadPoolWaitCallback.PerformWaitCallbackInternal(_ThreadPoolWaitCallback tpWaitCallBack)
       at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(Object state)
  InnerException: 

Some quick thoughts:
1. We should upgrade the bundled Lucene.NET and Snowball.NET.  There's a ticket
already open for this, assigned to me.  I'll try to find time to make progress
with this.
2. We should be tolerant of any kind of exception during stemming and indexing,
so that BzReader itself doesn't crash, even when indexing failed completely.
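
To illustrate point 2, here is a minimal sketch of what "tolerant" could look like at the stemming step. It is not taken from the BzReader, Lucene.NET or Snowball.NET sources: `IStemmer`, `ThrowingStemmer`, `SafeStemming` and `StemOrOriginal` are hypothetical names invented for this example. The stand-in stemmer only mimics the failure in the trace above by asking `StringBuilder.ToString(Int32, Int32)` for more characters than the buffer holds.

```csharp
using System;
using System.Text;

// Hypothetical stemmer abstraction; NOT the real SF.Snowball API.
public interface IStemmer
{
    string Stem(string term);
}

// Stand-in that fails the same way as the trace: an out-of-range slice.
public sealed class ThrowingStemmer : IStemmer
{
    public string Stem(string term)
    {
        StringBuilder buffer = new StringBuilder(term);
        // Requesting one character more than the buffer contains
        // throws ArgumentOutOfRangeException.
        return buffer.ToString(0, term.Length + 1);
    }
}

public static class SafeStemming
{
    // Returns the stemmed term, or the unstemmed term if the stemmer throws,
    // so a single bad word does not abort the whole indexing run.
    public static string StemOrOriginal(IStemmer stemmer, string term)
    {
        try
        {
            return stemmer.Stem(term);
        }
        catch (Exception ex)
        {
            // Report the failure but keep going with the original term.
            Console.Error.WriteLine(
                "Stemming failed for '" + term + "': " + ex.GetType().Name);
            return term;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        string result = SafeStemming.StemOrOriginal(new ThrowingStemmer(), "ordbog");
        Console.WriteLine(result);
    }
}
```

Falling back to the unstemmed term keeps the word searchable in its exact form. Whether the real fix belongs inside the filter or around the `AddDocument` call in the indexer is a separate design choice that this sketch does not settle.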

I'll be looking into it later this week.

Original comment by asaf.bartov on 7 Apr 2010 at 8:51

GoogleCodeExporter commented 9 years ago
Looking forward to hearing any news. :-)

Original comment by SBenjam...@gmail.com on 13 Apr 2010 at 5:04

GoogleCodeExporter commented 9 years ago
See ticket #10 for upgrading Lucene.Net. The Snowball.Net that ships with
BzReader is already the latest one available for .Net.

Original comment by itamar.s...@gmail.com on 18 Jul 2010 at 10:16