Closed: dchandekstark closed this 2 months ago
Heh, I see there is effectively a test, just not where I was looking. It uses a value of 100,000 chars ... so I wonder if that's hitting a different limit?
Yes, the test in question was putting a string of 100,000 a's in the title, and presumably that caused the tokenizer to error. I have replaced that with a file of generated English text.
So the test fails with all the "a"s? I think we added this because of a bunch of H-OCR we were indexing. If the field has to be English text, that might break my indexing.
I can't speak to that concern, but the status quo is breaking my indexing ... :( My first thought was to make the cutoff configurable. I suppose that would best preserve the current behavior of the library, while letting me increase the limit on my end. If you agree, I'm happy to revise the PR accordingly.
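To illustrate the idea, here is a minimal sketch of a configurable cutoff; the class, constant, and option names are hypothetical and not taken from Valkyrie's actual code, and the default value is a placeholder:

```ruby
# Hypothetical sketch: a truncation cutoff that callers can override instead
# of relying on a hard-coded constant. All names here are illustrative only.
class SolrFieldTruncator
  DEFAULT_CUTOFF = 1_000 # placeholder default, not Valkyrie's real value

  attr_reader :cutoff

  def initialize(cutoff: DEFAULT_CUTOFF)
    @cutoff = cutoff
  end

  # Returns the value unchanged if it is short enough; otherwise truncates it
  # so Solr never receives a field value longer than the configured cutoff.
  def call(value)
    return value unless value.is_a?(String) && value.length > cutoff

    value[0, cutoff]
  end
end

# Usage: an application indexing long H-OCR text could raise the limit.
long_ocr_text = "a" * 200_000
truncator = SolrFieldTruncator.new(cutoff: 500_000)
truncated = truncator.call(long_ocr_text)
```

Applications that never see long fields would keep the default, so the current behavior of the library would be preserved unless someone opts in to a higher limit.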
Hmm, yeah, we gotta get this fixed for you. Maybe I'll run my test suite with this and see if it works.
See https://github.com/samvera/valkyrie/issues/961.
I did not add a test, since there was not an existing one and the setup seemed potentially complicated. Assistance welcome!
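For what it's worth, a test along these lines might work, assuming a running Solr core and Valkyrie's standard Solr metadata adapter; the resource class, connection URL, and title length are placeholders rather than values copied from the existing suite:

```ruby
# Hypothetical spec sketch: persist a resource whose title is a long run of
# word-like text and confirm the persister does not raise. Connection details
# and the resource class are illustrative assumptions.
require 'spec_helper'

RSpec.describe Valkyrie::Persistence::Solr::Persister do
  let(:connection) { RSolr.connect(url: 'http://localhost:8983/solr/valkyrie-test') }
  let(:adapter)    { Valkyrie::Persistence::Solr::MetadataAdapter.new(connection: connection) }

  before do
    class LongTextResource < Valkyrie::Resource
      attribute :title
    end
  end
  after { Object.send(:remove_const, :LongTextResource) }

  it 'persists a title longer than the previous cutoff' do
    # Roughly 130,000 characters of English-like text rather than a single
    # immense token, mirroring the change described above.
    long_title = (['lorem ipsum dolor sit amet'] * 5_000).join(' ')
    resource = LongTextResource.new(title: long_title)

    expect { adapter.persister.save(resource: resource) }.not_to raise_error
  end
end
```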