samvera / valkyrie

A Data Mapper library to enable multiple backends for storage of files and metadata in Samvera
Other
34 stars 23 forks source link

Allow string properties longer than 1000 characters to be indexed as text in all the usual ways, not just *_tsim. #962

Closed dchandekstark closed 2 months ago

dchandekstark commented 2 months ago

See https://github.com/samvera/valkyrie/issues/961.

I did not add a test, since there was not an existing one, and the setup seemed potentially complicated. Assist welcome!

dchandekstark commented 2 months ago

Heh, I see there is effectively a test, just not where I was looking. It uses a value of 100,000 chars ... so I wonder if that's hitting a different limit?

dchandekstark commented 2 months ago

Yes, the test in question was putting a string of 100,000 a's in the title, and presumably that caused the tokenizer to error. I have replace that with a file of generated English text.

tpendragon commented 2 months ago

Yes, the test in question was putting a string of 100,000 a's in the title, and presumably that caused the tokenizer to error. I have replace that with a file of generated English text.

So the test fails with all the "a"s? I think we added this because of a bunch of H-OCR we were indexing. If the field has to be english text, that might break my indexing.

dchandekstark commented 2 months ago

So the test fails with all the "a"s? I think we added this because of a bunch of H-OCR we were indexing. If the field has to be english text, that might break my indexing.

I can't speak to that concern, but the status quo is breaking my indexing ... :( My first thought was to make the cutoff configurable. I suppose that would best preserve the current behavior of the library, while letting me increase the limit on my end. If you agree, I'm happy to revise the PR accordingly.

tpendragon commented 2 months ago

Hmm, yeah, we gotta get this fixed for you. Maybe I'll run my test suite with this and see if it works.