phymbert / spark-search

Spark Search - high performance advanced search features based on Apache Lucene
Apache License 2.0

IndexNotFoundException #192

Open StackTraceYo opened 2 years ago

StackTraceYo commented 2 years ago

Hello - I'm attempting to use this library (v0.2) on YARN, with my driver running on the cluster.

I am encountering the following exception:

Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in MMapDirectory@/local/hadoop/disksdl/yarn/nodemanager/usercache/spotci/appcache/application_1617967855014_1171701/container_e136_1617967855014_1171701_02_000001/tmp/spark-search/application_1617967855014_1171701-sparksearch-rdd0-index-3 lockFactory=org.apache.lucene.store.NoLockFactory@4a1941a4: files: []
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:715)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)

I'm wondering if there is any info on where to start looking for why the index directory would be empty?

thanks

StackTraceYo commented 2 years ago

@phymbert any input would be appreciated.

I also sometimes see a similar error, but where there are other files in the index:

org.apache.lucene.index.IndexNotFoundException: no segments* file found in MMapDirectory@/local/hadoop/disksdl/yarn/nodemanager/usercache/spotci/appcache/application_1617967855014_1203855/container_e136_1617967855014_1203855_02_000001/tmp/spark-search/application_1617967855014_1203855-sparksearch-rdd0-index-0 lockFactory=org.apache.lucene.store.NoLockFactory@2a910ebf: files: [_0.fdt, _0_Lucene84_0.tip]

I'm wondering if it has to do with how I'm loading queries. I have queries saved as Parquet, which I load using standard spark.read.parquet so it's a Dataset, then I call:

searchRDD.searchJoinQuery[SearchQuery](
  queries.rdd,
  queryBuilder = queryStringBuilder(_.q),
  topKByPartition,
  minScore
)

SearchQuery is just a case class wrapper around a string
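For reference, a minimal sketch of that wrapper (the field name q is assumed from the queryStringBuilder(_.q) call above; the real class may differ):

// Hypothetical query wrapper; field name q inferred from queryStringBuilder(_.q)
case class SearchQuery(q: String)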

If I create a Dataset directly in Spark by passing a sequence, like spark.createDataset(Seq(..)), I don't see this problem.
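To illustrate, a minimal sketch of the two ways I build the query Dataset (the Parquet path and sample values are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Path that hits the IndexNotFoundException: queries loaded from Parquet.
// "/path/to/queries.parquet" is a placeholder.
val parquetQueries = spark.read.parquet("/path/to/queries.parquet").as[SearchQuery]

// Path that works: a Dataset built directly from an in-memory sequence.
val inlineQueries = spark.createDataset(Seq(SearchQuery("foo"), SearchQuery("bar")))

// Both Datasets are passed to searchJoinQuery the same way, e.g. parquetQueries.rdd as in the call above.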