microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0

Fixing bug where large index files weren't being read fully #489

Closed alex-shchetkov closed 3 years ago

alex-shchetkov commented 3 years ago

What is the context for this pull request?

I ran into an issue where I was unable to use any of the created indexes, because a JSON parser reported that it had encountered invalid characters.

This was misleading, because the actual issue was that only a portion of the index file was being read.

What changes were proposed in this pull request?

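Changing the read() call on the stream opened from the FileSystem to readFully(). A single read() call may return fewer bytes than requested, so it does not always read in the full file, whereas readFully() keeps reading until the buffer is filled.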
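To illustrate the difference, here is a minimal sketch that uses only java.io and is not the actual Hyperspace code. ReadFullyDemo, ChunkedStream, and the 4096-byte chunk size are hypothetical names and values chosen for the example. ChunkedStream imitates a remote store that hands back a limited number of bytes per call.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ReadFullyDemo {
    /** Hypothetical stream that returns at most `chunk` bytes per bulk read call. */
    static final class ChunkedStream extends InputStream {
        private final InputStream in;
        private final int chunk;

        ChunkedStream(byte[] data, int chunk) {
            this.in = new ByteArrayInputStream(data);
            this.chunk = chunk;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Short read: never hand back more than `chunk` bytes at once.
            return in.read(b, off, Math.min(len, chunk));
        }
    }

    /** One read() call: returns how many bytes were actually placed in the buffer. */
    static int singleRead(byte[] data, int chunk) throws IOException {
        byte[] buf = new byte[data.length];
        try (DataInputStream s = new DataInputStream(new ChunkedStream(data, chunk))) {
            return s.read(buf);
        }
    }

    /** readFully(): loops internally until the whole buffer is filled (or throws EOFException). */
    static byte[] readFully(byte[] data, int chunk) throws IOException {
        byte[] buf = new byte[data.length];
        try (DataInputStream s = new DataInputStream(new ChunkedStream(data, chunk))) {
            s.readFully(buf);
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        Arrays.fill(data, (byte) '{');
        System.out.println("read() filled " + singleRead(data, 4096) + " of " + data.length + " bytes");
        System.out.println("readFully() filled everything: " + Arrays.equals(data, readFully(data, 4096)));
    }
}
```

With a single read(), the rest of the buffer stays zeroed, and a parser that receives such a buffer sees unexpected characters after the valid prefix. That would be consistent with the misleading JSON error described above.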

This fix very likely resolves the following as well: https://github.com/microsoft/hyperspace/discussions/431 https://github.com/microsoft/hyperspace/issues/373 https://github.com/microsoft/hyperspace/issues/297#issuecomment-747502799 (point #2)

Does this PR introduce any user-facing change?

No

How was this patch tested?

I compiled and packaged the code and ran it on an EMR (Spark 3.1) cluster to generate a relatively large index file (8 MB in my case) in an S3 location. With this change I was able to use the index to run a query.