Closed hangy closed 2 weeks ago
LGTM. I agree with adding configurations in a future revision. For example, that will become needed if someone adds a Parquet table with multiple float[] arrays. Someone might do that if they use multiple vectorizers for the same corpus. For now, the automatic approach lowers the barrier to entry.
## 📝 Description
Import Parquet files by using `ReadAsTableAsync`.

## 🔗 Related Issues
Fixes #44
## 💡 Additional Notes
This approach works with Neighborly's own export, as well as with the Wikipedia file mentioned in https://github.com/nickna/Neighborly/issues/44#issuecomment-2161521910. In the future, it might be useful to make the field names configurable in some way, so that files with differently named text columns, multiple string columns, or multiple float arrays can be imported. Right now, that's basically unsupported.
The main issue with the previous approach of using `if (data.Data is float[] d)` seems to be that `data.Data` contains the values of all rows conflated into one array. The Parquet library uses some internal functions to handle that in `ReadAsTableAsync`. An obvious downside of this could be the peak memory allocations.
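
To illustrate the conflation, here is a minimal sketch of how a flattened column could be split back into one `float[]` per row. It assumes the column comes with one Parquet repetition level per value, where level 0 marks the first value of a new row. `FlatColumnSketch` and `SplitRows` are hypothetical names for illustration only; this is not the code in this PR and not the Parquet library's internal implementation, and it ignores empty or null lists.

```csharp
using System;
using System.Collections.Generic;

public static class FlatColumnSketch
{
    // Hypothetical helper, not part of Neighborly or the Parquet library.
    // Assumption: repetitionLevels[i] == 0 means values[i] starts a new row.
    public static List<float[]> SplitRows(float[] values, int[] repetitionLevels)
    {
        if (values.Length != repetitionLevels.Length)
            throw new ArgumentException("Expected one repetition level per value.");

        var rows = new List<float[]>();
        var current = new List<float>();

        for (int i = 0; i < values.Length; i++)
        {
            // A level of 0 closes the row collected so far (if any).
            if (repetitionLevels[i] == 0 && current.Count > 0)
            {
                rows.Add(current.ToArray());
                current.Clear();
            }
            current.Add(values[i]);
        }

        // Flush the last row.
        if (current.Count > 0)
            rows.Add(current.ToArray());

        return rows;
    }
}
```

For example, calling `SplitRows` with the values `{ 1, 2, 3, 4, 5, 6 }` and the levels `{ 0, 1, 1, 0, 1, 1 }` yields two rows with three values each. Treating the flat array as a single vector would instead produce one six-dimensional vector.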