nickna / Neighborly

An open-source vector database
MIT License

feat: Add support for reading Parquet files in ETL #48

Closed: hangy closed this 2 weeks ago

hangy commented 2 weeks ago

## 📝 Description

Import Parquet files by reading them with `ReadAsTableAsync`.

## 🔗 Related Issues

Fixes #44

## 💡 Additional Notes

This approach works with Neighborly's own export, as well as with the Wikipedia file mentioned in https://github.com/nickna/Neighborly/issues/44#issuecomment-2161521910. In the future, it might be useful to make the field names configurable in some way, so that files whose text column has a different name (when there are several string columns) or files that contain multiple float arrays can be imported. Right now, those cases are basically unsupported.
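To make the limitation concrete, here is a minimal, library-free sketch in Python (not the project's C#) of how an automatic column choice with an optional configured override could behave. Everything in it is hypothetical: the `choose_column` helper, the `configured_name` parameter, the dict-of-lists table shape, and the sample column names are illustrative assumptions, not Neighborly's or the Parquet library's actual API, and the detection rule (exactly one matching column) is only a guess at how an automatic approach might work.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory table: column name -> list of row values.
Table = Dict[str, List[object]]


def is_text(column: List[object]) -> bool:
    """True if the column is non-empty and every value is a string."""
    return len(column) > 0 and all(isinstance(v, str) for v in column)


def is_vector(column: List[object]) -> bool:
    """True if the column is non-empty and every value is a list of floats."""
    return len(column) > 0 and all(
        isinstance(v, list) and all(isinstance(x, float) for x in v)
        for v in column
    )


def choose_column(
    table: Table,
    predicate: Callable[[List[object]], bool],
    configured_name: Optional[str] = None,
) -> str:
    """Pick a column by explicit name, or automatically if it is unambiguous."""
    if configured_name is not None:
        if configured_name not in table:
            raise KeyError(f"no column named {configured_name!r}")
        if not predicate(table[configured_name]):
            raise ValueError(f"column {configured_name!r} has the wrong type")
        return configured_name

    # Automatic mode only works when exactly one column matches.
    candidates = [name for name, col in table.items() if predicate(col)]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one matching column, found {candidates}")
    return candidates[0]


# Illustrative data: two string columns and one float-array column.
table: Table = {
    "title": ["A", "B"],
    "text": ["first document", "second document"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
}

vector_name = choose_column(table, is_vector)                   # unambiguous
text_name = choose_column(table, is_text, configured_name="text")  # needs config

try:
    choose_column(table, is_text)  # two string columns: ambiguous
    ambiguous_failed = False
except ValueError:
    ambiguous_failed = True
```

In this sketch the single float-array column is found automatically, while the text column has to be named explicitly because two string columns match.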

The main issue with the previous approach of using `if (data.Data is float[] d)` seems to be that `data.Data` contains the values of all rows flattened into one array. The Parquet library uses some internal functions to handle that in `ReadAsTableAsync`. An obvious downside of this approach could be its peak memory allocations.
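The flattening described above follows from how Parquet encodes repeated fields: the values of all rows are stored back to back, and a parallel list of repetition levels marks where each row starts. The following is a minimal Python sketch of that idea, not the Parquet library's internal code. The `split_rows` helper and the sample data are hypothetical, and it assumes a single-level list column with no nulls or empty lists, so definition levels are ignored.

```python
from typing import List


def split_rows(values: List[float], repetition_levels: List[int]) -> List[List[float]]:
    """Rebuild per-row float lists from a flat value array.

    A repetition level of 0 starts a new row; a level of 1 continues
    the current row's list.
    """
    if len(values) != len(repetition_levels):
        raise ValueError("values and repetition_levels must have the same length")

    rows: List[List[float]] = []
    for value, level in zip(values, repetition_levels):
        if level == 0:
            rows.append([value])      # first element of a new row
        else:
            if not rows:
                raise ValueError("first repetition level must be 0")
            rows[-1].append(value)    # further element of the current row
    return rows


# Two rows with three-dimensional vectors, stored as one flat array.
flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
levels = [0, 1, 1, 0, 1, 1]

rows = split_rows(flat, levels)
```

Reading only the flat array, as a `float[]` check would, loses the row boundaries that the levels carry, which matches the problem described above.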

nickna commented 2 weeks ago

LGTM. I agree with adding configuration in a future revision. For example, that will become necessary if someone adds a Parquet table with multiple float[] arrays. Someone might do that if they use multiple vectorizers for the same corpus. For now, the automatic approach lowers the barrier to entry.