tenzir / public-roadmap

The public roadmap of Tenzir
https://docs.tenzir.com/roadmap

HDFS Connector #83

Closed: dominiklohmann closed this issue 5 months ago

dominiklohmann commented 9 months ago

Apache Arrow comes with an HDFS filesystem abstraction, much like the ones for S3 and GCS behind our existing connectors. We can use it to implement an `hdfs` connector.
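To make the idea concrete, here is a minimal sketch of what a loader and a saver boil down to on top of Arrow's filesystem layer. It uses the Python bindings (`pyarrow`) for brevity rather than the C++ API a real connector would use, and the function names `load_bytes` and `save_bytes` are hypothetical. Running it against `hdfs://` URIs assumes a working `libhdfs` and Hadoop client setup.

```python
def load_bytes(uri: str) -> bytes:
    """Read a whole object via Arrow's filesystem abstraction (loader side)."""
    # Imported lazily: needs pyarrow, and libhdfs for hdfs:// URIs.
    from pyarrow import fs

    # from_uri picks the filesystem implementation from the URI scheme
    # (e.g. hdfs://, s3://, gs://, file://) and returns it with the path.
    filesystem, path = fs.FileSystem.from_uri(uri)
    with filesystem.open_input_stream(path) as stream:
        return stream.read()


def save_bytes(uri: str, data: bytes) -> None:
    """Write bytes via Arrow's filesystem abstraction (saver side)."""
    from pyarrow import fs

    filesystem, path = fs.FileSystem.from_uri(uri)
    with filesystem.open_output_stream(path) as stream:
        stream.write(data)
```

Because the filesystem is resolved from the URI scheme, the same two functions would also work for the S3 and GCS cases; only the URI changes.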

### Definition of Done
- [ ] Agree upon the desired arguments for the loader
- [ ] Agree upon the desired arguments for the saver
- [ ] Implement the `hdfs` loader
- [ ] Implement the `hdfs` saver

mavam commented 9 months ago

Note that HDFS is also a way to access Azure Data Lake Storage (ADLS), so this issue is closely linked to https://github.com/tenzir/public-roadmap/issues/82.

Hadoop has a dedicated module that exposes ADLS via a URL of the form `adl://<Account Name>.azuredatalakestore.net/`. While designing the `hdfs` loader and saver, we should think about whether we want to provide an `adl` shim that sets things up for a seamless ADLS experience through HDFS.
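As a rough illustration of the first thing such a shim would have to do, the sketch below splits an `adl://` URL of the form above into the account name and the path. The function name `parse_adl_url` and its validation rules are assumptions for illustration only, not an agreed design.

```python
from urllib.parse import urlsplit

# Host suffix taken from the URL form adl://<Account Name>.azuredatalakestore.net/
ADLS_SUFFIX = ".azuredatalakestore.net"


def parse_adl_url(url: str) -> tuple[str, str]:
    """Split an adl:// URL into (account name, path). Hypothetical helper."""
    parts = urlsplit(url)
    if parts.scheme != "adl":
        raise ValueError(f"expected adl:// URL, got scheme {parts.scheme!r}")
    host = parts.hostname or ""
    # Require a non-empty account name in front of the ADLS suffix.
    if not host.endswith(ADLS_SUFFIX) or host == ADLS_SUFFIX:
        raise ValueError(f"unexpected ADLS host {host!r}")
    account = host[: -len(ADLS_SUFFIX)]
    return account, parts.path or "/"


if __name__ == "__main__":
    print(parse_adl_url("adl://myaccount.azuredatalakestore.net/logs/a.json"))
```

A shim could then use the account name to fill in the Hadoop-side configuration before handing the path to the HDFS-backed loader or saver.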

dominiklohmann commented 9 months ago

Note that support for Azure in Arrow's filesystem abstraction is currently being worked on. Not sure when it'll be there, but that may soon be an option as well.

dominiklohmann commented 5 months ago

We do not see a need for this currently.