husky-team / husky

A more expressive and, most importantly, more efficient system for distributed data analytics.
http://www.husky-project.com/

Recursive/Incremental file listing in HDFSInputFormat #302

Open kygx-legend opened 6 years ago

kygx-legend commented 6 years ago

In the current version, HDFSInputFormat reads only the top level of the given directory (path). For example, if the path is /data, it lists the directory /data and reads the items in it (which must be files), such as /data/a and /data/b.

To be more flexible, it could support reading an organized path recursively (where all files are in the leaf directories). For example, if the data is stored under a time-based path like /data/year/month/dates/FILES, it would be preferable to scan all items under the path '/data' rather than having to give a concrete path such as '/data/year/month/dates'. Of course, we need to set a maximum recursion depth to avoid reading a tremendous number of files.
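A minimal sketch of the depth-limited recursion described above. It is not Husky's actual code: `Entry`, `ListFn`, and `list_files_recursive` are hypothetical names, and the directory listing is abstracted behind a callback so the logic does not depend on a particular HDFS client call (in practice the callback might wrap something like libhdfs `hdfsListDirectory`).

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical directory entry; a real implementation would fill this in
// from the HDFS client's listing result.
struct Entry {
    std::string path;
    bool is_dir;
};

// Hypothetical callback that returns the direct children of one directory.
using ListFn = std::function<std::vector<Entry>(const std::string&)>;

// Collect the files under `root`, descending at most `max_depth` directory
// levels below it. With max_depth == 0 only the files directly inside
// `root` are returned, which is close to the current behaviour.
void list_files_recursive(const std::string& root, int max_depth,
                          const ListFn& list_dir,
                          std::vector<std::string>& out) {
    for (const Entry& e : list_dir(root)) {
        if (!e.is_dir) {
            out.push_back(e.path);
        } else if (max_depth > 0) {
            list_files_recursive(e.path, max_depth - 1, list_dir, out);
        }
        // Directories beyond the depth limit are skipped.
    }
}
```

With a layout like /data/year/month/dates/FILES, calling this with root '/data' and a `max_depth` of 3 would reach the files, while a smaller limit would stop before them.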

ddmbr commented 6 years ago

It would also be better to avoid listing all the files at once, since there could be too many of them. We could list the files batch by batch.
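One way the batch-by-batch idea could look, as a sketch under the same assumptions: `BatchedFileLister`, `Entry`, and `ListFn` are hypothetical names, and the listing call is again a callback rather than a real HDFS API. Directories are expanded lazily, one at a time, only when the caller asks for more files. Note that each individual directory is still listed in full, so this bounds memory only when no single directory is huge.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical directory entry and listing callback (stand-ins for the
// HDFS client).
struct Entry {
    std::string path;
    bool is_dir;
};
using ListFn = std::function<std::vector<Entry>(const std::string&)>;

class BatchedFileLister {
public:
    BatchedFileLister(std::string root, int max_depth, ListFn list_dir)
        : list_dir_(std::move(list_dir)) {
        pending_dirs_.push_back({std::move(root), max_depth});
    }

    // Return up to `batch_size` file paths (batch_size must be > 0).
    // An empty result means every reachable file has been returned.
    std::vector<std::string> next_batch(std::size_t batch_size) {
        std::vector<std::string> batch;
        while (batch.size() < batch_size) {
            if (buffered_files_.empty()) {
                if (pending_dirs_.empty()) break;  // nothing left to expand
                expand_one_directory();
                continue;
            }
            batch.push_back(std::move(buffered_files_.front()));
            buffered_files_.pop_front();
        }
        return batch;
    }

private:
    // List one pending directory: buffer its files and queue its
    // subdirectories if the depth limit still allows descending.
    void expand_one_directory() {
        std::pair<std::string, int> dir = std::move(pending_dirs_.front());
        pending_dirs_.pop_front();
        for (const Entry& e : list_dir_(dir.first)) {
            if (!e.is_dir) {
                buffered_files_.push_back(e.path);
            } else if (dir.second > 0) {
                pending_dirs_.push_back({e.path, dir.second - 1});
            }
        }
    }

    ListFn list_dir_;
    std::deque<std::pair<std::string, int>> pending_dirs_;  // (path, depth left)
    std::deque<std::string> buffered_files_;
};
```

A caller would invoke `next_batch` repeatedly and hand each batch to the workers until it returns an empty vector.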