In the current version, HDFSInputFormat reads only the top level of the given path. For example, if the path is /data, it lists the directory /data and reads its entries, which must all be files, such as /data/a and /data/b.
To be more flexible, it could support reading an organized path recursively (assuming all files are in the leaf directories). For example, if the data is stored under a time-based path like /data/year/month/dates/FILES, it is more convenient to scan everything under '/data' than to give a concrete path such as '/data/year/month/dates'. Of course, we need to set a maximum recursion depth to avoid reading an excessive amount of data.
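
As a rough illustration of the depth-limited scan proposed here, the following is a minimal sketch that runs against the local filesystem rather than HDFS. The function name `list_files` and the parameter `max_depth` are hypothetical and not part of HDFSInputFormat; a real implementation would use the HDFS client's directory listing instead of `os.scandir`.

```python
import os
from typing import Iterator


def list_files(root: str, max_depth: int) -> Iterator[str]:
    """Yield files under root, descending at most max_depth directory levels.

    max_depth=0 lists only the files directly inside root, which matches
    the current non-recursive behavior. (Hypothetical helper, local FS only.)
    """
    if max_depth < 0:
        raise ValueError("max_depth must be >= 0")
    with os.scandir(root) as it:
        # Sort entries so the output order is deterministic.
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_file():
            yield entry.path
        elif entry.is_dir(follow_symlinks=False) and max_depth > 0:
            # Descend one level and spend one unit of the depth budget.
            yield from list_files(entry.path, max_depth - 1)
```

With a layout like /data/year/month/dates/FILES, calling `list_files("/data", 3)` would reach the files, while a smaller `max_depth` would stop before the leaf directories and return nothing. Directories below the limit are skipped silently in this sketch; whether to skip or raise an error there is a design choice left open.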