mbraceproject / MBrace.Core

MBrace Core Libraries & Runtime Foundations
http://mbrace.io/
Apache License 2.0
211 stars, 46 forks

Intelligent CloudFlow partitioned indexing #152

Open isaacabraham opened 8 years ago

isaacabraham commented 8 years ago

One of the things that Hive, for example, lets you do is define indexes on flat files based on their physical layout. Imagine a folder structure like:

{country}/{city}/{companyName}.txt

In Hive you can provide hints about the layout above, so that it searches only the files that match, e.g. UK/London, rather than scanning every file. Is this something that is (a) needed in MBrace, and (b) achievable?
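To make the idea concrete, here is a minimal sketch of path-based partition pruning, written in Python purely for illustration. It is not MBrace or Hive API: the function names `compile_layout` and `prune` and the sample paths are hypothetical. It assumes the store can list file paths cheaply and that every partition key is a single path segment.

```python
import re

def compile_layout(template):
    """Turn a layout such as '{country}/{city}/{companyName}.txt' into a
    regex with one named group per placeholder (hypothetical helper)."""
    pattern = ""
    # Splitting on a capturing group keeps the placeholders in the result.
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            # A partition key is assumed to be exactly one path segment.
            pattern += "(?P<%s>[^/]+)" % part[1:-1]
        else:
            pattern += re.escape(part)
    return re.compile("^" + pattern + "$")

def prune(paths, template, **wanted):
    """Keep only the paths whose partition keys equal the requested values,
    without opening any file."""
    layout = compile_layout(template)
    selected = []
    for path in paths:
        match = layout.match(path)
        if match is None:
            continue  # path does not follow the declared layout
        keys = match.groupdict()
        if all(keys.get(name) == value for name, value in wanted.items()):
            selected.append(path)
    return selected

# Made-up sample listing, for illustration only.
paths = [
    "UK/London/Acme.txt",
    "UK/Leeds/Bolt.txt",
    "FR/Paris/Cie.txt",
    "UK/London/Zed.txt",
    "README.md",
]
print(prune(paths, "{country}/{city}/{companyName}.txt",
            country="UK", city="London"))
# prints ['UK/London/Acme.txt', 'UK/London/Zed.txt']
```

Only the two surviving paths would then need to be read, which is the saving the question is asking about.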

dsyme commented 8 years ago

I believe this use of folders as an implicit partitioning/indexing structure is one of the main reasons that Hadoop/Hive/HDFS have been successful - and hence Spark too. The ease with which people can organize masses of data using mostly normal Unix file system commands and then have it partitioned implicitly is very impressive.

I'd love to see these ideas brought into MBrace more completely. I believe one piece of the puzzle is to have an "mstore.exe" or "mb.exe" command-line utility that can be used in the obvious ways:

mstore cp *.txt /data/foo/*.txt
mstore rm /data/foo/**/logs/*.log
mstore mv ......
mstore ls ....
mstore mkdir ....

just like HDFS and a bit like azure.exe, but working with any MBrace system. The configuration or get-cluster mechanism that binds to Azure, Thespian, or AWS would have to be implicit, e.g. via user environment variables.
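For the `**` pattern used in the `mstore rm` example, one possible matching rule is sketched below in Python. This is an assumption about how such a tool might interpret globs, not a description of any existing MBrace utility. The function `glob_to_regex` is hypothetical. It treats `**/` as "zero or more whole directories" and `*` as "anything within one path segment".

```python
import re

def glob_to_regex(pattern):
    """Translate a store glob into a compiled regex (hypothetical helper).

    Assumed rules: '**/' spans zero or more directories, a bare '**'
    matches anything, '*' and '?' never cross a '/'.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")          # exactly one non-separator character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

rx = glob_to_regex("/data/foo/**/logs/*.log")
for path in ["/data/foo/logs/a.log",
             "/data/foo/x/y/logs/a.log",
             "/data/foo/x/logs/sub/a.log",
             "/data/bar/logs/a.log"]:
    print(path, bool(rx.match(path)))
```

Under these rules the first two paths match and the last two do not. A real tool would also have to decide how the pattern is quoted so that the local shell does not expand it first.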