hydromatic / morel

Standard ML interpreter, with relational extensions, implemented in Java
Apache License 2.0

File reader #209

Closed: julianhyde closed this issue 6 months ago

julianhyde commented 6 months ago

This facility adds a type-safe way to browse directories and sub-directories, and to read files as lists of records.

Suppose I am in a directory that has a sub-directory data, which has a sub-directory scott, which has files bonus.csv, dept.csv, emp.csv.gz, salgrade.csv:

$ ls -lR data
data:
total 4
drwxrwxr-x 2 jhyde jhyde 4096 Dec  9 13:04 scott

data/scott:
total 20
-rw-rw-r-- 1 jhyde jhyde  50 Dec  9 13:02 bonus.csv
-rw-rw-r-- 1 jhyde jhyde 130 Dec  9 13:00 dept.csv
-rw-rw-r-- 1 jhyde jhyde 420 Dec  9 13:00 emp.csv.gz
-rw-rw-r-- 1 jhyde jhyde 127 Dec  9 13:03 salgrade.csv

I can access these from Morel using the file object. For example, here are the contents of the file data/scott/dept.csv:

./morel
$ file.data.scott.dept;
val it =
  [{deptno=10,dname="ACCOUNTING",loc="NEW YORK"},
   {deptno=20,dname="RESEARCH",loc="DALLAS"},
   {deptno=30,dname="SALES",loc="CHICAGO"},
   {deptno=40,dname="OPERATIONS",loc="BOSTON"}]
  : {deptno:int, dname:string, loc:string} list

Each file is a list of records (obtained by parsing the CSV format); each directory is a record, and its fields are its constituent files and sub-directories. Here is the directory data/scott:

$ file.data.scott;
val it =
  {bonus=<relation>,dept=<relation>,emp=<relation>,salgrade=<relation>}
  : {bonus:{comm:real, ename:string, job:string, sal:real} list,
    dept:{deptno:int, dname:string, loc:string} list,
    emp:{comm:real, deptno:int, empno:int, ename:string, hiredate:string,
      job:string, mgr:int, sal:real} list,
    salgrade:{grade:int, hisal:real, losal:real} list}
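The mapping from files to record fields can be sketched in Java, the language Morel is implemented in. This is a minimal illustration, not Morel's actual code: the class CsvRelationSketch and its methods are hypothetical, the CSV is assumed to start with a plain header line of column names and to contain no quoting or escaping, and all values are kept as strings rather than typed as int, real, or string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not Morel's actual implementation. */
class CsvRelationSketch {
  /**
   * Derives a record field name from a file name by dropping everything
   * from the first dot onward, so "emp.csv.gz" becomes "emp".
   */
  static String fieldName(String fileName) {
    int dot = fileName.indexOf('.');
    return dot <= 0 ? fileName : fileName.substring(0, dot);
  }

  /**
   * Parses CSV text into a list of records, one per data line.
   * Assumes a header line of column names and well-formed rows.
   */
  static List<Map<String, String>> parse(String csv) {
    String[] lines = csv.strip().split("\\R");
    String[] header = lines[0].split(",");
    List<Map<String, String>> rows = new ArrayList<>();
    for (int i = 1; i < lines.length; i++) {
      String[] cells = lines[i].split(",");
      Map<String, String> row = new LinkedHashMap<>();
      for (int j = 0; j < header.length; j++) {
        row.put(header[j], cells[j]);
      }
      rows.add(row);
    }
    return rows;
  }
}
```

Under these assumptions, fieldName explains why dept.csv and emp.csv.gz appear above as the fields dept and emp, and parse produces one record per data line.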

In addition, a directory has special fields .., ~, and /, which take you to the parent directory, the user's home directory, and the root directory, respectively. For example, file.data.`..`.data.scott is equivalent to file.data.scott.
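The navigation rules for the special fields can be modeled with java.nio.file.Path. This is a rough sketch under stated assumptions, not Morel's implementation: the class NavigationSketch and its methods are hypothetical, and the choice to stay at the root when .. is applied to the root is an assumption the issue does not address.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch; not Morel's actual implementation. */
class NavigationSketch {
  /** Resolves a single field access against the directory dir. */
  static Path field(Path dir, String name) {
    switch (name) {
      case "..": {
        // Assumption: the parent of the root is the root itself.
        Path parent = dir.getParent();
        return parent == null ? dir : parent;
      }
      case "~":
        return Paths.get(System.getProperty("user.home"));
      case "/":
        return dir.toAbsolutePath().getRoot();
      default:
        // An ordinary field names a file or sub-directory.
        return dir.resolve(name);
    }
  }

  /** Applies a chain of field accesses, left to right. */
  static Path navigate(Path start, String... fields) {
    Path current = start;
    for (String f : fields) {
      current = field(current, f);
    }
    return current;
  }
}
```

With this model, navigate(start, "data", "..", "data", "scott") and navigate(start, "data", "scott") return equal paths for any absolute start directory, mirroring the equivalence stated above.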

The file value is the starting point for all navigation. It represents the current working directory.

Since Morel is strongly typed, there is a problem that is most noticeable when browsing a large file system: we would have to traverse every directory, and parse every file, in order to report the type of the file value. We solve this by introducing a new kind of type, called a partial record. Partial records work as follows.

When you first ask for the type of file, it reports a partial record:

$ file;
val it = {...}: {...}

Fields of a partial record are discovered progressively, on demand. Once you have browsed into the data sub-directory, the type of file has gained a new field, data:

$ file.data;
val it = {...}: {...}
$ file;
val it =
  {data={...}, ...}
  : {data: {...}, ...}

When we have asked for the type of dept, we know yet more about file and file.data:

$ file.data.scott.dept;
val it =
  [{deptno=10,dname="ACCOUNTING",loc="NEW YORK"},
   {deptno=20,dname="RESEARCH",loc="DALLAS"},
   {deptno=30,dname="SALES",loc="CHICAGO"},
   {deptno=40,dname="OPERATIONS",loc="BOSTON"}]
  : {deptno:int, dname:string, loc:string} list
$ file;
val it =
  {data={scott={dept=<relation>, ...}, ...}, ...}
  : {data: {scott: {dept: {deptno:int, dname:string, loc:string} list, ...}, ...}, ...}

The knowledge of a type increases over time, as fields are discovered, but never decreases. The type system never forgets a field it has seen once.
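The grow-only behavior of partial records can be illustrated with a small Java model. This is a sketch of the idea only, not how Morel's type system is implemented: the class PartialRecordType and its methods are hypothetical, file types are represented as plain strings, and a name is assumed never to be discovered as both a file and a directory.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a partial record type; not Morel's actual code. */
class PartialRecordType {
  // Fields discovered so far, sorted by name. Each value is either a
  // String (the known type of a file) or a nested PartialRecordType
  // (a sub-directory). There is deliberately no way to remove a field.
  private final Map<String, Object> fields = new TreeMap<>();

  /** Records that a sub-directory exists and returns its partial type. */
  PartialRecordType discoverDirectory(String name) {
    return (PartialRecordType)
        fields.computeIfAbsent(name, k -> new PartialRecordType());
  }

  /** Records the type of a file field; an earlier discovery is kept. */
  void discoverFile(String name, String type) {
    fields.putIfAbsent(name, type);
  }

  /** Prints known fields, then "..." to stand for undiscovered ones. */
  @Override public String toString() {
    StringBuilder sb = new StringBuilder("{");
    for (Map.Entry<String, Object> e : fields.entrySet()) {
      sb.append(e.getKey()).append(": ").append(e.getValue()).append(", ");
    }
    return sb.append("...}").toString();
  }
}
```

A fresh instance prints as {...}. After discoverDirectory("data") it prints as {data: {...}, ...}. Because the class only ever adds entries, repeating a discovery leaves the type unchanged.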

julianhyde commented 6 months ago

I posted a demo: https://www.youtube.com/watch?v=uybUjCYsBKI