metadata file change from FileStorage to Database

graykode commented 4 years ago

Currently, the metadata attributes are managed as a JSON file. As a long-term plan, however, we will change this so that metadata is stored in a database and can be accessed concurrently.

e.g., the current metadata:

{
    "additional": {
        "framework": "pytorch",
        "mode": "test"
    },
    "attributes": [
        {
            "itemsize": 0,
            "name": "image",
            "shape": [
                1,
                28,
                28
            ],
            "type": "float32"
        },
        {
            "itemsize": 0,
            "name": "target",
            "shape": [
                1
            ],
            "type": "int64"
        }
    ],
    "compressor": {
        "complevel": 0,
        "complib": "zlib"
    },
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {
            "length": 3335,
            "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"
        },
        "6670": {
            "length": 3335,
            "name": "tmpzzv9w4r94aac98a99ee74d52.h5"
        },
        "10000": {
            "length": 3330,
            "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"
        }
    }
}
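The keys of indexer in the example above look like cumulative end positions (3335, 6670, 10000), with length giving the number of rows stored in each file. Assuming that reading is correct, a lookup from a global sample index to a file and an offset could be sketched as follows. The helper name locate is hypothetical and is not part of matorage.

```python
from bisect import bisect_right

# Indexer copied from the example metadata. Treating the keys as
# cumulative, exclusive end positions is an assumption.
indexer = {
    "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
    "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
    "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
}


def locate(indexer, index):
    """Hypothetical helper: return (file name, offset inside that file)."""
    ends = sorted(int(key) for key in indexer)
    if not 0 <= index < ends[-1]:
        raise IndexError(f"index {index} is out of range")
    # First end position that is strictly greater than the index.
    end = ends[bisect_right(ends, index)]
    entry = indexer[str(end)]
    start = end - entry["length"]
    return entry["name"], index - start
```

Because the JSON keys are strings, every lookup has to convert and sort them first. A numeric column in a database would let the same query be answered with an ordered index instead.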

seongpyoHong commented 3 years ago

How about this ER Diagram?

Bold type : Primary Key
Italic type : Foreign key

graykode commented 3 years ago

@seongpyoHong

I have some comments on several of the DB column types.

The bucket id is a hash and is used as a string, so its type should be a string rather than an integer.
filetype is a list, and even when converted to a string it can exceed 255 characters, so a different type is needed.
additional is a dict, but like filetype it can exceed 255 characters when converted to a string, so a type such as Text seems more appropriate.

Additionally, for the attribute names, it's better to use snake_case (also called pothole case: lowercase letters and underscores).
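One way to keep list and dict values such as filetype and additional in a single column is to serialize them to JSON and store the result in a text column. This is a minimal sketch, assuming SQLite as a stand-in database. The table name bucket_demo, the id value, and the filetype values are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; TEXT columns are not limited to 255 characters.
conn.execute(
    "CREATE TABLE bucket_demo (id TEXT PRIMARY KEY, filetype TEXT, additional TEXT)"
)

additional = {"framework": "pytorch", "mode": "test"}
filetype = ["csv", "json"]  # made-up values; the example metadata has an empty list

# Serialize the Python objects to JSON strings before inserting them.
conn.execute(
    "INSERT INTO bucket_demo VALUES (?, ?, ?)",
    ("example-hashed-id", json.dumps(filetype), json.dumps(additional, sort_keys=True)),
)

# Read the row back and restore the original Python objects.
row = conn.execute(
    "SELECT filetype, additional FROM bucket_demo WHERE id = ?",
    ("example-hashed-id",),
).fetchone()
restored_filetype = json.loads(row[0])
restored_additional = json.loads(row[1])
```

The trade-off is that values inside the JSON string cannot be filtered or updated individually without parsing it, which matters more for a list that changes often than for a small dict.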

seongpyoHong commented 3 years ago

@graykode

Are there hashed IDs in other tables?
I'll change the data type from varchar to text and the naming convention to pothole notation.

graykode commented 3 years ago

@seongpyoHong I've revised the ER diagram as follows:

CREATE TABLE bucket (
  id varchar(255) primary key not null,
  additional text not null,
  dataset_name varchar(255) not null,
  endpoint varchar(255) not null,
  compressor varchar(255) not null,
  sagemaker boolean not null default false
);

CREATE TABLE files(
  id serial primary key not null,
  name varchar(255) not null,
  bucket_id varchar(255),
  constraint bucket_id foreign key (bucket_id) references bucket(id)
);

CREATE TABLE attributes (
  id serial primary key not null,
  name varchar(255) not null,
  type varchar(255) not null,
  shape varchar(255) not null,
  itemsize integer not null,
  bucket_id varchar(255),
  constraint bucket_id foreign key (bucket_id) references bucket(id)
);

CREATE TABLE indexer (
  id serial primary key not null,
  indexer_end bigint not null,
  length integer not null,
  name varchar(255) not null,
  bucket_id varchar(255),
  constraint bucket_id foreign key (bucket_id) references bucket(id)
);

Since filetype is a list that is frequently modified, I decided it would be better to move this attribute into its own table.

One minor addition: as far as I know, variable-length strings (Text) can slow down the DB. So how about declaring the additional attribute, the only one that uses the Text type, as varchar(n) with a very large n?
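To see how the JSON metadata might map onto these tables, here is a rough sketch that loads the earlier example into an in-memory SQLite database. Several parts are assumptions: SQLite stands in for the real RDBMS (so serial becomes INTEGER PRIMARY KEY and varchar becomes TEXT), the files table is left out because the example filetype list is empty, and the SHA-1 bucket id is only a placeholder since the actual hashing scheme is not shown in this thread.

```python
import hashlib
import json
import sqlite3

metadata = {
    "additional": {"framework": "pytorch", "mode": "test"},
    "attributes": [
        {"itemsize": 0, "name": "image", "shape": [1, 28, 28], "type": "float32"},
        {"itemsize": 0, "name": "target", "shape": [1], "type": "int64"},
    ],
    "compressor": {"complevel": 0, "complib": "zlib"},
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
        "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
        "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
    },
}

# SQLite adaptation of the proposed schema (an assumption, see above).
SCHEMA = """
CREATE TABLE bucket (
  id TEXT PRIMARY KEY, additional TEXT NOT NULL, dataset_name TEXT NOT NULL,
  endpoint TEXT NOT NULL, compressor TEXT NOT NULL,
  sagemaker BOOLEAN NOT NULL DEFAULT 0);
CREATE TABLE attributes (
  id INTEGER PRIMARY KEY, name TEXT NOT NULL, type TEXT NOT NULL,
  shape TEXT NOT NULL, itemsize INTEGER NOT NULL,
  bucket_id TEXT REFERENCES bucket(id));
CREATE TABLE indexer (
  id INTEGER PRIMARY KEY, indexer_end INTEGER NOT NULL, length INTEGER NOT NULL,
  name TEXT NOT NULL, bucket_id TEXT REFERENCES bucket(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder id: SHA-1 over dataset_name and additional (hypothetical scheme).
key = json.dumps(
    {"dataset_name": metadata["dataset_name"], "additional": metadata["additional"]},
    sort_keys=True,
)
bucket_id = hashlib.sha1(key.encode("utf-8")).hexdigest()

with conn:  # one transaction: either every row is written or none is
    conn.execute(
        "INSERT INTO bucket (id, additional, dataset_name, endpoint, compressor) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            bucket_id,
            json.dumps(metadata["additional"], sort_keys=True),
            metadata["dataset_name"],
            metadata["endpoint"],
            json.dumps(metadata["compressor"], sort_keys=True),
        ),
    )
    conn.executemany(
        "INSERT INTO attributes (name, type, shape, itemsize, bucket_id) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (a["name"], a["type"], json.dumps(a["shape"]), a["itemsize"], bucket_id)
            for a in metadata["attributes"]
        ],
    )
    conn.executemany(
        "INSERT INTO indexer (indexer_end, length, name, bucket_id) VALUES (?, ?, ?, ?)",
        [
            (int(end), entry["length"], entry["name"], bucket_id)
            for end, entry in metadata["indexer"].items()
        ],
    )

# Files of one bucket, ordered by their end position.
rows = conn.execute(
    "SELECT indexer_end, name FROM indexer WHERE bucket_id = ? ORDER BY indexer_end",
    (bucket_id,),
).fetchall()
```

Storing indexer_end as a number lets the database order the files itself, which the string keys in the JSON file cannot do without conversion.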

graykode commented 3 years ago

Versions prior to 0.4.0 managed metadata in JSON format, so S3 alone was enough to store it. However, now that metadata is managed by an RDBMS, we need to write code that supports AWS RDS.

graykode / matorage

metadata file change from FileStorage to Database #26