attribute with string type

jinserk commented 4 years ago

Hi @graykode!

I have a question about saving a dataset that includes a string-type vector. My dataset has string identifiers. They aren't used in training or in the prediction calculation, but I use them when analyzing the predictions, especially the outliers. Of course I could build an integer vector that maps an integer id to each string identifier, but I noticed that DataAttribute supports the string type, so I tried to use it. Here is the code I wrote:

        data_config = nas.DataConfig(
            endpoint="127.0.0.1:9000",
            access_key="...",
            secret_key="...",
            dataset_name="...",
            additional={
                "dataset": "train",
                "len": len(d_list),
                "framework": "pytorch",
                "dttm": tz.localtime().strftime("%Y%m%d_%H%M%S%z")
            },
            compressor={
                "complevel" : 9,
                "complib" : "zlib",
            },
            attributes=[
                nas.DataAttribute('id', 'string', (1, ), 30),
                nas.DataAttribute('fp', 'bool', (IN_DIM, )),
                nas.DataAttribute('target', 'float32', (OUT_DIM, )),
            ]
        )

        data_saver = nas.DataSaver(config=data_config, refresh=True)

        for x in tqdm(d_list, dynamic_ncols=True, desc=d_name):
            key = x.get('id')
            feat = x.get('fp')
            target = x.get('target')
            data_saver({
                "id": key,
                "fp": feat,
                "target": target,
            })

But I got the following error:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 204, in <module>
    store_to_matorage(ds)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 197, in store_to_matorage
    save_data_list(train_list, "train")
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 190, in save_data_list
    data_saver({
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 317, in __call__
    self._check_data_numpytype()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 257, in _check_data_numpytype
    raise TypeError("I suspect you need to set the filetype.")
TypeError: I suspect you need to set the filetype.

I have no idea what filetype means here, so I'd like to ask for your help. Could you let me know how to use a string-type vector as part of my dataset? Thank you in advance!

jinserk commented 4 years ago

It's embarrassing, but I forgot to make all the numpy arrays batch-like (expand_dims). It worked once I changed the call to:

            data_saver({
                "id": np.asarray([key]),
                "fp": np.expand_dims(feat, axis=0),
                "target": np.expand_dims(target, axis=0),
            })

One thing I'd like to suggest, though: it would be great to have an option to save data element-wise. If an attribute has dims of (20, 20), it would be more convenient to pass a (20, 20) array rather than a (B, 20, 20) one, either through an additional option such as elementwise=True in DataSaver.__call__() or through a separate method such as DataSaver.save_element().
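
Until something like that exists, the element-wise behaviour suggested here can be approximated on the user side. The following is only a minimal sketch, not part of matorage: save_element is a hypothetical free function, and it assumes the saver is any callable that accepts a dict of batch-shaped arrays, as data_saver does in the snippet above. A plain list's append method stands in for the real saver so that the example runs without a storage backend.

```python
import numpy as np


def save_element(saver, sample):
    """Hypothetical helper (not a matorage API).

    Adds a leading batch axis of size 1 to every value in `sample`
    and passes the resulting dict to `saver`, which is assumed to be
    a callable taking a dict of batch-shaped arrays.
    """
    batch = {
        name: np.expand_dims(np.asarray(value), axis=0)
        for name, value in sample.items()
    }
    saver(batch)
    return batch


# Stand-in for a real saver: just record what would have been saved.
recorded = []
batch = save_element(
    recorded.append,
    {
        "id": "mol-0001",                               # scalar string
        "fp": np.zeros(8, dtype=bool),                  # shape (8,)
        "target": np.zeros((20, 20), dtype=np.float32), # shape (20, 20)
    },
)
for name, array in batch.items():
    print(name, array.shape)
```

With this wrapper a (20, 20) element reaches the saver as a (1, 20, 20) batch, and a plain string becomes a one-element array, matching the np.asarray([key]) form used in the fix above.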

graykode commented 4 years ago

@jinserk Thanks for the detailed bug report. The first issue, with filetype, is my mistake. filetype is a boolean option added in 0.2.0; it saves the corresponding file when you pass in a file path. I had made an error occur whenever filetype is False and the attribute type is string, and I will remove that check.

fixed

remove raise error(TypeError: I suspect you need to set the filetype.)

graykode commented 4 years ago

The unit test related to this passes without problems. Could you remove the raise TypeError("I suspect you need to set the filetype.") line and try again?

graykode commented 4 years ago

@jinserk I think all the bugs in this issue have been resolved, so I'll close it. If you need anything more, please re-open it. Thanks for reporting!

graykode / matorage

attribute with string type #17