logix-project / logix

AI Logging for Interpretability and Explainability🔬
Apache License 2.0

Diet on metadata #77

Closed eatpk closed 7 months ago

eatpk commented 8 months ago

Currently, the metadata is being created like below: https://github.com/sangkeun00/analog/blob/12fc7a5e9aac2db6648d97f3e901b01024f81e93/analog/logging/mmap.py#L41-L50

List[Dict{
    "data_id": data_id,
    "size": bytes,
    "path": path,
    "offset": offset,
    "shape": arr.shape,
    "dtype": str(arr.dtype),
}]

This is redundant. In particular, when reloading the metadata into a variable, we iterate over the whole List[Dict] and reorganize it into a data_id_to_chunk structure, which incurs the overhead of a Python for loop over the entire dataset every time the metadata is loaded, right after the file IO of metadata.json from here.
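
For context, the reorganization I'm referring to looks roughly like this (a simplified sketch, not the exact mmap.py code; the chunk contents are just illustrative):

import json
from collections import defaultdict

with open("metadata.json") as f:        # file IO
    metadata = json.load(f)             # List[Dict], one entry per (data_id, module)

data_id_to_chunk = defaultdict(list)
for entry in metadata:                  # Python for loop over the whole metadata, every load
    data_id_to_chunk[entry["data_id"]].append(
        (entry["path"], entry["offset"], entry["shape"], entry["dtype"])
    )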

Thus I propose a new metadata.json schema like the one below:

Dict{
    "data_id": List[str],            // list of data_ids
    "path": List[Tuple(str, str)],
    "offset": int32,                 // offset (in bytes) per data_id
    "shape": List[Tuple(int, int)],  // Not sure if the shape will always be (int, int); we can relax it to List[List[int]] in case we have to accommodate n-dim shaped models.
    "dtype": str,
    "dtype_byte": int,               // dtype size in bytes.
}

So an example would be:

Dict{
    "data_id": ["000", "001", "002", ...],       // doesn't have to be ordered, just keeping track of the order.
    "path": [("5", "grad"), ("3", "grad"), ("1", "grad")],
    "offset": 2140160,                           // 4 * (10*256 + 256*512 + 512*784)
    "shape": [(10, 256), (256, 512), (512, 784)],
    "dtype": "float32",
    "dtype_byte": 4,
}
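
For reference, the writer side could produce this schema with something like the sketch below (build_metadata and named_arrays are just illustrative names, not existing code):

import numpy as np

def build_metadata(data_ids, named_arrays):
    # named_arrays: [((module_id, log_type), np.ndarray), ...] in a fixed order,
    # e.g. (("5", "grad"), arr) -- one entry per module, shared by every data_id.
    paths = [name for name, _ in named_arrays]
    shapes = [arr.shape for _, arr in named_arrays]
    dtype = named_arrays[0][1].dtype
    dtype_byte = dtype.itemsize
    # bytes per data_id = (total number of elements across modules) * bytes per element
    offset = sum(int(np.prod(s)) for s in shapes) * dtype_byte
    return {
        "data_id": list(data_ids),
        "path": paths,
        "offset": offset,
        "shape": shapes,
        "dtype": str(dtype),
        "dtype_byte": dtype_byte,
    }

For the shapes in the example, this gives offset = 4 * (10*256 + 256*512 + 512*784) = 2140160, matching the comment above.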

Assume from here that the index is index=17; we can formulate the nested_dict with:

array = np.ndarray(offset // dtype_byte, dtype, buffer=mmap, offset=index * offset, order="C")  # flat view; byte offset = 17 * 2140160
byte_offset = index * offset
for i, (e1, e2) in enumerate(path):
    layer_size = shape[i][0] * shape[i][1]  # Let me know if shape can be more than two dimensional.
    nested_dict[e1][e2] = np.ndarray(shape[i], dtype, buffer=mmap, offset=byte_offset, order="C")
    byte_offset += layer_size * dtype_byte
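
Putting it together, the whole read path could look roughly like this (just a sketch; the memmap file name and load_sample helper are illustrative, not existing code):

import json
from collections import defaultdict
import numpy as np

with open("metadata.json") as f:
    meta = json.load(f)

dtype = np.dtype(meta["dtype"])
sample_bytes = meta["offset"]                        # bytes per data_id
mm = np.memmap("data.mmap", dtype=np.uint8, mode="r")

def load_sample(index):
    nested_dict = defaultdict(dict)
    byte_offset = index * sample_bytes
    for (module_id, log_type), shape in zip(meta["path"], meta["shape"]):
        nested_dict[module_id][log_type] = np.ndarray(
            shape, dtype, buffer=mm, offset=byte_offset, order="C"
        )
        byte_offset += int(np.prod(shape)) * meta["dtype_byte"]
    return nested_dict

nested = load_sample(17)                             # index=17 from the example above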

In this way, we don't have to iterate once more after the file load (json load). Also, this makes https://github.com/sangkeun00/analog/issues/54 more intuitive and easier to implement.

Let me know what you think @sangkeun00; if you approve, I will take care of it.

sangkeun00 commented 8 months ago

This proposal makes sense, and you can implement it whenever you want! Though, I'd like to hear your thoughts on two things, for the sake of priority management.

  1. Do you think this suggestion would lead to a (meaningful) performance improvement?
  2. Does it improve code maintainability going forward? (Or would it make implementing new features easier?)

Let me know what you think!

eatpk commented 8 months ago
  1. Yes.
     a. If we do a sequential block read of data_ids, iterating over a List[str] should be faster than iterating over a List[Dict], and the List[Dict] is 3x longer than the corresponding List[str] in the MNIST case (since each data_id carries entries for "1", "3", "5"), and even longer for deeper models.
     b. We may also save memory: the current simple MNIST run already produces 4MB+ of metadata, and if the data grows 1000x this can easily exceed 4GB, so we may want to put it on a diet.
     c. We don't have to do this for loop..?

Unrelated FYI: another low-hanging performance improvement is to use np.array with a predefined size instead of a Python List with list.append(), since append involves copying the underlying array whenever the list grows beyond its allotted memory, unless there is a reason to prefer List over array (see the rough sketch at the end of this comment). <- This could be a major refactoring that boosts our performance as our data size gets larger.

  2. I came up with this idea while trying to implement flattened logging, and I think it would be easier in terms of maintenance, at least in my opinion. Currently, we have a lot of code that has to iterate through hashmap structures in order to find the configs, which necessitates a lot of for loops. We can reduce this by decreasing the use of dictionary structures and turning them into arrays or lists.
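
Rough sketch of the preallocation idea from the FYI above (n, d, and the random fill are purely illustrative):

import numpy as np

n, d = 6000, 512                           # illustrative: row count and feature size known up front

# current pattern: grow a Python list, then convert
rows = []
for i in range(n):
    rows.append(np.random.randn(d))        # periodic reallocation/copy as the list grows, plus a final copy
stacked = np.stack(rows)

# preallocated pattern: allocate once, write in place
buf = np.empty((n, d), dtype=np.float64)
for i in range(n):
    buf[i] = np.random.randn(d)            # no growth, no final copy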

sangkeun00 commented 8 months ago

Great! Thanks for the quick response. If you can add a performance comparison in your PR, that would be perfect!

eatpk commented 7 months ago

I ran an experiment, and the metadata size decreased from 4MB to 415KB for MNIST data with 6,000 data points. It can shrink much further as the model size and the number of data points increase. (But this can be minimal compared to the memory used to hold the model parameters :p)