Closed eatpk closed 7 months ago
This proposal makes sense, and you can implement it whenever you want! Though, I'd like your insights on two things for the sake of priority management.
Let me know what you think!
a. For `data_id`s, I presume iterating over a `List[str]` will be faster than a `List[Dict]` for doing a block read, and this `List[Dict]` is longer than the former `List[str]` by 3x in the case of MNIST (since each data_id has "1", "3", "5"), and even more when we have deeper models.
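To make the size difference concrete, here is a minimal sketch; the entry names (`img_0`, `block`) are invented for illustration and are not the actual analog metadata fields:

```python
# Flat layout: one entry per data_id.
flat_ids = ["img_0", "img_1"]

# Nested layout: one dict per (data_id, block) pair, with blocks
# "1", "3", "5" as mentioned above, so three entries per data_id.
nested = [
    {"data_id": d, "block": b}
    for d in flat_ids
    for b in ("1", "3", "5")
]

# Recovering the unique ids from the nested layout needs a full scan.
unique_ids = list(dict.fromkeys(entry["data_id"] for entry in nested))

assert len(nested) == 3 * len(flat_ids)
assert unique_ids == flat_ids
```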
b. Also, we may be able to save more memory: for the current simple MNIST run we have around 4 MB+ of metadata, and if the data grows 1000x, this can easily go over 4 GB. So we may want to put this on a diet.
c. We don't have to do this for loop..?

Unrelated FYI: another low-hanging way to improve performance: I suggest using `np.array` (with a pre-defined size) over a Python list with `list.append()`, since appending can involve copying the underlying buffer whenever the list outgrows its allotted memory, unless there is a reason to use a list over an array. <- This can be a major refactoring of the code that boosts our performance as the data size gets larger.
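A minimal sketch of the pre-allocation pattern being suggested; the sizes are illustrative, and both patterns produce the same array:

```python
import numpy as np

n_records, feature_dim = 1000, 8  # illustrative sizes

# Pattern A: grow a Python list, then stack into an array at the end.
rows = []
for i in range(n_records):
    rows.append(np.full(feature_dim, float(i), dtype=np.float32))
stacked = np.stack(rows)

# Pattern B: pre-allocate the array once and write rows in place,
# avoiding the per-row Python objects and the final copy in np.stack.
buf = np.empty((n_records, feature_dim), dtype=np.float32)
for i in range(n_records):
    buf[i] = float(i)

assert np.array_equal(stacked, buf)
```

(One caveat: CPython's `list.append` is amortized O(1) because the list over-allocates, so the bigger win here is usually avoiding the per-row array objects and the final stack/copy.)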
As for `flatten`ed logging, I think this would be easier in terms of maintenance, at least in my opinion. Currently, we have a lot of code that has to iterate through the hashmap structure in order to find the configs, which necessitates a lot of for loops. We can reduce this by decreasing the use of dictionary structures and making them into an array or list.

Great! Thanks for the quick response. If you can add a performance comparison in your PR, that would be perfect!
I ran an experiment, and the metadata size decreased from 4 MB to 415 KB for MNIST data with 6000 data points. It can be reduced much further as the model size and the number of data points increase. (But this may be minimal compared to the memory used to hold the model parameters :p)
Currently, the metadata is being created like below: https://github.com/sangkeun00/analog/blob/12fc7a5e9aac2db6648d97f3e901b01024f81e93/analog/logging/mmap.py#L41-L50
This is redundant, especially when reloading the metadata into a variable: we iterate through the whole metadata as a dictionary and reorganize it into a `data_id_to_chunk`, which may incur the overhead of iterating over the whole data set with a Python for loop every time it is loaded, after the file IO of metadata.json from here. Thus I propose a new metadata.json schema like below:
So an example would be:

Assume from here the index is `index=17`; we can formulate the `nested_dict` with:

In this way, we don't have to reiterate once more after the file load (json load). Also, this makes https://github.com/sangkeun00/analog/issues/54 more intuitive and easier.
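A hypothetical sketch of the difference being proposed; the field names (`offset`, `size`, `img_0`) are invented for illustration, not the actual analog schema:

```python
import json

# Current style: a flat list that must be regrouped after every
# json.load with an extra O(n) Python for loop.
current = [
    {"data_id": "img_0", "offset": 0, "size": 128},
    {"data_id": "img_0", "offset": 128, "size": 128},
    {"data_id": "img_1", "offset": 256, "size": 128},
]

data_id_to_chunk = {}
for entry in current:  # the pass we want to eliminate
    data_id_to_chunk.setdefault(entry["data_id"], []).append(
        {"offset": entry["offset"], "size": entry["size"]}
    )

# Proposed style: store the grouped structure directly, so json.load
# yields the lookup table with no further iteration.
proposed = {
    "img_0": [{"offset": 0, "size": 128}, {"offset": 128, "size": 128}],
    "img_1": [{"offset": 256, "size": 128}],
}
restored = json.loads(json.dumps(proposed))  # round-trip as metadata.json

assert restored == data_id_to_chunk
```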
Let me know what you think @sangkeun00; if you approve, I will take care of it.