rajveerb / ml-pipeline-benchmark

Symlink files cached over remote filesystem #4

Closed rajveerb closed 9 months ago

rajveerb commented 10 months ago

Given a file on a remote filesystem, check whether its contents are cached after it has been accessed once.

This needs to be checked on a C4130 node in CloudLab using a long-term dataset.

rajveerb commented 10 months ago

The symlinked files get cached in memory, which leads to inaccurate end-to-end (E2E) VTune profiling, because the synthetic dataset in the paper consists entirely of symlinks.
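A rough way to see why this happens: the Linux page cache is keyed on the underlying file (inode), not on the path used to open it, so every symlink that resolves to the same target shares the same cached pages. The sketch below counts how many distinct files a set of paths actually resolves to. It assumes the synthetic dataset is built from many symlinks pointing at a small number of real files; the helper names `resolve_targets` and `count_unique_files` are made up for illustration and are not part of this repo.

```python
import os


def resolve_targets(paths):
    """Map each path to the (device, inode) pair of the file it resolves to."""
    targets = {}
    for path in paths:
        st = os.stat(path)  # os.stat follows symlinks by default
        targets[path] = (st.st_dev, st.st_ino)
    return targets


def count_unique_files(paths):
    """Number of distinct underlying files behind the given paths."""
    return len(set(resolve_targets(paths).values()))
```

If `count_unique_files` over the whole dataset directory is much smaller than the number of paths, most reads after the first pass over the targets will likely be served from memory rather than from storage.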

If the goal is to profile only preprocessing, the symlink option is great, because the CPU time spent on I/O to fetch data from storage into main memory is not counted in the profile.

I used vmtouch to check whether a file is cached in memory.
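For scripting the same check without shelling out, here is a minimal sketch that asks the kernel which pages of a file are resident, using the `mincore` system call that vmtouch also relies on. It is Linux-only and is not the exact procedure used for this issue; `cached_fraction` is a hypothetical helper name. Note that newer Linux kernels restrict what `mincore` reports for files the caller neither owns nor can write, so results on a shared dataset may be less informative.

```python
import ctypes
import mmap
import os


def cached_fraction(path):
    """Fraction of the file's pages resident in the page cache (Linux only)."""
    size = os.path.getsize(path)  # follows symlinks
    if size == 0:
        return 0.0

    libc = ctypes.CDLL(None, use_errno=True)
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.mincore.restype = ctypes.c_int

    n_pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * n_pages)()  # one status byte per page

    with open(path, "rb") as f:
        # ACCESS_COPY creates a private mapping that ctypes can take the
        # address of; nothing is ever written to it or to the file.
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY) as mm:
            buf = (ctypes.c_char * size).from_buffer(mm)
            try:
                rc = libc.mincore(ctypes.addressof(buf), size, vec)
                err = ctypes.get_errno()
            finally:
                del buf  # release the buffer export so the mmap can close

    if rc != 0:
        raise OSError(err, os.strerror(err))
    # The least significant bit of each byte is set if the page is resident.
    return sum(b & 1 for b in vec) / n_pages
```

Calling `cached_fraction` on a symlink reports the residency of the file it points to, since both `os.path.getsize` and `open` follow symlinks. Running it before and after a first read of a file on the remote filesystem gives a quick answer to the original question.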