FileSystem Read Func. Enhancement

AhmedYasen commented 2 years ago

update FileSystem struct > read method to use an LRU for file handles to reduce the headache of opening the same file for every request

AhmedYasen commented 2 years ago

I was thinking that the file variable is the whole file handle but ok. For the for loop which collects (downloads) the file blocks, I think if I use lru, it will always have the last 10 blocks' because the loop doesn't end until it collect all blocks found in the metadata, thus the lru is useless.

muhamadazmy commented 2 years ago

No, as i explained. the File return from the meta is just (informaiton) about the file including the block hashes. The cache on the other hand is an abstract way to open the 'block' either from the disk, or by downloading it from the server.

The loop only go over all needed block to read a certain chunk of the data. If you read the code correctly the loop first skips all blocks that is not need (seek to the correct offset) then read a certain length then if the required chunk of data is read the loop breaks.

Which means if you are accessing a big file in the middle it will NOT download all blocks, but only the blocks that are needed to read this part of the file.

I don't get why u say the LRU is useless. The point of the LRU is that you will keep the files that has been already OPEN (I am talking about std::fs::File not the types::File) from the cache in memory because opening this file is costly then next time you want to access the same file it's already open and in memory, you can optimize for seek as well. I hope you know that when you open a file in rust and then this file is out of scope the file will get closed. By keeping the file in cache (along with the seek) the file will stay open until the cache is full then is dropped.

The loop then will keep opening and pushing (needed) chunks in lru. if the loop need to access the same chunk again (in a subsequent call to read) the file is already there.

Think about it like this

blocks
|---------------|-----------------|----------------|
read (1) 
^-----^
read (2)
        ^-------------^

A first read call wants to read from (0 to 100) so it reads part as per the graph above. then it will return. On this call the file is open from the cache (might also get downloaded if needed) A 2nd call will ask (100, 300) so it will try to open the first block again but that is already in LRU and also seeked to the proper position so you can directly read the data (to end of the block). Because the read spans to the next block, this block will be opened again (because it's not in LRU yet) and pushed to LRU, and u read what u need from it. Other subsequent read calls can happen

Once the LRU is getting full the older un-used chunks will start dropping out and automatically closed.

AhmedYasen commented 2 years ago

Yes, This is what I understand. But what I mean by 'useless' is; suppose that the lru has size of 10, the loop will push every new block to the lru and kickout the oldest one, so at the end of the loop the lru will have the last 10 pushed blocks (and from the last requested file)

as you said before that the file blocks' reading process is sequentially, so I think it will be beneficial if we store the last un-completed (and seeked) block (fd).

as an example

blocks |-------0--------|--------1---------|---------2-------|---------3-------|--------4--------| read (1) ^-----^ (store fd of block-0) read (2) ^-------------------------------------------------------^ (store fd of block-3)

muhamadazmy commented 2 years ago

Yes, you are right that we 'optimally' need to store only the last block. But LRU is more generic solution where u can configure how many files u want to keep. The lru size can then be set to 2 or even 1. But keeping 10 files won't be an issue because sometimes read is not sequential (random access is also a thing) imagine your are playing a video file. You can be reading sequentially (playing the video) and you can also rewind to watch the previous minute (then you jump back in time)

Also, you didn't take into account that the LRU is shared between all open files on the file system. The read can be called in parallel to read multiple files (different inodes) at a different offsets. So you can't keep track of the last block of all open files, right? Hence an LRU is still a better solution,

AhmedYasen commented 2 years ago

Well, This is very clear now.

For the point keeping track of the last block, it will be the last block of the requested blocks and this is possible because the whole loop runs in a single thread on the requested blocks.

so, I think it may be like this,

for block in blocks {
   fd := if file in lru {
       use file from lru
   }  else {
         cache.get()
   }

       if file is not at correct seek{
         seek
       }
   loop {
       read = fd.read()
       ...
       if read == 0 {
             store fd in the lru (because it is not completed yet
       }
   }
}

muhamadazmy commented 2 years ago

Yes, this is a good idea, to only keep last open file before u return in lru

threefoldtech / rfs

FileSystem Read Func. Enhancement #3