nick-ak96 / 3storage

Here we implement and discuss 3storage, Erlang storage of triples.

Architecture of 3store #2

iztok-savnik opened this issue 5 years ago (status: Open)

iztok-savnik commented 5 years ago

Here we discuss the architecture of 3store... if you have a suggestion for a different name, let me know...

iztok-savnik commented 5 years ago

Here are the initial ideas of the triple storage manager:

  - File of pages
  - ISAM tree
  - Server

iztok-savnik commented 5 years ago

There are many details that will have to be carefully designed.

  1. Since we will deal with big data, we need cursor-based access. Experience with mnesia shows you cannot use get-all-answers access methods (see the sketch after this list).
  2. We have to support server-like behavior: multiple processes can access multiple triple tables. (As you see, the terminology has to be defined first so that we can see exactly what we are talking about.)
  3. At some point, sessions with 3store will have to be defined to support cursors.
  4. At this point, I would forget about page caches. This will come at the "distributed epsilon" level.
  5. Since a lot of data will flow through the storage manager, copying in memory and long lists of function parameters should be avoided.
  6. Triple-pattern access methods can be implemented directly on the 3store in the first phase. In the second phase, the "distributed epsilon" layer is added.
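
A minimal sketch of the cursor-based access mentioned in item 1, assuming a hypothetical module and message protocol (none of these names exist in the code base yet): a cursor is a process that hands out answers in bounded batches instead of materializing all answers at once.

```erlang
%% Hypothetical cursor sketch (illustrative only): answers are served
%% in batches of at most N triples instead of "get all answers".
-module(tp_cursor_sketch).
-export([open/1, next/2, close/1]).

%% Open a cursor over a list of triples (stands in for a real scan).
open(Triples) ->
    spawn(fun() -> loop(Triples) end).

%% Ask the cursor for the next batch of at most N triples.
%% Returns a list of triples, or the atom eof when exhausted.
next(Cursor, N) ->
    Cursor ! {next, N, self()},
    receive
        {Cursor, Batch} -> Batch
    after 5000 ->
        {error, timeout}
    end.

close(Cursor) ->
    Cursor ! close,
    ok.

loop([]) ->
    receive
        {next, _N, From} -> From ! {self(), eof}, loop([]);
        close -> ok
    end;
loop(Triples) ->
    receive
        {next, N, From} ->
            {Batch, Rest} = split(N, Triples),
            From ! {self(), Batch},
            loop(Rest);
        close -> ok
    end.

split(N, L) when length(L) =< N -> {L, []};
split(N, L) -> lists:split(N, L).
```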
iztok-savnik commented 5 years ago

I refined the text on architecture:

  - Page-level file server
  - ISAM index
  - TP access method

iztok-savnik commented 5 years ago

Some more comments on the proposed architecture.


nick-ak96 commented 5 years ago

@iztok-savnik I suggest that along with the issues we start a wiki in this repo (I see it is currently disabled), where we will put the final results of our discussions here in the issues. That will, of course, later on serve as documentation for the software. Also, it will help us not to get lost in long discussions and to make a cleaner implementation.


Currently, I see you defined 3 levels for the storage manager:

  1. page level
  2. indexes
  3. triple-pattern access methods

The last one, I think, is already a higher level. We must take it into account in order to design the lower levels, but first I think we should focus on designing the lower-level architecture.

Page level

The page-level does not know anything about the contents of pages.

What happens during a record insertion? Should we fetch the appropriate page for insertion from the index, modify it in memory and then just blindly overwrite it? I probably do not understand completely what you meant here.

You mentioned read/write operations with chunks. I think it can be implemented as an intermediate module between the page level and the access level that will also do caching (or we can add caching later). In the literature, this is called a buffer manager.
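
To make the idea concrete, here is a purely illustrative sketch of such a buffer module, assuming the page level exposes a read function that the caller passes in (all names are made up):

```erlang
%% Rough sketch of the buffer-manager idea: a cache of pages sitting
%% between the access level and the page level. A real buffer manager
%% would add eviction (e.g. LRU) and write-back of dirty pages.
-module(buffer_sketch).
-export([new/0, fetch/3]).

%% The buffer is just a map PageId => PageBinary here.
new() -> #{}.

%% fetch(Buffer, PageId, ReadFun) -> {PageBinary, NewBuffer}
%% ReadFun(PageId) is the page-level read operation supplied by the caller.
fetch(Buffer, PageId, ReadFun) ->
    case maps:find(PageId, Buffer) of
        {ok, Page} ->
            {Page, Buffer};                        %% cache hit
        error ->
            Page = ReadFun(PageId),                %% cache miss: go to page level
            {Page, maps:put(PageId, Page, Buffer)}
    end.
```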

ISAM index

I agree with your choice of index structure at this point. But do you think this index structure will be sufficient for 3store in general? It is fine if the data is more or less static, but if it becomes dynamic a B+-tree will be a better option. We should probably think of supporting both of them, but that should be in some future phase.


I suggest that we start with some definitions and lists of processes and methods that the page level should support.

I will start here, but I think we should put them into the wiki and discuss here if needed.

Processes

  1. Management of requests. Requests come from the buffer manager; they are accepted and put into a request queue. A pool of workers processes requests from the queue in parallel.
  2. Write request. The request is a transaction.
  3. Read request. If any write requests are executing, the request is paused.

Methods (a small interface sketch follows the list)

  1. Read pages
  2. Write pages
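
As a starting point for discussion, here is a minimal sketch of how these two methods could be exposed by a page-level process; the message format and names are placeholders, not a decided interface.

```erlang
%% Hypothetical page-level client API (names and message shapes are
%% only placeholders for the discussion above).
-module(page_level_sketch).
-export([read_pages/2, write_pages/2]).

%% Read a list of pages identified by their page ids.
read_pages(PageServer, PageIds) ->
    request(PageServer, {read_pages, PageIds}).

%% Write a list of {PageId, PageBinary} pairs as one request,
%% i.e. one unit of work in the request queue.
write_pages(PageServer, Pages) ->
    request(PageServer, {write_pages, Pages}).

request(PageServer, Request) ->
    Ref = make_ref(),
    PageServer ! {request, Ref, self(), Request},
    receive
        {reply, Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.
```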
iztok-savnik commented 5 years ago

I suggest that along with the issues we start a wiki in this repo (I see it is currently disabled), where we will put the final results of our discussions here in the issues. That will, of course, later on serve as documentation for the software. Also, it will help us not to get lost in long discussions and to make a cleaner implementation.

Totally agree.

Currently, I see you defined 3 levels for the storage manager: page level, indexes, triple-pattern access methods.

Yes.

The last one, I think, is already a higher level. We must take it into account in order to design the lower levels, but first I think we should focus on designing the lower-level architecture.

Totally agree.

What happens during a record insertion? Should we fetch the appropriate page for insertion from the index, modify it in memory and then just blindly overwrite it? I probably do not understand completely what you meant here.

Right, TP-AM would really be a triple-level storage manager. The insertion operation would then be a part of the triple-level storage manager (TSM). Therefore, I suggest that the 3rd level is the TSM.

There are details on how the messages will flow. I will present the triple-pattern query node from the side of the query evaluation sub-system in more detail.

The current interface of tp_query_node.erl includes the implementation of a query node: it accesses the database (it should be general and accept various interfaces to storage managers or DBMSs) and returns the result to the parent query node via streams. Documentation is at https://big3store.github.io/big3store/ (tp_query_node).

You mentioned read/write operations with chunks. I think it can be implemented as an intermediate module between the page level and the access level that will also do caching (or we can add caching later). In the literature, this is called a buffer manager.

At the level of query evaluation (implemented with query nodes, i.e. processes, linked in a tree) we already have "chunks": a chunk is the maximum number of pages transferred for one tp_query_node request to the DBMS before other requests get their turn (this is again a concept from query evaluation, in particular the unit of network transfer in our query evaluation system). Something like this... I will go into more detail here later.

Obviously, the concepts defined on the level of query evaluation will have to be well synchronised with the storage manager that we design.

Caching can be implemented without much effort in the page-level storage manager. ??

I agree with your choice of index structure at this point. But do you think this index structure will be sufficient for 3store in general? It is fine if the data is more or less static, but if it becomes dynamic a B+-tree will be a better option. We should probably think of supporting both of them, but that should be in some future phase.

The ISAM implementation can be seen as a part of a B+-tree implementation. Therefore, initially we have only bulk loading and efficient retrieval, but later we can add balancing and insertions, if needed.
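
To illustrate the bulk-loading idea (a toy sketch only, not the intended implementation): sorted <key, pointer> pairs are packed into fixed-fanout pages, and the first key of each page is promoted one level up until a single root page remains. There is no balancing and no insertion, only the build step.

```erlang
%% Toy sketch of ISAM bulk loading. Every upper-level entry is
%% {FirstKeyOfLowerPage, LowerPage}; recursion stops when one page
%% (the root) is enough to hold all entries of the current level.
-module(isam_bulk_sketch).
-export([build/2]).

build(SortedPairs, FanOut) when FanOut > 1 ->
    case chunk(SortedPairs, FanOut) of
        [OnlyPage] ->
            {page, OnlyPage};
        Pages ->
            Upper = [{Key, {page, Page}} || [{Key, _} | _] = Page <- Pages],
            build(Upper, FanOut)
    end.

%% Split a sorted list into pages of at most N entries.
chunk(L, N) when length(L) =< N -> [L];
chunk(L, N) ->
    {Page, Rest} = lists:split(N, L),
    [Page | chunk(Rest, N)].
```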

One of the ideas behind the architecture is to initially implement as simple a storage manager as possible, but we should see the complete architecture together with the links to tp_query_node and the b3s implementation of streams.

iztok-savnik commented 5 years ago

I tried to enable the wiki but got the following: "Upgrade to GitHub Pro or make this repository public to enable Wikis." I do not know much about GitHub ... should we make it public?

nick-ak96 commented 5 years ago

Well, if there are no problems regarding license agreements, I think we can make it public. Do we need to check this with Kiyoshi?

iztok-savnik commented 5 years ago

As you have proposed, we can now design an Erlang process for the page-level storage manager... Can you specify the basic functionality?

nick-ak96 commented 5 years ago

As I see it, the general picture of the page level should look something like this.

Pages are organized into page files. Since pages are of fixed size, page files will also hold a fixed (limited) number of pages. That will allow us to locate pages in page files using an offset (the page ID). The server should be able to handle multiple page files, which will allow us to distribute data over several disks. All page files can be maintained by a pool of workers managed by a master worker. Each worker manages a portion of the page files and accepts requests to read from and write to pages. The master worker will route requests between the buffer level and the workers, and will also manage transactions. In this way we will be able to do transactions faster.
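
If we go with fixed-size pages and page-id-as-offset addressing, locating a page is trivial; here is a hedged sketch using the standard file module (the page size and the file name below are assumptions, not decisions):

```erlang
%% Minimal sketch of locating a page by its id inside a page file,
%% assuming fixed-size pages. Example usage:
%%   {ok, Fd} = file:open("pages.dat", [read, write, binary, raw]).
-module(page_file_sketch).
-export([read_page/2, write_page/3]).

-define(PAGE_SIZE, 4096).  %% assumed page size in bytes

%% The page id is an index; the byte offset is simply PageId * PAGE_SIZE.
read_page(Fd, PageId) ->
    file:pread(Fd, PageId * ?PAGE_SIZE, ?PAGE_SIZE).

write_page(Fd, PageId, PageBin) when byte_size(PageBin) =:= ?PAGE_SIZE ->
    file:pwrite(Fd, PageId * ?PAGE_SIZE, PageBin).
```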


If the above text seems reasonable, then we can discuss the following questions:

  1. Since the page level does not know about the contents of the pages, does it mean that at this level we read and write whole pages, i.e. we cannot edit a single record on a page? The ISAM tree is basically a collection of pages, so I think we need the ability to write records into pages as well.
  2. How do we manage page files? I assume that we want to handle multiple page files at a time for distribution purposes. I would then create some header structure for such a file that describes its contents: empty pages, offsets from other files.
  3. Since the page size is fixed, do we also fix the size of records or are they of variable length?

iztok-savnik commented 5 years ago

I had somewhat different ideas about the SM.

The page level works with one data file! If multiple TP-AMs need access to a particular file, then they should work with one (the same) page-level process.

I see that you still have a "sequential view": the page-level storage manager is a module that can be spawned with given parameters by some other module (there are no dispatchers on the page level).


Pages are organized into page files. Since pages are of fixed size, page files will also hold a fixed (limited) number of pages. That will allow us to locate pages in page files using an offset (the page ID). The server should be able to handle multiple page files, which will allow us to distribute data over several disks.

One process is created for one data file (composed of pages). I will also check this once again...

All page files can be maintained by a pool of workers managed by a master worker. Each worker manages a portion of the page files and accepts requests to read from and write to pages. The master worker will route requests between the buffer level and the workers, and will also manage transactions. In this way we will be able to do transactions faster.

My initial idea was that one page-level process handles one file. We should also store the mapping from file names to existing processes. Do we need a master worker?

Since the page level does not know about the contents of the pages, does it mean that at this level we read and write whole pages, i.e. we cannot edit a single record on a page? The ISAM tree is basically a collection of pages, so I think we need the ability to write records into pages as well.

On the ISAM level (below it is the page-level file) we have pages containing <key, pointer> pairs. We need to be able to handle these <key, pointer> pairs.
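
For illustration only, assuming fixed-length integer keys and pointers (the actual field widths are still open), the <key, pointer> pairs of an index page could be packed into and unpacked from a page binary like this:

```erlang
%% Toy sketch of fixed-length <key, pointer> records inside a page binary
%% (8-byte unsigned key, 8-byte unsigned pointer; the widths are assumptions).
-module(isam_page_sketch).
-export([pack/1, unpack/1]).

pack(Pairs) ->
    << <<Key:64, Ptr:64>> || {Key, Ptr} <- Pairs >>.

unpack(PageBin) ->
    [{Key, Ptr} || <<Key:64, Ptr:64>> <= PageBin].
```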

How do we manage page files? I assume that we want to handle multiple page files at a time for distribution purposes. I would then create some header structure for such a file that describes its contents: empty pages, offsets from other files.

This is a good point. We would need to handle multiple disks. Let's rethink the page-level process design from this point of view. I think the existing design should work... a header may be a good idea.

Since the page size is fixed, do we also fix the size of records or are they of variable length?

Data records are handled at the TP access-method level. ISAM records (the pairs) should probably have a fixed length for a given key (currently all components of triples are integer-encoded; we may rethink this).


I will do one more integration of the ideas from my side and try to produce a more stable text about the page level and the other levels by the evening.

I will also upload the papers on storage managers ...

nick-ak96 commented 5 years ago

One process is created for one data file

The whole idea about workers came to me because I was thinking that an Erlang process is a system process. So my intention was that, since the creation and handling of system processes carries some penalties, one page file per process would not be efficient enough.

I now realized that an Erlang process is a process inside the Erlang VM and that they are much lighter and cheaper than system processes.

So, I agree, one process per page file should be OK.

We should also store the mapping from file names to existing processes. Do we need a master worker?

We could just register processes under the same names as the files; there is no need to store this data. We do need to keep track of all the created processes. That is one of the reasons why I thought about the master worker. Also, how would you handle transactions? If you need to write data to several page files, both processes that handle these files must be aware that they must hold locks at the same time. I might be wrong here; I will have a look at transactions again.

iztok-savnik commented 5 years ago

I now realized that an Erlang process is a process inside the Erlang VM and that they are much lighter and cheaper than system processes.

Yes, they write somewhere that they can spawn more than 1M processes in a second. Also, communication is very fast...

We could just register processes under the same names as the files; there is no need to store this data. We do need to keep track of all the created processes. That is one of the reasons why I thought about the master worker.

This is a good solution. You can query whether the process exists and then access it through its name.

I was about to suggest using ETS tables. One possible problem that bothered me is the critical section created around the table. However, careful access rules could provide safe access.

The same can be done with processes. You ask whether process A exists. It does not exist, so you create it. But somebody else has just created A, so you get an error when trying to create A. After A has been created in whatever way, I would again obtain a pid from the static process name.
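
A small sketch of this "look up or create" step, relying on the fact that register/2 raises a badarg error when the name is already taken (the module and function names are placeholders):

```erlang
%% Look up a registered page-level process by name, or start and register
%% it; the badarg catch covers the race where somebody registered the
%% name between our whereis/1 check and our register/2 call.
-module(page_proc_sketch).
-export([get_or_start/2]).

get_or_start(Name, StartFun) when is_atom(Name) ->
    case whereis(Name) of
        undefined ->
            Pid = StartFun(),
            try register(Name, Pid) of
                true -> Pid
            catch
                error:badarg ->
                    %% Lost the race: discard our process and use the
                    %% one that was registered first.
                    exit(Pid, kill),
                    whereis(Name)
            end;
        Pid ->
            Pid
    end.
```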

Also, how would you handle transactions? If you need to write data to several page files, both processes that handle these files must be aware that they must hold locks at the same time. I might be wrong here; I will have a look at transactions again.

I would skip transactions for the Master's thesis. In my opinion, this could become very complex. Still, it makes sense to think about transactions during the design so that they can be added more easily later.


One more thing that is bothering me is the additional data transfer between the file-level AM and the index-level AM. It is probably better to design the first version without worrying much about the cost of data transfer (from one process to another). Later we can try to optimize this.

iztok-savnik commented 5 years ago

Here is the new sketch of the page file server and the ISAM index.



Page file server

ISAM index