[x] Partition large relationship files into smaller files so that we can load them one partition at a time. We can have a size threshold for this: files larger than 5GB are broken into k pieces. This requires an additional step in the loader, where we inspect the file size and decide on a safe partition size (e.g., one that splits the file into pieces of roughly 1GB each, assuming the edges are distributed uniformly). This could be achieved as follows:
Suppose a relationship file, say knows.csv, is 100GB and we know that its source node labels have 10K, 20K, and 15K nodes. We pick the smallest one, e.g., 10K, and compute the "num-vertices-in-partition" that would lead to roughly 5GB partitions; in this case we would extrapolate that 500 source vertices correspond to roughly 5GB. Then we scan through the file and, for each bucket of 500 consecutive src vertices (vertices 0-499, 500-999, etc.), we create a partition. That is, every line of knows.csv whose source node ID falls between, say, 500 and 999 goes to the partition for 500-999. This partitioning will help create forward adjacency lists and property lists (see the sketch below). We then repeat this process, this time for destination vertices, for backward adjacency lists and edge property lists.
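A minimal sketch of the source-side partitioning pass described above. The CSV layout (srcLabel,srcID,dstLabel,dstID,...), the target partition size parameter, and the function name are assumptions for illustration, not the loader's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

void partitionRelFileBySrc(const std::string& relFilePath, uint64_t numSrcVertices,
                           uint64_t targetPartitionSizeBytes, const std::string& outDir) {
    uint64_t fileSize = std::filesystem::file_size(relFilePath);
    // Assuming edges are distributed uniformly over source vertices, this many
    // consecutive source vertex IDs should produce ~targetPartitionSizeBytes of data.
    uint64_t numPartitions = (fileSize + targetPartitionSizeBytes - 1) / targetPartitionSizeBytes;
    uint64_t verticesPerPartition = std::max<uint64_t>(1, numSrcVertices / numPartitions);

    // One output file per bucket of consecutive source vertex IDs.
    std::vector<std::ofstream> partitions(numPartitions + 1);
    for (uint64_t i = 0; i < partitions.size(); i++) {
        partitions[i].open(outDir + "/part" + std::to_string(i) + ".csv");
    }
    std::ifstream in(relFilePath);
    std::string line;
    while (std::getline(in, line)) {
        // Assumed layout: srcLabel,srcID,dstLabel,dstID,...; the source ID is column 1.
        auto first = line.find(',');
        auto second = line.find(',', first + 1);
        uint64_t srcID = std::stoull(line.substr(first + 1, second - first - 1));
        uint64_t bucket = std::min<uint64_t>(srcID / verticesPerPartition, numPartitions);
        partitions[bucket] << line << '\n';
    }
}
```

The same routine, keyed on the destination ID column instead, would produce the partitions used for backward adjacency lists and edge property lists.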
[x] Ability to ingest multi-partition files, e.g., vPerson0.csv, vPerson1.csv, ... We need a feature where the pattern for the files can be specified as a regex (or we treat the filename itself as a regex); see the sketch below.
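A small sketch of resolving a filename pattern (e.g., "vPerson\d+\.csv") to the concrete list of partition files to ingest. The directory-scan approach and the function name are illustrative assumptions.

```cpp
#include <algorithm>
#include <filesystem>
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> resolvePartitionedFiles(const std::string& dir,
                                                 const std::string& pattern) {
    std::regex re(pattern);
    std::vector<std::string> matches;
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        auto name = entry.path().filename().string();
        if (std::regex_match(name, re)) {
            matches.push_back(entry.path().string());
        }
    }
    // Sort for a deterministic ingestion order across runs.
    std::sort(matches.begin(), matches.end());
    return matches;
}
```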
[x] Add a header with version info into the metadata, plus the logic to check storage version compatibility when we load the system up (see the sketch below). (Phase 2)
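A hedged sketch of embedding a storage version in the metadata header and checking it at startup. The magic bytes, field layout, and STORAGE_VERSION constant are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>

static constexpr char MAGIC[4] = {'G', 'F', 'D', 'B'};  // hypothetical magic bytes
static constexpr uint64_t STORAGE_VERSION = 1;

void writeMetadataHeader(std::ofstream& out) {
    out.write(MAGIC, sizeof(MAGIC));
    out.write(reinterpret_cast<const char*>(&STORAGE_VERSION), sizeof(STORAGE_VERSION));
}

void checkMetadataHeader(std::ifstream& in) {
    char magic[4];
    uint64_t version;
    in.read(magic, sizeof(magic));
    in.read(reinterpret_cast<char*>(&version), sizeof(version));
    if (std::memcmp(magic, MAGIC, sizeof(MAGIC)) != 0) {
        throw std::runtime_error("Metadata file is not in the expected format.");
    }
    if (version != STORAGE_VERSION) {
        throw std::runtime_error("Incompatible storage version " + std::to_string(version) +
            "; this build expects version " + std::to_string(STORAGE_VERSION) + ".");
    }
}
```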
[x] Allow the user to provide the columns of the table in the CSV file in an order different from the one the DBMS expects (see the sketch below). (Phase 2)
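A sketch of mapping the CSV header onto the column order the catalog expects, so that columns may appear in any order in the file. The `expectedColumns` parameter stands in for whatever the catalog stores; all names are illustrative.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Returns, for each expected column i, the index of that column in the CSV file.
std::vector<size_t> mapCSVColumns(const std::string& headerLine,
                                  const std::vector<std::string>& expectedColumns) {
    std::unordered_map<std::string, size_t> positionInFile;
    std::stringstream ss(headerLine);
    std::string column;
    for (size_t i = 0; std::getline(ss, column, ','); i++) {
        positionInFile[column] = i;
    }
    std::vector<size_t> mapping;
    for (const auto& expected : expectedColumns) {
        auto it = positionInFile.find(expected);
        if (it == positionInFile.end()) {
            throw std::runtime_error("CSV header is missing expected column: " + expected);
        }
        mapping.push_back(it->second);
    }
    return mapping;
}
```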
Usability
[ ] Warning and Error Reporting Mechanism That Doesn't Fail: Add thread-safe warning and error reporting logic so that loading does not fail when a particular line in a node or edge file has a problem. Instead, we should accumulate a list of warnings and errors, log them both when they are found and again as a final report, and save this report somewhere (see the sketch after the example below).
Example of a log error: "Skipping line 3 in eStudyAt.csv: 'person,0,organisation,11344,2021': nodeID 11344 does not exist in the vOrganization.csv file."
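A minimal sketch of a thread-safe collector that logs problems as they are found and can dump a report at the end instead of failing the load. Class and method names are assumptions for illustration.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

class LoadWarningCollector {
public:
    void addWarning(const std::string& fileName, uint64_t lineNumber, const std::string& message) {
        std::lock_guard<std::mutex> lock(mtx);
        std::string entry = "Skipping line " + std::to_string(lineNumber) + " in " +
                            fileName + ": " + message;
        std::cerr << entry << std::endl;  // log immediately when the problem is found
        warnings.push_back(entry);        // keep for the final report
    }

    void writeReport(const std::string& reportPath) {
        std::lock_guard<std::mutex> lock(mtx);
        std::ofstream report(reportPath);
        for (const auto& w : warnings) {
            report << w << '\n';
        }
    }

private:
    std::mutex mtx;
    std::vector<std::string> warnings;
};
```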
[x] Consider whether we should treat relationship and node labels as case-insensitive. We should look at Neo4j to make this decision. => Let's be case-insensitive (Phase 2); see the sketch below.
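If labels are treated case-insensitively, one simple approach (an assumption, not a decided design) is to normalize every label to lowercase at the catalog boundary so that storage and lookups agree.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Normalize a node or relationship label to lowercase before storing or looking it up.
std::string normalizeLabel(const std::string& label) {
    std::string normalized = label;
    std::transform(normalized.begin(), normalized.end(), normalized.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return normalized;
}
```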