openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

Could somebody show me how to use Bulk Loader to import a CSV into Virtuoso as RDF #1303

Closed candlecao closed 3 months ago

candlecao commented 3 months ago

I asked a similar question to ChatGPT 4, and it gave me the following guide:

1. Prepare the CSV file in a path that can be accessed by Virtuoso

For example, a snippet (1 header row and 3 rows of entries):

Chant_ID,incipit,genre,src_link
562633,Ecce nunc palam loqueris et,http://www.wikidata.org/entity/Q582093,https://cantusdatabase.org/source/123756
671551,Dominus tamquam ovis ad victimam,http://www.wikidata.org/entity/Q582093,https://cantusdatabase.org/source/669163
562160,Seniores populi consilium fecerunt ut,http://www.wikidata.org/entity/Q604748,https://cantusdatabase.org/source/123756
...

--This file is placed in a folder called "my_virdb" (the directory where the Virtuoso server process was started; files like virtuoso.ini and virtuoso.db are located there)

2. Prepare a text file with suffix .ld

ld_dir ('.', 'sampleData.csv', 'http://example.com/mygraph/UseBulkLoaderForCSV');
DB.DBA.TTLP_MT (file_to_string_output ('./mappings.ttl'), '', 'http://example.com/mygraph/UseBulkLoaderForCSV', 2);
rdf_loader_run();
checkpoint;

--Given that the CSV file is named "sampleData.csv" and placed in the "my_virdb" folder, I supposed '.' means sampleData.csv is in the current working directory (the directory where the server was started).
--This associates the CSV columns with URIs via the mapping (see below)

3. Define mappings.ttl that appeared in the .ld file as above

The content in the file, e.g.:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix chant: <http://example.com/ontology/chant/> .
@prefix data: <http://example.com/data/> .

<http://example.com/mygraph/UseBulkLoaderForCSV> a rdfs:Graph ;
    rdfs:label "Chant data graph" .

chant:Chant_ID a rdfs:Property ;
    rdfs:label "Chant ID" ;
    rdfs:range xsd:string .

chant:incipit a rdfs:Property ;
    rdfs:label "Incipit" ;
    rdfs:range xsd:string .

chant:genre a rdfs:Property ;
    rdfs:label "Genre" ;
    rdfs:range xsd:URL .

chant:src_link a rdfs:Property ;
    rdfs:label "Source Link" ;
    rdfs:range xsd:URL .

--Place the mappings.ttl file in the same directory as your sampleData.csv.

4. Execute the script from the terminal

isql -U dba -P mysecret -S 1111 /path/to/my_virtdb/LoaderScript.ld


All of the above seemed to execute smoothly. However, after running the procedures above, when I executed select * from "DB.DBA.LOAD_LIST"; from the ISQL shell, or ran SPARQL SELECT * FROM <http://example.com/mygraph/UseBulkLoaderForCSV> WHERE { ?s ?p ?o } LIMIT 10;, no actual data was found!

What's wrong? Could somebody point me in the right direction?
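
(Aside for readers hitting the same wall: a quick way to check whether the RDF Bulk Loader actually queued and loaded anything is to inspect the DB.DBA.LOAD_LIST table from isql. The queries below are a minimal sketch using the graph IRI already mentioned in this thread; for a successfully loaded file, ll_state should be 2 and ll_error NULL.)

SELECT ll_file, ll_graph, ll_state, ll_error FROM DB.DBA.LOAD_LIST;
-- ll_state: 0 = queued, 1 = loading, 2 = loaded

SPARQL SELECT COUNT(*) FROM <http://example.com/mygraph/UseBulkLoaderForCSV> WHERE { ?s ?p ?o };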

candlecao commented 3 months ago

Supplemental: LoaderScript.ld is the name of the text file with suffix .ld

HughWilliams commented 3 months ago

That ChatGPT-generated output is incorrect (a hallucination): the Virtuoso RDF Bulk Loader only loads dataset files in the RDF formats indicated as supported.

To bulk load multiple CSV files, Virtuoso has an equivalent CSV Bulk Loader process, as detailed in the linked documentation.
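
(For reference, a minimal sketch of that CSV Bulk Loader flow from isql, once its procedures have been created; the directory, file mask, and use of checkpoint here are illustrative examples, not values copied from the documentation:)

csv_register ('./CSV', '*.csv');   -- queue CSV files found in the server-accessible ./CSV directory
csv_loader_run ();                 -- load each queued file into its own SQL table
checkpoint;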

kidehen commented 3 months ago

Note: Our Virtuoso Personal Assistant, an OpenLink Personal Assistant (OPAL) module, will provide you with better guidance about CSV bulk loading into Virtuoso. Just perform the following steps:

  1. Go to https://virtuoso.openlinksw.com and wait for the chatbot window to pop up
  2. Pose your question, e.g., I am stuck trying to load CSV into Virtuoso

You should get a different response from this OPAL-enclosed (and guardrailed) variant of ChatGPT, which is scoped to the Virtuoso Support Assistant module and its underlying Knowledge Graph.

candlecao commented 3 months ago

@HughWilliams Thank you. @kidehen Thank you. However, when I followed the tutorial and the OPAL guidance, I kept getting stuck with errors such as "Undefined procedure DB.DBA.csv_register" or "Undefined procedure DB.DBA.LOADE_RDF", etc.

HughWilliams commented 3 months ago

Have you loaded the CSV Bulk Loader scripts, which need to be loaded first to create the required CSV bulk loader functions, as detailed in the documentation?
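
(Two checks that may help here, sketched under the assumption that the procedure definitions from the documentation have been saved to a local file, e.g. csv_loader.sql; the file name and password are placeholders:)

-- from the shell: execute the script file with isql instead of pasting it into the prompt
isql 1111 dba <password> csv_loader.sql

-- from isql: confirm the CSV loader procedures now exist
SELECT P_NAME FROM DB.DBA.SYS_PROCEDURES WHERE upper (P_NAME) LIKE '%CSV%';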

candlecao commented 3 months ago

Hi, HughWilliams. I hadn't until you told me. But when I pasted the whole content underneath the headline "CSV Bulk Loader scripts" into the ISQL shell and executed it, it prompted an error (see the attached screenshot).

kidehen commented 3 months ago

Here's the complete script.

-- DROP TABLE IF EXISTS csv_load_list;
CREATE TABLE csv_load_list (
    cl_file VARCHAR, 
    cl_file_in_zip VARCHAR,
    cl_state INT DEFAULT 0, 
    cl_error LONG VARCHAR, 
    cl_table VARCHAR, 
    cl_options ANY,
    cl_started DATETIME, 
    cl_done DATETIME,
    PRIMARY KEY (cl_file, cl_file_in_zip));
CREATE INDEX cl_state ON csv_load_list (cl_state);

-- Callback for csv_parse: appends each parsed CSV row to the accumulator vector cbd.
CREATE PROCEDURE csv_cols_cb (
    INOUT  r    ANY, 
    IN     inx  INT, 
    INOUT  cbd  ANY)
{
  IF (cbd IS NULL)
    cbd := VECTOR ();
  cbd := vector_concat (cbd, VECTOR (r));
}
;

-- Infers column names and SQL data types from the CSV header row and a sample of data rows.
CREATE PROCEDURE csv_get_cols_array (
    INOUT  ss    ANY, 
    IN     hr    INT, 
    IN     offs  INT, 
    IN     opts  ANY)
{
  DECLARE h, res ANY;
  DECLARE inx, j, ncols, no_head INT;
  h := NULL;
  no_head := 0;
  IF (hr < 0)
  {
    no_head := 1;
    hr := 0;
  }
  IF (offs < 0)
    offs := 0;
  res := VECTOR ();
  csv_parse (ss, 'DB.DBA.csv_cols_cb', h, 0, offs + 10, opts);
  IF (h IS NOT NULL AND LENGTH (h) > offs)
  {
    DECLARE _row ANY;
    _row := h[hr];
    FOR (j := 0; j < LENGTH (_row); j := j + 1)           
    {
      res := vector_concat (res, VECTOR (VECTOR (SYS_ALFANUM_NAME (CAST (_row[j] AS VARCHAR)), NULL)));
    }
    FOR (inx := offs; inx < LENGTH (h); inx := inx + 1)
    { 
      _row := h[inx];
      FOR (j := 0; j < LENGTH (_row); j := j + 1)           
      {
        IF (res[j][1] IS NULL AND NOT (ISSTRING (_row[j]) AND _row[j] = '') AND _row[j] IS NOT NULL)
          res[j][1] := __tag (_row[j]);
        ELSE IF (__tag (_row[j]) <> res[j][1] AND 189 = res[j][1] AND (ISDOUBLE (_row[j]) OR isfloat (_row[j])))
          res[j][1] := __tag (_row[j]);
        ELSE IF (__tag (_row[j]) <> res[j][1] AND ISINTEGER (_row[j]) AND (res[j][1] = 219 OR 190 = res[j][1]))
          ;  
        ELSE IF (__tag (_row[j]) <> res[j][1])
          res[j][1] := -1;
      }
    } 
  }
  FOR (inx := 0; inx < LENGTH (res); inx := inx + 1)
  { 
    IF (NOT ISSTRING (res[inx][0]) AND NOT ISNULL (res[inx][0]))
      no_head := 1; 
    ELSE IF (trim (res[inx][0]) = '' OR ISNULL (res[inx][0]))
      res[inx][0] := sprintf ('COL%d', inx);     
  }  
  FOR (inx := 0; inx < LENGTH (res); inx := inx + 1)
  { 
    IF (res[inx][1] = -1 OR res[inx][1] IS NULL)
      res[inx][1] := 'VARCHAR';  
    ELSE
      res[inx][1] := dv_type_title (res[inx][1]);    
  }  
  IF (no_head)
  {
    FOR (inx := 0; inx < LENGTH (res); inx := inx + 1)
    { 
      res[inx][0] := sprintf ('COL%d', inx);
    }
  }
  RETURN res;
}
;
candlecao commented 3 months ago

Hi, HughWilliams. I did it! (Just by dividing those scripts into segments by ";" and executing them one by one.) Thank you. But after executing:

csv_register ('./CSV', '*.gz');
csv_loader_run ();

I only see the imported CSV data stored as SQL tables under Database > SQL Database Objects. How can I go further and convert it directly into RDF using the Bulk Loader?

candlecao commented 3 months ago


Thank you! I did it successfully.

HughWilliams commented 3 months ago

Now that you have successfully bulk loaded the CSV data into Virtuoso as SQL relational tables, you have to convert those tables to RDF Linked Data Views, as detailed in this Generation of Linked Data Views over Relational Data Sources with Virtuoso document.
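
(For completeness, and not as a substitute for the Linked Data Views approach Hugh links to: a small Virtuoso/PL loop can also materialize triples from the loaded SQL table directly into a named graph. This is only a sketch; the table name DB.DBA.sampleData, the column names, and the IRIs are assumptions taken from the sample CSV earlier in this thread.)

-- Sketch: materialize RDF from the SQL table produced by csv_loader_run ().
-- Assumes the loader created DB.DBA.sampleData with columns Chant_ID, incipit, genre, src_link.
CREATE PROCEDURE chant_csv_to_rdf ()
{
  DECLARE g VARCHAR;
  g := 'http://example.com/mygraph/UseBulkLoaderForCSV';
  FOR (SELECT Chant_ID, incipit, genre, src_link FROM DB.DBA.sampleData) DO
    {
      DECLARE s VARCHAR;
      s := sprintf ('http://example.com/data/chant/%s', CAST (Chant_ID AS VARCHAR));
      DB.DBA.RDF_QUAD_URI_L (g, s, 'http://example.com/ontology/chant/incipit', incipit);  -- literal object
      DB.DBA.RDF_QUAD_URI (g, s, 'http://example.com/ontology/chant/genre', genre);        -- IRI object
      DB.DBA.RDF_QUAD_URI (g, s, 'http://example.com/ontology/chant/src_link', src_link);  -- IRI object
    }
  COMMIT WORK;
}
;

chant_csv_to_rdf ();
checkpoint;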

candlecao commented 3 months ago

Thank you very much!