Table Linker

Table-Linker is an entity linking tool that links given strings to Wikidata Q nodes. This document describes the command-line interface for the Table Linker (tl) system.

Installation Instructions

Run the following commands in order in a terminal,

git clone https://github.com/usc-isi-i2/table-linker
cd table-linker

python3 -m venv tl_env
source tl_env/bin/activate
pip install -r requirements.txt

pip install -e .

If the python3 command is not available, find out which version of Python 3 is installed and use that command instead.

Alternatively, install using pip

python3 -m venv tl_env
source tl_env/bin/activate

pip install table-linker

Install via Docker

git clone https://github.com/usc-isi-i2/table-linker
cd table-linker
  1. Build the Docker image
    docker build -t table-linker .
  2. Run Docker container
    docker run \
    -v <local path on host machine with files to wikify>:/data  \
    -it -p 3322:3322 table-linker \
    /bin/bash -c "jupyter lab --ip='*' --port=3322 --allow-root --no-browser --notebook-dir /table-linker/notebooks"

Pipelines

The tl CLI works by pushing CSV data through a series of commands, starting with a single input on stdin and ending with a single output on stdout. This pipeline feature allows construction of pipelines for linking table cells to a knowledge graph (KG).

Usage: tl [OPTIONS] COMMAND [ / COMMAND]*

Table of Contents:

Note: only the commands marked below are currently implemented.

Options:

Common Options

These are options that can appear in different commands. We list them here so that options with the same meaning use the same character.

Error handling

In case of an error in any of the commands in the tl pipeline, the responsible command will print the error details and an error code, and the pipeline will halt.

Error details

Error details will contain the following information

Example

Command: get-exact-matches
Error Message:
 Traceback (most recent call last):
  File "get_candidates.py", line 7, in <module>
    raise HTTPUnAuthorizedError(msg)
  HTTP 403: Unauthorized attempt to connect to Elasticsearch
Error Code: 403

Commands On Raw Input Files

canonicalize[OPTIONS]

translate an input CSV or TSV file to canonical form

Options:

Examples:

   # Build a canonical file to link the 'people' and 'country' columns in the input file
   $ tl canonicalize -c people,country < input.csv > canonical-input.csv
   $ cat input.csv | tl canonicalize -c people,country > canonical-input.csv

   # Same, but using column as index to specify the country column
   $ tl canonicalize -c people,3 < input.csv > canonical-input.csv

File Example:

# Consider the following input file,
$ cat countries.csv

country        capital_city phone_code
Hungary        Buda’pest    +49
Czech Republic Prague       +420
United Kingdom London!      +44

# canonicalize the input file and process columns country and capital_city
$ tl canonicalize -c capital_city --csv countries.csv > countries_canonical.csv
$ cat countries_canonical.csv

column row label
1      0   Buda’pest
1      1   Prague
1      2   London!

$ cat chief_subset.tsv

col0  col1  col2
Russia  Pres. Vladimir Vladimirovich PUTIN
Russia  Premier Dmitriy Anatolyevich MEDVEDEV
Russia  First Dep. Premier  Anton Germanovich SILUANOV
Russia  Dep. Premier  Maksim Alekseyevich AKIMOV
Russia  Dep. Premier  Yuriy Ivanovich BORISOV
Russia  Dep. Premier  Konstatin Anatolyevich CHUYCHENKO
Russia  Dep. Premier  Tatyana Alekseyevna GOLIKOVA

# canonicalize the input file and process col2 with adding extra information
$ tl canonicalize -c col2  --add-context chief_subset.tsv > organizations_subset_col0_canonicalized.csv

# note that we get an extra column here, which is the information from the input file, combined by `|`
$ cat organizations_subset_col0_canonicalized.csv

column,row,label,context
2,0,Vladimir Vladimirovich PUTIN,Russia|Pres.
2,1,Dmitriy Anatolyevich MEDVEDEV,Russia|Premier
2,2,Anton Germanovich SILUANOV,Russia|First Dep. Premier
2,3,Maksim Alekseyevich AKIMOV,Russia|Dep. Premier
2,4,Yuriy Ivanovich BORISOV,Russia|Dep. Premier
2,5,Konstatin Anatolyevich CHUYCHENKO,Russia|Dep. Premier
2,6,Tatyana Alekseyevna GOLIKOVA,Russia|Dep. Premier

Implementation

Assign zero-based indices to the input columns and the corresponding rows. Columns are indexed from left to right and rows from top to bottom. The first row is the column header; the first data row is assigned index 0.
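
A minimal sketch of this indexing scheme in pandas (illustrative only; the actual implementation may differ):

```python
import io
import pandas as pd

def canonicalize(csv_text, columns):
    """Emit one output row per cell of the requested columns,
    with zero-based column and row indices."""
    df = pd.read_csv(io.StringIO(csv_text))
    records = []
    for col_name in columns:
        col_index = df.columns.get_loc(col_name)   # columns: left to right, zero-based
        for row_index, value in enumerate(df[col_name]):  # first data row is index 0
            records.append({"column": col_index, "row": row_index, "label": value})
    return pd.DataFrame(records)

print(canonicalize("country,capital_city\nHungary,Budapest\nCzech Republic,Prague\n",
                   ["capital_city"]))
```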

Commands On Canonical Files

Canonical Cell files contain one row per cell to be linked.

clean[OPTIONS]

The clean command cleans the cell values in a column, creating a new column with the clean values. The clean command performs two types of cleaning:

The clean command produces a file in the Canonical Cells format

Options:

Examples:

   # Clean the values in column 'label' using the default settings,
   # creating a column 'label_clean' with the clean values.
   $ tl clean -c label < canonical-input.csv

   # Remove all types of parenthesis from the label.
   $ tl clean -c label -o clean --symbols "(){}[]" --replace-by-space no < canonical-input.csv

    # Clean the values in column 'label', output column 'clean_labels', keeping the original values
    $ tl clean -c label -o clean_labels --keep-original yes canonical_input.csv

File Example:

# Consider the canonical file, countries_canonical.csv
$ cat countries_canonical.csv

column row label
1      0   Buda’pest
1      1   Prague
1      2   London!

# clean the column label and delete the default characters
$ tl clean -c label -o clean_labels --replace-by-space no countries_canonical.csv

column row label          clean_labels
1      0   Buda’pest      Budapest
1      1   Prague         Prague
1      2   London!        London

Candidate Generation Commands

Candidate Generation commands use external indices or APIs to retrieve candidate links for cells in a column. tl supports several strategies for generating candidates.

All candidate generation commands take a column in a Canonical Cells file as input and produce a set of KG identifiers for each row in the canonical file, with candidates stored one per row. A method column records the name of the strategy that produced each candidate.

When a cell contains a |-separated string (e.g., Pedro|Peter), the string is split on | and candidates are fetched for each of the resulting values.

Candidate Generation commands output a file in Candidates format

get-exact-matches[OPTIONS]

This command retrieves the identifiers of KG entities whose label or aliases match the input values exactly.

Options:

This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A | separated string represents multiple labels and aliases. The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.

This command will also add the column kg_descriptions to record English descriptions of the candidate knowledge graph object. In case of a missing description, an empty string "" is recorded. A '|' separated string represents multiple English descriptions.

The string exact-match is recorded in the column method to indicate the source of the candidates.

The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.

The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.

Examples:

   # generate candidates for the cells in the column 'label_clean'
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-exact-matches -c label_clean  < canonical-input.csv

   # clean the column 'label' and then generate candidates for the resulting column 'label_clean' with case insensitive matching
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label / get-exact-matches -c label_clean -i  < canonical-input.csv

File Example:

# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-exact-matches -c clean_labels  < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv

column row label     clean_labels kg_id     kg_labels                             method      retrieval_score
1      0   Buda’pest Budapest     Q1781     Budapest|Buda Pest|Buda-Pest|Buda     exact-match 15.43
1      0   Buda’pest Budapest     Q16467392 Budapest (chanson)                    exact-match 14.07
1      0   Buda’pest Budapest     Q55420238 Budapest|Budapest, a song             exact-match 13.33
1      1   Prague    Prague       Q1085     Prague|Praha|Praha|Hlavní město Praha exact-match 15.39
1      1   Prague    Prague       Q1953283  Prague, Oklahoma                      exact-match 14.44
1      1   Prague    Prague       Q2084234  Prague, Nebraska                      exact-match 13.99
1      1   Prague    Prague       Q5969542  Prague                                exact-match 14.88
1      2   London!   London       Q84       London|London, UK|London, England     exact-match 13.88
1      2   London!   London       Q92561    London ON                             exact-match 12.32

Implementation

The get-exact-matches command will be implemented using an ElasticSearch index built using an Edges file in KGTK format. Two ElasticSearch term queries are defined, one for exact match retrieval and one for case-insensitive exact match retrieval.
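
The exact query bodies are internal to tl, but a term query of roughly the following shape would implement both retrieval modes; the field names used here are illustrative assumptions, not the actual index mapping:

```python
# Illustrative sketch only: the field names "labels.keyword" and
# "labels.keyword_lower" are assumptions, not the actual index mapping.
def exact_match_query(value, size=50, case_sensitive=True):
    field = "labels.keyword" if case_sensitive else "labels.keyword_lower"
    term = value if case_sensitive else value.lower()
    return {"size": size, "query": {"term": {field: term}}}
```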

get-phrase-matches[OPTIONS]

Retrieves the identifiers of KG entities based on phrase-match queries.

Options:

This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A | separated string represents multiple labels and aliases. The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.

The string phrase-match is recorded in the column method to indicate the source of the candidates.

The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.

The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.

The filter argument is optional. If given, the operation specified in the string is executed and rows that do not satisfy the requirement are removed. If no candidates remain for a (column, row) pair after filtering, the generated phrase-match results are appended for that pair; otherwise nothing is appended.

Examples:

   # generate candidates for the cells in the column 'label_clean'
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-phrase-matches -c label_clean  < canonical-input.csv

   # generate candidates for the resulting column 'label_clean' with property alias boosted to 1.5 and fetch 20 candidates per query
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-phrase-matches -c label_clean -p "alias^1.5"  -n 20 < canonical-input.csv

   # generate candidates for the cells in the column 'label_clean' with the exact-match method and normalize the scores,
   # then filter out exact-match results with normalized score less than 0.9 and add candidates found by phrase-match
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label \
     / get-exact-matches -c label_clean / normalize-scores -c retrieval_score \
     / get-phrase-matches -c label_clean -n 5 --filter "retrieval_score_normalized > 0.9"

File Example:

# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-phrase-matches -c clean_labels  < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv

column  row  label      clean_labels  kg_id      kg_labels                                        method        retrieval_score
1       0    Buda’pest  Budapest      Q603551    Budapest|Budapest Georgia                        phrase-match  42.405098
1       0    Buda’pest  Budapest      Q20571386  .budapest|dot budapest                           phrase-match  42.375305
1       1    Prague     Prague        Q2084234   Prague|Prague  Nebraska                          phrase-match  37.18586
1       1    Prague     Prague        Q1953283   Prague|Prague Oklahoma                           phrase-match  36.9689
1       2    London!    London        Q261303    London|London                                    phrase-match  33.492584
1       2    London!    London        Q23939248  London|Greater London|London region              phrase-match  33.094616
0       0    Hungary    Hungary       Q5943060   Hungary|European Parliament election in Hungary  phrase-match  33.324196
0       0    Hungary    Hungary       Q40662208  CCC Hungary|Cru Hungary                          phrase-match  30.940805

get-kgtk-search-matches[OPTIONS]

Uses the KGTK Search API to retrieve identifiers of KG entities matching the input search term.

Options:

This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A | separated string represents multiple labels and aliases. The values to be added in the column kg_labels are retrieved from the KGTK search API.

The string kgtk-search is recorded in the column method to indicate the source of the candidates.

The KGTK API returns a score which is recorded in the column retrieval_score, by default. The scores are stored in the field score in the retrieved KGTK Search objects.

The identifiers for the candidate knowledge graph objects returned by the KGTK Search API are recorded in the column kg_id. The identifiers are stored in the field qnode in the retrieved objects.

Examples:

   # generate candidates for the cells in the column 'label_clean'
   $ tl get-kgtk-search-matches -c clean_label  < canonical-input.csv

   # generate candidates for the resulting column 'label_clean', record score in a column named `kgtk_score` and fetch 100 candidates per query
   $ tl get-kgtk-search-matches -c clean_label -o kgtk_score -n 100 < canonical-input.csv

File Example:

# generate candidates for the canonical file, countries_canonical.csv
$ tl get-kgtk-search-matches -c clean_label -n 5 < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv

column  row  label      clean_label  kg_id      pagerank                kg_labels                 method             retrieval_score
1       0    Buda’pest  Buda'pest    Q1781      3.024635812034009e-05   Budapest                  kgtk-search        6.0555077
1       0    Buda’pest  Buda'pest    Q390287    1.6043048855756725e-06  Eötvös Loránd University  kgtk-search        0.113464035
1       0    Buda’pest  Buda'pest    Q330195    1.8786205914524693e-07  Budapest District IV      kgtk-search        0.032946322
1       0    Buda’pest  Buda'pest    Q11384977  1.9704309143294065e-07  Budapest District XVIII   kgtk-search        0.028489502
1       0    Buda’pest  Buda'pest    Q851057    6.023225393167536e-08   Budapest District XX      kgtk-search        0.009545079
1       1    Prague     Prague       Q1085      0.00018344224711178576  Prague                    kgtk-search        2775.5046
1       1    Prague     Prague       Q1953283   3.114336919518117e-07   Prague                    kgtk-search        4.712032
1       1    Prague     Prague       Q3563550   1.795483402201142e-05   "University in Prague"    kgtk-search        0.92587674
1       1    Prague     Prague       Q2444636   7.4743621100407685e-06  Prague 2                  kgtk-search        0.8236602
1       1    Prague     Prague       Q31519     2.1206315414017163e-05  Charles University        kgtk-search        0.55166924
1       2    London!    London       Q84        0.0001293721468732613   London                    kgtk-search        1720.4109
1       2    London!    London       Q23939248  2.376990720977285e-06   London                    kgtk-search        31.609592
1       2    London!    London       Q92561     2.016176229692049e-06   London                    kgtk-search        26.811426
1       2    London!    London       Q935090    6.648478700956284e-07   London Recordings         kgtk-search        8.84125
1       2    London!    London       Q1281978   6.987015900462481e-08   London                    kgtk-search        0.92914426

get-fuzzy-matches[OPTIONS]

Retrieves the identifiers of KG entities based on fuzzy-match queries.

Options:

This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A | separated string represents multiple labels and aliases. The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.

The string fuzzy-match is recorded in the column method to indicate the source of the candidates.

The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.

The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.

Examples:

   # generate candidates for the cells in the column 'label_clean'
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-fuzzy-matches -c label_clean  < canonical-input.csv

   # generate candidates for the resulting column 'label_clean' with property alias boosted to 1.5 and fetch 20 candidates per query
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-fuzzy-matches -c label_clean -p "alias^1.5"  -n 20 < canonical-input.csv

   # generate candidates for the cells in the column 'label_clean' with the exact-match and fuzzy-match methods,
   # then normalize the scores
   $ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label \
     / get-exact-matches -c label_clean \
     / get-fuzzy-matches -c label_clean -n 5 \
     / normalize-scores -c retrieval_score

File Example:

# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd  get-fuzzy-matches -c clean_labels  < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv

column  row  label      clean_labels  kg_id      kg_labels                                        method       retrieval_score
1       0    Buda’pest  Budapest      Q603551    Budapest|Budapest Georgia                        fuzzy-match  42.405098
1       0    Buda’pest  Budapest      Q20571386  .budapest|dot budapest                           fuzzy-match  42.375305
1       1    Prague     Prague        Q2084234   Prague|Prague  Nebraska                          fuzzy-match  37.18586
1       1    Prague     Prague        Q1953283   Prague|Prague Oklahoma                           fuzzy-match  36.9689
1       2    London!    London        Q261303    London|London                                    fuzzy-match  33.492584
1       2    London!    London        Q23939248  London|Greater London|London region              fuzzy-match  33.094616
0       0    Hungary    Hungary       Q5943060   Hungary|European Parliament election in Hungary  fuzzy-match  33.324196
0       0    Hungary    Hungary       Q40662208  CCC Hungary|Cru Hungary                          fuzzy-match  30.940805

Implementation

Fuzzy matching is based on edit distance. For example, if the input query string is Gura, possible candidates could be Guma, Guna, and Guba, each of which has an edit distance of 1 from the original input. The smaller the edit distance, the higher the returned retrieval_score.
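
For intuition, the classic dynamic-programming edit distance (the actual fuzzy matching is delegated to Elasticsearch fuzzy queries, not computed like this by tl):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

assert edit_distance("Gura", "Guma") == 1
assert edit_distance("Gura", "Guba") == 1
```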

get-fuzzy-augmented-matches[OPTIONS]

Uses an Elasticsearch index that contains labels and aliases in multiple languages, along with Wikipedia and wikitable anchor text. The index also has a field named redirect_text, which holds all the Wikipedia redirects mapped to the corresponding Q-node in Wikidata.

Options:

   # generate candidates for the cells in the column 'label_clean'
   $ tl --es-url http://blah.com --es-index augmented_index -Ujohn -Ppwd  get-fuzzy-augmented-matches -c label_clean canonical-input.csv > candidates_output.csv

File Example:

$ tl clean -c label -o label_clean canonical-input.csv / get-fuzzy-augmented-matches --es-url http://blah.com --es-index augmented_index -c label_clean > candidates_output.csv

column,row,label,label_clean,kg_id,kg_labels,method,retrieval_score
1,0,Hank Aaron,Hank Aaron,Q215777,Hank Aaron,fuzzy-augmented,37.63053
1,0,Hank Aaron,Hank Aaron,Q47513596,Hank Aaron,fuzzy-augmented,16.903837
1,0,Hank Aaron,Hank Aaron,Q1518478,Hank Aaron Award,fuzzy-augmented,19.805542
1,0,Hank Aaron,Hank Aaron,Q14679126,Hank Aaron Stadium,fuzzy-augmented,28.061468
1,0,Hank Aaron,Hank Aaron,Q28453830,Hank Aaron State Trail,fuzzy-augmented,26.173532
1,0,Hank Aaron,Hank Aaron,Q92433937,Reflections on Hank Aaron,fuzzy-augmented,26.173532
1,0,Hank Aaron,Hank Aaron,Q6665277,Template:AL Hank Aaron Award Winners,fuzzy-augmented,24.523617
1,0,Hank Aaron,Hank Aaron,Q5648263,Hank Aaron: Chasing the Dream,fuzzy-augmented,24.523617
1,0,Hank Aaron,Hank Aaron,Q8853836,Template:NL Hank Aaron Award Winners,fuzzy-augmented,24.523617
1,0,Hank Aaron,Hank Aaron,Q66847614,President Carter with Hank Aaron (NAID 180805),fuzzy-augmented,21.777962
1,0,Hank Aaron,Hank Aaron,Q16983107,Oak Leaf Trail,fuzzy-augmented,19.035532

Adding Features Commands

Add-Feature commands add one or more features for the candidate knowledge graph objects for the input cells. All Add-Feature commands take a column in a Candidate or a Feature file and output a Feature file.

add-text-embedding-feature[OPTIONS]

The add-text-embedding-feature command computes text embedding vectors of the candidates and similarity to rank candidates. The basic idea is to compute a vector for a column in a table and then rank the candidates for each cell by measuring similarity between each candidate vector and the column vector.

Options:

Detailed explanations:

Examples:

# run text embedding command to add an extra column `embed-score` with ground-truth strategy and use all nodes to calculate centroid
$ tl add-text-embedding-feature input_file.csv \
  --column-vector-strategy ground-truth \
  --centroid-sampling-amount 0 \
  --output-column-name embed-score

# run text embedding command to add an extra column `embed-score` using the ground-truth strategy and up to 5 nodes to calculate the centroid;
# the generated sentence contains only label and description information. Also apply TSNE to the generated embedding vectors,
# and save the corresponding detailed vectors file to `vectors.tsv`
$ tl add-text-embedding-feature input_file.csv \
  --embedding-model bert-base-nli-mean-tokens \
  --column-vector-strategy ground-truth \
  --centroid-sampling-amount 5 \
  --isa-properties None \
  --has-properties None \
  --run-TSNE true \
  --generate-projector-file vectors.tsv

File Example:

    column  row                                          label  ...                                    GT_kg_label evaluation_label embed-score
0        0    2                        Trigeminal nerve nuclei  ...                        Trigeminal nerve nuclei                1    0.925744
1        0    3                       Trigeminal motor nucleus  ...                       Trigeminal motor nucleus                1    0.099415
2        0    4                          Substantia innominata  ...                          Substantia innominata                1    0.070117
3        0    6                                    Rhombic lip  ...                                    Rhombic lip                1    1.456694
4        0    7                                 Rhinencephalon  ...                                 Rhinencephalon                1    0.471636
5        0    9  Principal sensory nucleus of trigeminal nerve  ...  Principal sensory nucleus of trigeminal nerve                1    1.936707
6        0   12                     Nucleus basalis of Meynert  ...                     Nucleus basalis of Meynert                1    0.130171
7        0   14      Mesencephalic nucleus of trigeminal nerve  ...      Mesencephalic nucleus of trigeminal nerve                1    1.746346
8        0   17                         Diagonal band of Broca  ...                         Diagonal band of Broca                1    0.520857
9        0    1                                 Tuber cinereum  ...                                 tuber cinereum                1    0.116646
10       0    1                                 Tuber cinereum  ...                                 tuber cinereum               -1    0.192494
11       0    1                                 Tuber cinereum  ...                                 tuber cinereum               -1    0.028620

Implementation

This command mainly wraps KGTK's text-embedding functions; please refer to KGTK's README page for details.

align-page-rank[OPTIONS]

Generates the aligned_pagerank feature, which is used by the vote-by-classifier command. Aligned page rank means that exact-match candidates retain their page rank, while fuzzy-match candidates receive 0.
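
The rule can be sketched in a line of pandas (illustrative, not the command's actual code):

```python
import pandas as pd

def align_page_rank(df: pd.DataFrame) -> pd.DataFrame:
    # exact-match candidates keep their pagerank; all others get 0
    df["aligned_pagerank"] = df["pagerank"].where(df["method"] == "exact-match", 0.0)
    return df
```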

Examples:

tl align-page-rank candidates.csv > aligned_candidates.csv

File Examples:

|column|row|label          |kg_id     |pagerank          |method         |aligned_pagerank|
|------|---|---------------|----------|------------------|---------------|----------------|
|1     |0  |Citigroup      |Q219508   |3.988134e-09      |exact-match    |3.988134e-09    |
|1     |1  |Bank of America|Q487907   |5.115590e-09      |exact-match    |5.115590e-09    |
|1     |1  |Bank of America|Q50316068 |5.235995e-09      |exact-match    |5.235995e-09    |
|1     |10 |BP             |Q100151423|5.115590e-09      |fuzzy-augmented|0.000000e+00    |
|1     |10 |BP             |Q131755   |5.235995e-09      |fuzzy-augmented|0.000000e+00    |

check-candidates[OPTIONS]

The check-candidates command takes a candidates/features file and returns the rows for which the ground truth was never retrieved as a candidate. The ground-truth-labeler command needs to be run beforehand for this command to work.

This command follows this procedure:

Step 1: Group the candidates dataframe by column and row.

Following is a snippet of the input file.

column row label context label_clean kg_id kg_labels kg_aliases method kg_descriptions pagerank retrieval_score GT_kg_id GT_kg_label evaluation_label
0 4 Salceto Saliceto|Cortemilia-Saliceto Salceto Q197728 Santiago Salcedo "Santiago Gabriel Salcedo|Santiago Gabriel Salcedo Gonzalez|S. Salcedo|S. G. S. González|Santiago G. Salcedo González|González, S. G. S.|Santiago Gabriel Salcedo González|Santiago Gabriel S. González|Salcedo, S." fuzzy-augmented Paraguayan association football player 3.976872442613597e-09 16.31549 Q52797639 Saliceto -1
0 4 Salceto Saliceto|Cortemilia-Saliceto Salceto Q19681762 Saúl Salcedo "Saul salcedo|Saul Salcedo|Saúl Savín Salcedo Zárate|S. Salcedo|Saul Savin Salcedo Zarate|Salcedo, S." fuzzy-augmented Paraguayan footballer 3.5396131256502836e-09 16.12341 Q52797639 Saliceto -1
0 4 Salceto Saliceto|Cortemilia-Saliceto Salceto Q12856 Salcedo Baugen fuzzy-augmented municipality of the Philippines in the province of Ilocos Sur 1.7080570334293118e-08 15.950816 Q52797639 Saliceto -1

Step 2: Check if the grouped dataframe contains a 1 in the evaluation_label column.

Step 3: If not, add the column, row, label, context, GT_kg_id, and GT_kg_label to the output. If the GT_kg_description of the Q-node is available, append it to the output as well.

Examples:

$ tl check-candidates input.csv

File Example:

$ tl check-candidates input.csv
column row label context GT_kg_id GT_kg_label
0 4 Salceto Saliceto|Cortemilia-Saliceto Q52797639 Saliceto

check-extra-information[OPTIONS]

The check-extra-information command adds a feature column by checking whether any of the extra information from the original file is matched, and returns a score based on the amount of matched information.

The program checks each node's property values and the corresponding Wikipedia page, if one exists. If any of the labels found there are the same as the provided extra information, it counts as a hit; otherwise it does not. Since the original input file usually has multiple columns, each column is treated as one part, and the score is count(hit_parts) / count(all_parts). The maximum score is 1, when all of the provided extra information is hit.
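
Under this definition the score reduces to a hit ratio over the |-separated context parts; a sketch, with the property-value and Wikipedia lookups simplified to a set membership test:

```python
def extra_information_score(context: str, node_strings: set) -> float:
    """Fraction of |-separated context parts found among the
    candidate node's property values / Wikipedia labels."""
    parts = [p for p in context.split("|") if p]
    if not parts:
        return 0.0
    hits = sum(1 for p in parts if p in node_strings)
    return hits / len(parts)

# e.g. context "Russia|Pres." where only "Russia" is found -> 0.5
assert extra_information_score("Russia|Pres.", {"Russia"}) == 0.5
```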

Options:

Examples:

# add the extra-information feature column with external extra information file
$ tl check-extra-information input_file.csv \
  --extra-information-file extra_info.csv \
  --output-column-name extra_information_score > output_file.csv

File Example:

# add the extra-information feature column
$ tl check-extra-information input_file.csv \
  --output-column-name extra_information_score > output_file.csv
$ cat output_file.csv

column  row  label  ||other_information||  label_clean  ...  GT_kg_id  GT_kg_label evaluation_label  gt_embed_score  extra_information_score
2  0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN  ... Q7747 Vladimir Putin  1  1.297309  0.5
2  0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN  ... Q7747 Vladimir Putin -1  1.290919  0.0
2  0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN  ... Q7747 Vladimir Putin -1  0.651267  0.0
2  0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN  ... Q7747 Vladimir Putin -1  0.815978  0.0
2  0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN  ... Q7747 Vladimir Putin -1  0.778838  0.0
...  ...  ...  ...  ...  ...  ... ...  ...  ... ...  ...
2 40  Vasiliy Alekseyevich NEBENZYA  Russia|Permanent Representative to the UN, New...  Vasiliy Alekseyevich NEBENZYA  ...  Q1000053  Vasily Nebenzya -1  0.950004  0.0
2 40  Vasiliy Alekseyevich NEBENZYA  Russia|Permanent Representative to the UN, New...  Vasiliy Alekseyevich NEBENZYA  ...  Q1000053  Vasily Nebenzya -1  0.763486  0.0
2 40  Vasiliy Alekseyevich NEBENZYA  Russia|Permanent Representative to the UN, New...  Vasiliy Alekseyevich NEBENZYA  ...  Q1000053  Vasily Nebenzya -1  1.219794  0.5
2 40  Vasiliy Alekseyevich NEBENZYA  Russia|Permanent Representative to the UN, New...  Vasiliy Alekseyevich NEBENZYA  ...  Q1000053  Vasily Nebenzya -1  1.225877  0.0
2 40  Vasiliy Alekseyevich NEBENZYA  Russia|Permanent Representative to the UN, New...  Vasiliy Alekseyevich NEBENZYA  ...  Q1000053  Vasily Nebenzya -1  1.185123  0.5

Implementation

Wikidata part: achieved with a Wikidata SPARQL query that retrieves all properties of the Q-nodes. Wikipedia part: achieved with the Python package wikipedia-api.

compute-tf-idf[OPTIONS]

The compute-tf-idf command adds a feature column by computing a tf-idf-like score based on all candidates for an input column.

This command follows this procedure:

Step 1: Get the set of high confidence candidates. High confidence candidates are defined as candidates that have the method exact-match and are the only candidate for their cell.

Step 2: For each of the high confidence candidates, get the class-count data. This data is stored in an Elasticsearch index and is gathered during the candidate generation step.

The data consists of q-node:count pairs, where the q-node represents a class and the count is the number of instances below the class. These counts use a generalized version of is-a where occupation and position held are considered is-a, e.g., Schwarzenegger is an actor.

Similarly, another dataset consists of p-node:count pairs, where the p-node represents a property the candidate q-node has and the count is the total number of q-nodes in the corpus that have this property.

Step 3: Make a set of all the classes that appear in the high confidence candidates, and count the number of times each class occurs across those candidates. For example, if two high confidence candidates are human, then Q5 will have num-occurrences = 2.

Step 4: Convert the instance counts for the set constructed in Step 3 to IDF (see https://en.wikipedia.org/wiki/Tf–idf), and then multiply the IDF score of each class by the num-occurrences number from Step 3. Then, normalize them so that all the IDF scores for the high confidence candidates sum to 1.0.

Step 5: For each candidate, including the high confidence candidates, compute the tf-idf score by adding up the IDF scores (computed in Step 4) of all its classes. If a class appears in the high confidence classes, its IDF is multiplied by 1, otherwise by 0.
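
A condensed sketch of Steps 3-5, assuming the class-count data has already been fetched (all names are illustrative):

```python
import math
from collections import Counter

def tf_idf_scores(candidate_classes, high_conf_ids, corpus_size, instance_counts):
    """candidate_classes: {qnode -> set of class qnodes}.
    high_conf_ids: qnodes of the high confidence (singleton exact-match) candidates.
    instance_counts: {class qnode -> number of instances below that class}."""
    # Step 3: count how often each class occurs across high confidence candidates
    occurrences = Counter(c for q in high_conf_ids for c in candidate_classes[q])
    # Step 4: weight each class by num-occurrences * IDF, then normalize to sum to 1.0
    weights = {c: n * math.log(corpus_size / instance_counts[c])
               for c, n in occurrences.items()}
    total = sum(weights.values()) or 1.0
    weights = {c: w / total for c, w in weights.items()}
    # Step 5: a candidate's score sums the weights of the classes it shares
    # with the high confidence set (classes outside that set contribute 0)
    return {q: sum(weights.get(c, 0.0) for c in classes)
            for q, classes in candidate_classes.items()}
```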

Options:

Examples:

$ tl compute-tf-idf --feature-file class_count.tsv \
     --feature-name class_count \
     --singleton-column singleton \
     -o class_count_tf_idf_score \
     candidates.csv

File Example:

$ tl compute-tf-idf --feature-file class_count.tsv \
     --feature-name class_count \
     --singleton-column singleton \
     -o class_count_tf_idf_score \
     candidates.csv

$ cat input_file.csv
| column | row | label       | context                                   | label_clean | kg_id      | kg_labels                | kg_aliases                             | method          | kg_descriptions                     | pagerank               | retrieval_score | singleton | 
|--------|-----|-------------|-------------------------------------------|-------------|------------|--------------------------|----------------------------------------|-----------------|-------------------------------------|------------------------|-----------------|-----------| 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q213854    | Virat Kohli              | Cheeku                                 | fuzzy-augmented | Indian cricket player               | 3.983031232217997e-09  | 36.39384        | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q102354285 | Marie Virat              |                                        | fuzzy-augmented | Ph. D. 2009                         | 5.918546005357847e-09  | 23.48463        | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16027751  | Bernard Virat            |                                        | fuzzy-augmented | French biologist (1921-2003)        | 3.7401912005599e-09    | 23.48463        | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q7907059   | VIRAT                    |                                        | fuzzy-augmented |                                     | 0.0                    | 20.582134       | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q2978459   | Virata                   | Virat                                  | fuzzy-augmented | character from the epic Mahabharata | 6.8901323967569805e-09 | 20.520416       | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16682735  |                          |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.623405       | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q6426050   | Kohli                    |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.601744       | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q46251     | Fränzi Mägert-Kohli      | Franziska Kohli\|Fraenzi Maegert-Kohli | fuzzy-augmented | Swiss snowboarder                   | 3.5396131256502836e-09 | 19.233713       | 0         | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16434086  | Wirat Wachirarattanawong |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.010628       | 0         | 

$ cat output_file.csv
| column | row | label       | context                                   | label_clean | kg_id      | kg_labels                | kg_aliases                             | method          | kg_descriptions                     | pagerank               | retrieval_score | singleton | class_count_tf_idf_score | 
|--------|-----|-------------|-------------------------------------------|-------------|------------|--------------------------|----------------------------------------|-----------------|-------------------------------------|------------------------|-----------------|-----------|--------------------------| 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q213854    | Virat Kohli              | Cheeku                                 | fuzzy-augmented | Indian cricket player               | 3.983031232217997e-09  | 36.39384        | 0         | 1.0000000000000002       | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q102354285 | Marie Virat              |                                        | fuzzy-augmented | Ph. D. 2009                         | 5.918546005357847e-09  | 23.48463        | 0         | 0.5442234316047089       | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16027751  | Bernard Virat            |                                        | fuzzy-augmented | French biologist (1921-2003)        | 3.7401912005599e-09    | 23.48463        | 0         | 0.5442234316047089       | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q7907059   | VIRAT                    |                                        | fuzzy-augmented |                                     | 0.0                    | 20.582134       | 0         | 0.0                      | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q2978459   | Virata                   | Virat                                  | fuzzy-augmented | character from the epic Mahabharata | 6.8901323967569805e-09 | 20.520416       | 0         | 0.031105662154115882     | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16682735  |                          |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.623405       | 0         | 0.20287301482664413      | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q6426050   | Kohli                    |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.601744       | 0         | 0.018154036805015324     | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q46251     | Fränzi Mägert-Kohli      | Franziska Kohli\|Fraenzi Maegert-Kohli | fuzzy-augmented | Swiss snowboarder                   | 3.5396131256502836e-09 | 19.233713       | 0         | 0.6945347101120541       | 
| 0      | 0   | Virat Kohli | royal challengers bangalore\|152\|5/11/88 | Virat Kohli | Q16434086  | Wirat Wachirarattanawong |                                        | fuzzy-augmented |                                     | 3.5396131256502836e-09 | 19.010628       | 0         | 0.5442234316047089       | 


context-match[OPTIONS]

The context-match function adds a feature column by matching the context values of each candidate to its properties and calculating the score based on the match.

This command follows this procedure:

Step 1: For every candidate in the input file, the context is present in the context column, separated by "|". Each individual context value can represent a string, a quantity, or a date.

Following is a snippet of the input file.

column row label context
1 0 The Social Network 1|2010|David Fincher|8.3|45993
1 1 Inception 2|2010|Christopher Nolan|8.9|333261

The context file contains properties and their values for each candidate. Match the context values to these property values.

Following is a snippet of the context file.

qnode context
Q185888 d"2010":P577|i"(en)":P364:Q1860|i"\'merica":P495:Q30|...

Following is a snippet of a custom context file.

node1 label node2
Q185888 context d"2010":P577|i"(en)":P364:Q1860|i"\'merica":P495:Q30|...

Try to match to date, quantity, and then string, in that order, depending on the given similarity thresholds (dates are matched with a threshold of 1).

Step 2: The result of matching is a property value and the similarity of the match. For each row, calculate the number of occurrences of each property that appears, taking position into account. Position differentiates between the context values separated by "|".

Next, calculate the cell property value by dividing the actual number of occurrences (1) by the total number of occurrences.

Step 3: Calculate the property value of each property by dividing the previously calculated property value by the total number of rows in the cell.

Step 4: Calculate the score for each candidate by multiplying each property value by the corresponding similarity and summing over all the properties.
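
A simplified sketch of this final scoring step, assuming the per-property weights from Steps 2-3 have already been computed:

```python
def context_match_score(matched_properties, property_weights):
    """matched_properties: list of (property, similarity) pairs for one candidate,
    e.g. [("P577", 1.0), ("P57", 0.9)] for a matched date and director.
    property_weights: {property -> weight derived in Steps 2-3} (assumed given)."""
    return sum(property_weights.get(prop, 0.0) * sim
               for prop, sim in matched_properties)

# hypothetical weights for two properties
print(context_match_score([("P577", 1.0), ("P57", 0.9)],
                          {"P577": 0.4, "P57": 0.35}))  # -> 0.715
```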

Options:

Examples:

$ tl context-match movies.csv \
     --context-file movies_context.csv \
     --similarity-quantity-threshold 0.7 \
     --similarity-string-threshold 0.5 \
     -o match_score

File Example:

$ tl context-match movies.csv \
     --context-file movies_context.csv \
     --similarity-quantity-threshold 0.7 \
     --similarity-string-threshold 0.5 \
     -o match_score
column row label context kg_id match_score
1 0 The Social Network 1|2010|David Fincher|8.3|45993 Q185888 0.6337
1 0 The Social Network 1|2010|David Fincher|8.3|45993 Q1952928 0.5324
1 1 Inception 2|2010|Christopher Nolan|8.9|333261 Q42341440 0.6894
1 1 Inception 2|2010|Christopher Nolan|8.9|333261 Q25188 0.6769
1 10 The Hangover 11|2009|Todd Phillips|7.9|154719 Q1587838 0.6337
1 10 The Hangover 11|2009|Todd Phillips|7.9|154719 Q219315 0.6337

create-pseudo-gt[OPTIONS]

The create-pseudo-gt command takes a features file and a string indicating the features, and the corresponding thresholds, by which the pseudo ground truth is to be computed. It creates a new feature indicating whether the candidate is part of the pseudo ground truth (indicated with 1) or not (indicated with 0).

This command follows this procedure:

Step 1: Read the input file and check for validity.

Following is a snippet of the input file:

column row label context label_clean kg_id kg_labels kg_aliases method kg_descriptions pagerank retrieval_score GT_kg_id GT_kg_label evaluation_label aligned_pagerank monge_elkan monge_elkan_aliases jaro_winkler levenshtein des_cont_jaccard des_cont_jaccard_normalized smallest_qnode_number num_char num_tokens singleton context_score
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346968 "Sekhmatia Union, Nazirpur" exact-match "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 21.686 Q22346968 "Sekhmatia Union, Nazirpur" 1 3.5396131256502836e-09 1.0 0.0 1.0 1.0 0.0 0.0 0 25 3 1 0.89
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346968 "Sekhmatia Union, Nazirpur" fuzzy-augmented "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 36.477844 Q22346968 "Sekhmatia Union, Nazirpur" 1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0 25 3 0 0.89
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346967 "Nazirpur Union, Nazirpur" fuzzy-augmented "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 25.24847 Q22346968 "Sekhmatia Union, Nazirpur" -1 0.0 0.9151234567901234 0.0 0.7677777777777778 0.64 0.0 0.0 0 24 3 0 0.0

Step 2: Set the output column value to 1 if the singleton feature is 1, or if the context score is greater than or equal to the threshold set by the user (default = 0.7).
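
The rule in Step 2, written as a pandas expression (a sketch of the logic, not the command's code):

```python
import pandas as pd

def create_pseudo_gt(df: pd.DataFrame, thresholds: dict,
                     out_col: str = "pseudo_gt") -> pd.DataFrame:
    """thresholds, e.g. {"singleton": 1, "context_score": 0.7},
    mirrors the --column-thresholds option."""
    qualifies = pd.Series(False, index=df.index)
    for col, threshold in thresholds.items():
        # a candidate qualifies if any feature clears its threshold
        qualifies |= df[col] >= threshold
    df[out_col] = qualifies.astype(float)
    return df
```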

Options:

Examples:

$ tl create-pseudo-gt input.csv \
--column-thresholds singleton:1,context_score:0.7 \
-o pseudo_gt

File Example:

$ tl create-pseudo-gt input.csv \
--column-thresholds singleton:1,context_score:0.7 \
-o pseudo_gt
column row label context label_clean kg_id kg_labels kg_aliases method kg_descriptions pagerank retrieval_score GT_kg_id GT_kg_label evaluation_label aligned_pagerank monge_elkan monge_elkan_aliases jaro_winkler levenshtein des_cont_jaccard des_cont_jaccard_normalized smallest_qnode_number num_char num_tokens singleton context_score pseudo_gt
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346968 "Sekhmatia Union, Nazirpur" exact-match "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 21.686 Q22346968 "Sekhmatia Union, Nazirpur" 1 3.5396131256502836e-09 1.0 0.0 1.0 1.0 0.0 0.0 0 25 3 1 0.89 1.0
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346968 "Sekhmatia Union, Nazirpur" fuzzy-augmented "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 36.477844 Q22346968 "Sekhmatia Union, Nazirpur" 1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0 25 3 0 0.89 1.0
0 0 "Sekhmatia Union, Nazirpur" 11877|11502 "Sekhmatia Union, Nazirpur" Q22346967 "Nazirpur Union, Nazirpur" fuzzy-augmented "Union of Nazirpur Upazilla, Pirojpur District" 3.5396131256502836e-09 25.24847 Q22346968 "Sekhmatia Union, Nazirpur" -1 0.0 0.9151234567901234 0.0 0.7677777777777778 0.64 0.0 0.0 0 24 3 0 0.0 0.0

create-singleton-feature[OPTIONS]

The command takes as input a candidate file and filters the candidates retrieved by exact match. Cells having a single exact-match candidate are given a boolean label of 1; others are given a boolean label of 0.

The command takes one command line parameter:

Example Command

$ tl create-singleton-feature -o singleton companies_candidates.csv > companies_singletons.csv
$ cat companies_singletons.csv

|column|row|label          |kg_id     |kg_labels         |method     |singleton|
|------|---|---------------|----------|------------------|-----------|---------|
|1     |0  |Citigroup      |Q219508   |Citigroup         |exact-match|1        |
|1     |1  |Bank of America|Q487907   |Bank of America   |exact-match|0        |
|1     |1  |Bank of America|Q50316068 |Bank of America   |exact-match|0        |
|1     |10 |BP             |Q1004647  |bullous pemphigoid|exact-match|0        |
|1     |10 |BP             |Q100151423|brutal prog       |exact-match|0        |
|1     |10 |BP             |Q131755   |bipolar disorder  |exact-match|0        |
|1     |10 |BP             |Q11605804 |BlitzPlus         |exact-match|0        |
|1     |10 |BP             |Q152057   |BP                |exact-match|0        |
|1     |10 |BP             |Q27968500 |BP                |exact-match|0        |
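
The feature can be sketched with a groupby over the exact-match candidates (illustrative only):

```python
import pandas as pd

def create_singleton_feature(df: pd.DataFrame, out_col: str = "singleton") -> pd.DataFrame:
    exact = df[df["method"] == "exact-match"]
    # number of exact-match candidates per cell
    counts = exact.groupby(["column", "row"])["kg_id"].transform("count")
    df[out_col] = 0
    df.loc[exact.index[counts == 1], out_col] = 1
    return df
```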

generate-reciprocal-rank[OPTIONS]

The command takes as input a candidate file and a score column that needs to be used for generating the reciprocal rank.

The command takes the following parameters:

Example Command

$ tl generate-reciprocal-rank -c graph-embedding-score -o reciprocal_rank companies.csv > companies_reciprocal_rank.csv

$ cat companies_reciprocal_rank.csv
|column|row|label          |kg_id     |kg_labels         |method     |graph-embedding-score|reciprocal_rank   |
|------|---|---------------|----------|------------------|-----------|---------------------|------------------|
|1     |0  |Citigroup      |Q219508   |Citigroup         |fuzzy-augmented|0.8419203745525644   |1.0               |
|1     |0  |Citigroup      |Q219508   |Citigroup         |exact-match|0.8419203745525644   |0.5               |
|1     |0  |Citigroup      |Q857063   |Citibank          |fuzzy-augmented|0.7356934287270128   |0.3333333333333333|
|1     |0  |Citigroup      |Q1023765  |CIT Group         |fuzzy-augmented|0.7323310965247516   |0.25              |
|1     |0  |Citigroup      |Q856322   |CITIC Group       |fuzzy-augmented|0.7199133878669514   |0.2               |
|1     |0  |Citigroup      |Q11307286 |Citigroup Japan Holdings|fuzzy-augmented|0.7126768515646021   |0.1666666666666666|
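
A sketch of the computation using pandas rank; method='first' breaks ties by row order, which reproduces the example above, though the command's actual tie-breaking may differ:

```python
import pandas as pd

def generate_reciprocal_rank(df: pd.DataFrame, score_col: str,
                             out_col: str = "reciprocal_rank") -> pd.DataFrame:
    # rank candidates within each cell by descending score, then invert the rank
    ranks = df.groupby(["column", "row"])[score_col].rank(
        method="first", ascending=False)
    df[out_col] = 1.0 / ranks
    return df
```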

mosaic-features[OPTIONS]

The mosaic-features command computes general features (number of characters and number of tokens) for a specified column.

The command takes the following parameters:

Example Command

$ tl mosaic-features -c kg_labels --num-char --num-tokens companies.csv > companies_mosaic.csv
$ cat companies_mosaic.csv

|column|row|label          |kg_id     |kg_labels         |method     |num_char|num_tokens        |
|------|---|---------------|----------|------------------|-----------|--------|------------------|
|1     |0  |Citigroup      |Q219508   |Citigroup         |fuzzy-augmented|9       |1                 |
|1     |0  |Citigroup      |Q781961   |One Court Square  |fuzzy-augmented|16      |3                 |
|1     |0  |Citigroup      |Q867663   |Citigroup Centre  |fuzzy-augmented|16      |2                 |
|1     |0  |Citigroup      |Q5122510  |Citigroup Global Markets Japan|fuzzy-augmented|30      |4                 |
|1     |0  |Citigroup      |Q54491    |Citigroup Centre  |fuzzy-augmented|16      |2                 |

string-similarity[OPTIONS]

The string-similarity command compares the cell values in two input columns and outputs a similarity score for each pair of participating strings in the output column.

The string-similarity command supports the following tokenizers; some of the string similarity methods require specifying one of them.

The string-similarity command supports the following string similarity algorithms, all of which are implemented using RLTK. The similarity methods are listed in alphabetical order.

In the future, more string similarity algorithms will be supported.

Options:

The string similarity scores are added to output columns. If specific columns (other than the default ["label_clean", "kg_labels"]) are given, the compared column names are included in the output column name, in the format <col_1>_<col_2>_<algorithm>. Otherwise the column name is just <algorithm>.

Examples:

# compute similarity score for the columns 'clean_labels' and 'kg_label', use Normalized Levenshtein, case sensitive comparison
$ tl string-similarity --method levenshtein < countries_candidates.csv

# compute similarity score for the columns 'doc_labels' and 'doc_aliases', use Jaccard similarity based on ngram=3 tokenizer, tf-idf score with word tokenizer and Needleman similarity, case insensitive comparison
$ tl string-similarity -c doc_labels,doc_aliases  --method jaccard:tokenizer=ngram:tokenizer_n=3 tfidf:tokenizer=word needleman countries_candidates.csv

File Example:

# compute string similarity between the columns 'clean_labels' and 'kg_labels', using case sensitive Normalized Levenshtein
# for the file countries_candidates.csv, exclude columns 'label','method' and 'retrieval_score' while printing
$ tl string-similarity -c clean_labels,kg_labels --method levenshtein < countries_candidates.csv > countries_ss_features.csv \
&& mlr --opprint cut -f label,method,retrieval_score -x countries_ss_features.csv

column row clean_labels kg_id     kg_labels                             clean_labels_kg_labels_LevenshteinSimilarity()
1      0   Budapest     Q1781     Budapest|Buda Pest|Buda-Pest|Buda     1
1      0   Budapest     Q16467392 Budapest (chanson)                    0.44
1      0   Budapest     Q55420238 Budapest|Budapest, a song             1
1      1   Prague       Q1085     Prague|Praha|Praha|Hlavní město Praha 1
1      1   Prague       Q1953283  Prague, Oklahoma                      0.375
1      1   Prague       Q2084234  Prague, Nebraska                      0.375
1      1   Prague       Q5969542  Prague                                1
1      2   London       Q84       London|London, UK|London, England     1
1      2   London       Q92561    London ON                             0.66

Implementation

For any input cell value, s and a candidate c, String similarity outputs a score computed as follows,

stringSimilarity(s, c) := max(similarityFunction(s, l)) ∀ l ∈ { labels(c) }
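
In code, with an arbitrary pairwise similarity function (a sketch of the max-over-labels rule):

```python
def string_similarity(s, labels, similarity_function):
    """Score a candidate by its best-matching label."""
    return max((similarity_function(s, l) for l in labels), default=0.0)

# e.g. against "Budapest|Buda Pest|Buda-Pest|Buda", with exact equality as the function
score = string_similarity("Budapest",
                          "Budapest|Buda Pest|Buda-Pest|Buda".split("|"),
                          lambda a, b: 1.0 if a == b else 0.0)  # -> 1.0
```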

smallest-qnode-number[OPTIONS]

The smallest-qnode-number command adds a new feature column named smallest_qnode_number where for each candidate set, the candidate with the smallest qnode number (numeric) receives 1 for this feature while other candidates receive 0.

Examples:

tl smallest-qnode-number input_file.csv > output_file.csv

File Example:

column row clean_labels kg_id     kg_labels                             smallest_qnode_number
1      0   Budapest     Q1781     Budapest|Buda Pest|Buda-Pest|Buda     1
1      0   Budapest     Q16467392 Budapest (chanson)                    0
1      0   Budapest     Q55420238 Budapest|Budapest, a song             0
1      1   Prague       Q1085     Prague|Praha|Praha|Hlavní město Praha 1
1      1   Prague       Q1953283  Prague, Oklahoma                      0
1      1   Prague       Q2084234  Prague, Nebraska                      0
1      1   Prague       Q5969542  Prague                                0
1      2   London       Q84       London|London, UK|London, England     1
1      2   London       Q92561    London ON                             0
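
A sketch of the rule with pandas, where the numeric part of the Q-node identifier decides (illustrative only):

```python
import pandas as pd

def smallest_qnode_number(df: pd.DataFrame) -> pd.DataFrame:
    qnum = df["kg_id"].str[1:].astype(int)  # drop the leading "Q": Q1781 -> 1781
    smallest = qnum.groupby([df["column"], df["row"]]).transform("min")
    df["smallest_qnode_number"] = (qnum == smallest).astype(int)
    return df
```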

merge-columns[OPTIONS]

The merge-columns command merges values from two or more columns and outputs the concatenated value in the output column.

Options:

# same as above, but remove duplicates
$ tl merge-columns -c doc_label,doc_aliases -o doc_label_aliases --remove-duplicates yes < doc_details.csv


File Example:

$ tl merge-columns -c kg_label,kg_aliases -o kg_label_aliases --remove-duplicates yes < countries_candidates_v2.csv

column row label     clean_labels kg_id     kg_label           kg_aliases                 kg_label_aliases
1      0   Buda’pest Budapest     Q1781     Budapest           Buda Pest|Buda-Pest|Buda   Budapest|Buda Pest|Buda-Pest|Buda
1      0   Buda’pest Budapest     Q16467392 Budapest (chanson) ""                         Budapest (chanson)
1      0   Buda’pest Budapest     Q55420238 Budapest           Budapest, a song           Budapest|Budapest, a song
1      1   Prague    Prague       Q1085     Prague|Praha       Praha|Hlavní město Praha   Prague|Praha|Hlavní město Praha
1      1   Prague    Prague       Q1953283  Prague, Oklahoma   ""                         Prague, Oklahoma
1      1   Prague    Prague       Q2084234  Prague, Nebraska   ""                         Prague, Nebraska
1      1   Prague    Prague       Q5969542  Prague             ""                         Prague
1      2   London!   London       Q84       London             London, UK|London, England London|London, UK|London, England
1      2   London!   London       Q92561    London ON          ""                         London ON


normalize-scores[OPTIONS]

The normalize-scores command normalizes the retrieval scores of all candidate knowledge graph objects, per retrieval method, for all input cells in a column. This command finds the maximum retrieval score among the candidates generated by a retrieval method, and then divides each candidate's retrieval score by that maximum, for each input column.

Note that the column containing the retrieval method names is method, added by the get-exact-matches command.

Options:

- -c a: column name which has the retrieval scores. Default is retrieval_score

- -o a: the output column name where the normalized scores will be stored. Default is the input column name with the suffix _normalized

- -t | --normalization-type: the type of normalization applied to the scores. The accepted types are

   - max_norm, which normalizes by dividing by the maximum value present in the column

   - zscore, which applies z-score normalization: (score(i) - mean(score)) / standard_deviation(score)

   By default the type is max_norm

- -w | --weights: a comma-separated string of the format <retrieval_method_1>:<weight_1>,<retrieval_method_2>:<weight_2>,... specifying the weight for each retrieval method. By default, all retrieval method weights are set to 1.0

Examples:

# compute normalized scores with default options
$ tl normalize-scores < countries_candidates.csv > countries_candidates_normalized.csv

# compute normalized scores for the column 'es_score', output in the column 'normalized_es_scores' with specified weights
$ tl normalize-scores -c es_score -o normalized_es_scores -t max_norm -w 'es_method_1:0.4,es_method_2:0.92' < countries_candidates.csv
```

**File Example:**

   # compute normalized score for the column 'retrieval_score', output in the column 'normalized_retrieval_scores' with specified weights
   $ tl normalize-scores -c retrieval_score -o normalized_retrieval_scores -w 'phrase-match:0.5' < countries_candidates.csv | mlr --opprint cut -f kg_label,kg_aliases -x

column row label     clean_labels kg_id     method       retrieval_score normalized_retrieval_scores
1      0   Buda’pest Budapest     Q1781     phrase-match 20.43           0.316155989
1      0   Buda’pest Budapest     Q16467392 phrase-match 12.33           0.190807799
1      0   Buda’pest Budapest     Q55420238 phrase-match 18.2            0.281646549
1      1   Prague    Prague       Q1085     phrase-match 15.39           0.23816156
1      1   Prague    Prague       Q1953283  phrase-match 14.44           0.223460229
1      1   Prague    Prague       Q2084234  phrase-match 13.99           0.216496441
1      1   Prague    Prague       Q5969542  phrase-match 9.8             0.151655834
1      2   London!   London       Q84       phrase-match 32.31           0.5
1      2   London!   London       Q92561    phrase-match 25.625          0.396549056

Implementation

For each retrieval method m and the candidate set C for a column,

maxRetrievalScore(m) := max(retrievalScore(C))

Then, for all candidates c, in the candidates set C, generated by retrieval method m,

normalizedRetrievalScore(c) := (retrievalScore(c) / maxRetrievalScore(m)) * weight(m)

Where weight(m) is specified by users, defaulting to 1.0
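
As a rough illustration, the following pandas sketch (not the actual implementation; the column names follow the examples above) reproduces this computation:

```python
import pandas as pd

def normalize_scores(df: pd.DataFrame,
                     score_column: str = "retrieval_score",
                     output_column: str = "retrieval_score_normalized",
                     weights: dict = None) -> pd.DataFrame:
    """Divide each score by the max score of its (column, method) group,
    then multiply by the user-specified weight for that method."""
    weights = weights or {}
    # maximum retrieval score per retrieval method, per input column
    max_scores = df.groupby(["column", "method"])[score_column].transform("max")
    method_weights = df["method"].map(lambda m: weights.get(m, 1.0))
    df[output_column] = (df[score_column] / max_scores) * method_weights
    return df

# e.g. normalize_scores(candidates, weights={"phrase-match": 0.5})
```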

<a name="command_score-using-embedding" />

### [`score-using-embedding`](#command_score-using-embedding)` [OPTIONS]`

The `score-using-embedding` command uses pre-computed embedding vectors to score (rank) candidates. The source of the pre-computed embeddings can be a TSV file or an Elasticsearch server.

If both a TSV file and an Elasticsearch server are provided, the TSV file is tried first, then the Elasticsearch server. Embedding vectors retrieved from the Elasticsearch server are appended to the TSV file.

Currently, there are two strategies for ranking:

Options:

<a name="command_feature-voting" />

### [`feature-voting`](#command_feature-voting)` [OPTIONS]`

The `feature-voting` command takes user-specified feature column names, tabulates the votes in each feature column, and adds a `votes` column to the output dataframe.

Example:

tl smallest-qnode-number input_file.csv / string-similarity -i --method monge_elkan:tokenizer=word -o monge_elkan / string-similarity -i --method jaccard:tokenizer=word -c description context -o des_cont_jaccard / feature-voting -c "pagerank,smallest_qnode_number,monge_elkan,des_cont_jaccard" > output_file.csv
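
A minimal sketch of the voting idea, assuming (this is an assumption, not the documented rule) that a candidate receives one vote per feature column in which it holds the top value within its (column, row) cell:

```python
import pandas as pd

def feature_voting(df: pd.DataFrame, feature_columns: list) -> pd.DataFrame:
    """Add a 'votes' column: one vote per feature column in which the
    candidate has the maximum value within its (column, row) cell."""
    df["votes"] = 0
    for feature in feature_columns:
        cell_max = df.groupby(["column", "row"])[feature].transform("max")
        df["votes"] += (df[feature] == cell_max).astype(int)
    return df
```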

Options:

## Ranking Candidate Commands

Ranking Candidate commands rank the candidates for each input cell. All Ranking Candidate commands take as input a file in Feature format and output a file in Ranking Score format.

<a name="command_combine-linearly" />

### [`combine-linearly`](#command_combine-linearly)` [OPTIONS]`

Linearly combines two or more score columns for candidate knowledge graph objects for each input cell value. Takes as input weights for the columns being combined to adjust their influence.

Options:

Examples:

# linearly combine the columns 'normalized_score' and 'clean_labels_kg_labels_lev' with respective weights of '1.5' and '2.0'
$ tl combine-linearly -w normalized_score:1.5,clean_labels_kg_labels_lev:2.0 -o ranking_score < countries_features.csv > countries_features_ranked.csv

File Examples:

# consider the features file, countries_features.csv (some columns might be missing for simplicity)
$ cat countries_features.csv

column row clean_labels kg_id     kg_labels                             clean_labels_kg_labels_lev normalized_score
1      0   Budapest     Q1781     Budapest|Buda Pest|Buda-Pest|Buda     1                          0.316155989
1      0   Budapest     Q16467392 Budapest (chanson)                    0.44                       0.190807799
1      0   Budapest     Q55420238 Budapest|Budapest, a song             1                          0.281646549
1      1   Prague       Q1085     Prague|Praha|Praha|Hlavní město Praha 1                          0.23816156
1      1   Prague       Q1953283  Prague, Oklahoma                      0.375                      0.223460229
1      1   Prague       Q2084234  Prague, Nebraska                      0.375                      0.216496441
1      1   Prague       Q5969542  Prague                                1                          0.151655834
1      2   London       Q84       London|London, UK|London, England     1                          0.5
1      2   London       Q92561    London ON                             0.66                       0.396549056

# linearly combine the columns 'normalized_score' and 'clean_labels_kg_labels_lev' with respective weights of '1.5' and '2.0'
$ tl combine-linearly -w normalized_score:1.5,clean_labels_kg_labels_lev:2.0 -o ranking_score < countries_features.csv > countries_features_ranked.csv
$ cat countries_features_ranked.csv

column row clean_labels kg_id     kg_labels                             clean_labels_kg_labels_lev normalized_score ranking_score
1      0   Budapest     Q1781     Budapest|Buda Pest|Buda-Pest|Buda     1                          0.316155989      2.474233984
1      0   Budapest     Q16467392 Budapest (chanson)                    0.44                       0.190807799      1.166211699
1      0   Budapest     Q55420238 Budapest|Budapest, a song             1                          0.281646549      2.422469824
1      1   Prague       Q1085     Prague|Praha|Praha|Hlavní město Praha 1                          0.23816156       2.35724234
1      1   Prague       Q1953283  Prague, Oklahoma                      0.375                      0.223460229      1.085190344
1      1   Prague       Q2084234  Prague, Nebraska                      0.375                      0.216496441      1.074744662
1      1   Prague       Q5969542  Prague                                1                          0.151655834      2.227483751
1      2   London       Q84       London|London, UK|London, England     1                          0.5              2.75
1      2   London       Q92561    London ON                             0.66                       0.396549056      1.914823584

Implementation

Multiply the values in the input score-columns with their corresponding weights and add them up to get a ranking score for each candidate.

For each candidate c and the set of score-columns S,

rankingScore(c) := ∑(value(s) * weight(s)) ∀ s ∈ S
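
A minimal pandas sketch of this computation (illustrative only):

```python
import pandas as pd

def combine_linearly(df: pd.DataFrame, weights: dict,
                     output_column: str = "ranking_score") -> pd.DataFrame:
    """rankingScore(c) = sum of value(s) * weight(s) over score-columns s."""
    df[output_column] = sum(df[col] * w for col, w in weights.items())
    return df

# e.g. combine_linearly(features,
#          {"normalized_score": 1.5, "clean_labels_kg_labels_lev": 2.0})
```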

<a name="command_predict-using-model" />

### [`predict-using-model`](#command_predict-using-model)` [OPTIONS]`

Uses a trained contrastive-loss neural network to score each of the candidates in the input file. The scores given by the model are used for the final ranking.

Options:

Example:

$ tl predict-using-model -o siamese_prediction \
--ranking_model epoch_3_loss_0.09958004206418991_top1_0.8912429378531074.pth \
--normalization_factor normalization_factor.pkl \
> model_prediction.csv

$ cat model_prediction.csv | head -n 10
|column|row|label      |kg_id    |kg_labels       |kg_descriptions                    |pagerank              |retrieval_score   |monge_elkan       |jaro_winkler      |levenshtein       |des_cont_jaccard|num_char          |num_tokens        |singleton|is_lof|lof-graph-embedding-score|lof-reciprocal-rank|lof_class_count_tf_idf_score|lof_property_count_tf_idf_score|siamese_prediction    |
|------|---|-----------|---------|----------------|-----------------------------------|----------------------|------------------|------------------|------------------|------------------|----------------|------------------|------------------|---------|------|-------------------------|-------------------|----------------------------|-------------------------------|----------------------|
|0     |0  |Virat Kohli|Q213854  |Virat Kohli     |Indian cricket player              |3.866466298654773e-07 |0.220935099736807 |1.0               |1.0               |1.0               |0.0             |0.0445344129554655|0.0526315789473684|1.0      |0.0   |0.8509996990991101       |0.3306451612903225 |0.9999999999999998          |0.799244652580565              |1.0                   |
|0     |0  |Virat Kohli|Q4792485 |Armaan Kohli    |Indian actor                       |8.030872767011763e-07 |0.175499656429328 |0.788888888888889 |0.7563131313131314|0.5833333333333333|0.0             |0.048582995951417 |0.0526315789473684|0.0      |0.0   |0.7711897128982282       |0.0872434017595308 |0.5442234316047087          |0.3725338919656602             |0.0007842204649932    |
|0     |0  |Virat Kohli|Q19843060|Rahul Kohli     |British actor                      |3.9787774655528806e-07|0.175499656429328 |0.8               |0.7348484848484849|0.5454545454545454|0.0             |0.0445344129554655|0.0526315789473684|0.0      |0.0   |0.639493810277712        |0.0173301304049416 |0.5442234316047087          |0.2965526540013362             |3.3086947951233014e-05|
|0     |0  |Virat Kohli|Q7686953 |Taruwar Kohli   |Indian cricketer                   |3.436024992699295e-07 |0.1771932406714975|0.6976190476190476|0.7997927997927997|0.6153846153846154|0.0             |0.0526315789473684|0.0526315789473684|0.0      |0.0   |0.8969803538708947       |1.0                |0.9999999999999998          |0.326228198225337              |1.56978567247279e-05  |
|0     |0  |Virat Kohli|Q19899153|Virat Singh     |Indian cricketer                   |3.436024992699295e-07 |0.1936199254596482|0.7333333333333333|0.865909090909091 |0.5454545454545454|0.0             |0.0445344129554655|0.0526315789473684|0.0      |0.0   |0.8317018964479719       |0.1967741935483871 |0.9999999999999998          |0.3708899236436936             |7.526116405642824e-06 |
|0     |1  |Tendulkar  |Q9488    |Sachin Tendulkar|Indian former cricketer            |1.1610014233298505e-06|0.2886305013881558|0.8564814814814814|0.3912037037037037|0.5625            |0.0             |0.0647773279352226|0.0526315789473684|0.0      |0.0   |0.8200359660591946       |0.4979838709677419 |0.9999999999999998          |0.8192843820684729             |0.9999990463256836    |
|0     |1  |Tendulkar  |Q22327439|Arjun Tendulkar |cricketer                          |4.474188254297618e-07 |0.2090650830325064|0.8435185185185186|0.2851851851851851|0.6               |0.0             |0.0607287449392712|0.0526315789473684|0.0      |0.0   |0.9052622170519932       |1.0                |0.9999999999999998          |0.2971309514678164             |7.058250776026398e-05 |
|0     |1  |Tendulkar  |Q7645792 |Suresh Tendulkar|Indian economist                   |4.935716273049558e-07 |0.2172269801922356|0.837962962962963 |0.2824074074074074|0.5625            |0.0             |0.0647773279352226|0.0526315789473684|0.0      |0.0   |0.7507507619140121       |0.2469758064516128 |0.5442234316047087          |0.1367737750251897             |7.929103048809338e-06 |
|0     |1  |Tendulkar  |Q3630378 |Priya Tendulkar |Marathi actress and social activist|5.024522125756764e-07 |0.2090650830325064|0.75              |0.3925925925925926|0.6               |0.0             |0.0607287449392712|0.0526315789473684|0.0      |0.0   |0.61134836412485         |0.0550284629981024 |0.5442234316047087          |0.1821786700772206             |5.9532030718401074e-06|
|0     |1  |Tendulkar  |Q55744   |Vijay Tendulkar |Indian writer                      |1.1221833978198972e-06|0.2110825609717666|0.75              |0.3925925925925926|0.6               |0.0             |0.0607287449392712|0.0526315789473684|0.0      |0.0   |0.6894453526957718       |0.0732009925558312 |0.5442234316047087          |0.1846962963661935             |5.565782885241788e-06 |

The model is trained on 14 features. So while predicting, the model expects the following 14 features:

## Commands on Ranking Score File

Ranking Score files have a column which ranks the candidates for each input cell.

The commands in this module take as input a Ranking Score file and output a file in KG Links format.

<a name="command_drop-by-score" />

### [`drop-by-score`](#command_drop-by-score)` [OPTIONS]`

The `drop-by-score` command outputs the top k candidates, by score, for each (column, row) pair of the input file. All other candidates are removed.

Options:

Examples:

# read the ranking score file test_file.csv and keep only the highest score on embed-score column
$ tl drop-by-score test_file.csv -c embed-score -k 1 > output_file.csv

# same example but with default options
$ tl drop-by-score test_file.csv -c embed-score > output_file.csv

File Example:

# the original ranking score file, test_file.csv
      column  row                          label      kg_id  retrieval_score_normalized
        2    0   Vladimir Vladimirovich PUTIN      Q7747                    0.999676
        2    0   Vladimir Vladimirovich PUTIN  Q12554172                    0.405809
        2    0   Vladimir Vladimirovich PUTIN   Q1498647                    0.466929
        2    0   Vladimir Vladimirovich PUTIN  Q17052997                    0.404006
        2    0   Vladimir Vladimirovich PUTIN  Q17195494                    0.500758
      ...  ...                            ...        ...                         ...
        2   40  Vasiliy Alekseyevich NEBENZYA  Q64456113                    0.287849
        2   40  Vasiliy Alekseyevich NEBENZYA  Q65043723                    0.319638
        2   40  Vasiliy Alekseyevich NEBENZYA   Q7916774                    0.316741
        2   40  Vasiliy Alekseyevich NEBENZYA   Q7916778                    0.316741
        2   40  Vasiliy Alekseyevich NEBENZYA   Q7972769                    0.262559

$ tl drop-by-score test_file.csv -c embed-score -k 1 > output_file.csv
# output result, note that only 1 candidate remains for each (column, row) pair
$ cat output_file.csv
    column  row                                label      kg_id  retrieval_score_normalized
        2    0         Vladimir Vladimirovich PUTIN      Q7747                    0.999676
        2    1        Dmitriy Anatolyevich MEDVEDEV     Q23530                    0.999676
        2    2           Anton Germanovich SILUANOV    Q589645                    0.999740
        2    3           Maksim Alekseyevich AKIMOV   Q2587075                    0.619504
        2    4              Yuriy Ivanovich BORISOV   Q4093892                    0.688664
        2    5    Konstatin Anatolyevich CHUYCHENKO   Q4517811                    0.455497
        2    6         Tatyana Alekseyevna GOLIKOVA    Q260432                    0.999676
        2    7               Olga Yuryevna GOLODETS   Q3350421                    0.999676
        2    8         Aleksey Vasilyevich GORDEYEV    Q478290                    1.000000
        2    9           Dmitriy Nikolayevich KOZAK    Q714330                    0.601561
        2   10           Vitaliy Leyontyevich MUTKO   Q1320362                    0.666055

Implementation

Group by column and row indices and pick the top k candidates for each input cell. The remaining rows are dropped and the result is output.
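
In pandas terms, a minimal sketch of this grouping looks like:

```python
import pandas as pd

def drop_by_score(df: pd.DataFrame, score_column: str, k: int = 1) -> pd.DataFrame:
    """Keep only the k highest-scoring candidates per (column, row) pair."""
    return (df.sort_values(score_column, ascending=False)
              .groupby(["column", "row"], sort=False)
              .head(k)                        # top k rows of each group
              .sort_values(["column", "row"]))
```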

<a name="command_drop-duplicate" />

### [`drop-duplicate`](#command_drop-duplicate)` [OPTIONS]`

The `drop-duplicate` command removes duplicate rows based on the given column information, keeping the row with the higher score in the specified score column. This command is typically used when multiple candidate-generation methods have been called; different methods may generate the same candidate multiple times, which may influence downstream processes.

Options:

Examples:

# read the ranking score file test_file.csv and keep the higher score on the `retrieval_score_normalized` column if duplicates are found on (column, row, kg_id) pairs.
$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized

File Example:

# the original ranking score file, test_file.csv
    column  row                                label      kg_id        method  retrieval_score_normalized
      2    0         Vladimir Vladimirovich PUTIN      Q7747   exact-match                    0.999676
      2    0         Vladimir Vladimirovich PUTIN      Q7747  phrase-match                    0.456942
      2    1        Dmitriy Anatolyevich MEDVEDEV     Q23530   exact-match                    0.999676
      2    2           Anton Germanovich SILUANOV    Q589645   exact-match                    0.999740
      2    3           Maksim Alekseyevich AKIMOV   Q2587075  phrase-match                    0.619504
      2    4              Yuriy Ivanovich BORISOV   Q4093892  phrase-match                    0.688664

$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized > output_file.csv
# output result, note that the duplicate row for the (column, row) pair (2,0) was removed and the one with the higher retrieval_score_normalized was kept.
$ cat output_file.csv
    column  row                                label      kg_id        method  retrieval_score_normalized
      2    0         Vladimir Vladimirovich PUTIN      Q7747   exact-match                    0.999676
      2    1        Dmitriy Anatolyevich MEDVEDEV     Q23530   exact-match                    0.999676
      2    2           Anton Germanovich SILUANOV    Q589645   exact-match                    0.999740
      2    3           Maksim Alekseyevich AKIMOV   Q2587075  phrase-match                    0.619504
      2    4              Yuriy Ivanovich BORISOV   Q4093892  phrase-match                    0.688664

$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized --keep-method phrase-match > output_file.csv
# output result, note that the duplicate row for the (column, row) pair (2,0) was removed; here we specify to keep phrase-match, so the exact-match candidate was removed.
$ cat output_file.csv
    column  row                                label      kg_id        method  retrieval_score_normalized
      2    0         Vladimir Vladimirovich PUTIN      Q7747  phrase-match                    0.456942
      2    1        Dmitriy Anatolyevich MEDVEDEV     Q23530   exact-match                    0.999676
      2    2           Anton Germanovich SILUANOV    Q589645   exact-match                    0.999740
      2    3           Maksim Alekseyevich AKIMOV   Q2587075  phrase-match                    0.619504
      2    4              Yuriy Ivanovich BORISOV   Q4093892  phrase-match                    0.688664

Implementation

Group by column, row, and the specified columns, and keep the candidate with the higher score.
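
A minimal pandas sketch of this deduplication (the --keep-method variant is not covered here):

```python
import pandas as pd

def drop_duplicate(df: pd.DataFrame, dedup_columns: list,
                   score_column: str) -> pd.DataFrame:
    """Among rows identical on (column, row, *dedup_columns), keep the
    one with the highest value in score_column."""
    keys = ["column", "row"] + dedup_columns
    return (df.sort_values(score_column, ascending=False)
              .drop_duplicates(subset=keys, keep="first")
              .sort_values(["column", "row"]))
```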

<a name="command_get-kg-links" />

### [`get-kg-links`](#command_get-kg-links)` [OPTIONS]`

The `get-kg-links` command outputs the top k candidates from a sorted list as linked knowledge graph objects for an input cell. The candidate with the highest score is ranked highest; ties are broken alphabetically.

Options:

Examples:

# read the ranking score file countries_features_ranked.csv and output top 2 candidates, use the column clean_labels for cleaned input cell labels
$ tl get-kg-links -c ranking_score -l clean_labels -k 2 countries_features_ranked.csv > countries_kg_links.csv

# same example but with default options
$ tl get-kg-links -c ranking_score < countries_features_ranked.csv > countries_output.csv

File Example:

# read the ranking score file countries_features_ranked.csv and output top 2 candidates; the column 'clean_labels' has the cleaned input labels
$ tl get-kg-links -c ranking_score -l clean_labels -k 2 countries_features_ranked.csv > countries_kg_links.csv
$ cat countries_kg_links.csv

column row label        kg_id           kg_labels         ranking_score
1      0   Budapest     Q1781|Q55420238 Budapest|Budapest 2.474233984|2.422469824
1      1   Prague       Q1085|Q5969542  Prague|Prague     2.35724234|2.227483751
1      2   London       Q84|Q92561      London|London ON  2.75|1.914823584

The following example shows the use of the `--k-rows` parameter

$ tl get-kg-links -c ranking_score -l clean_labels -k 2 --k-rows companies_features_ranked.csv > companies_kg_links.csv
$ cat companies_kg_links.csv

|column|row|label          |kg_id    |kg_labels            |ranking_score      |
|------|---|---------------|---------|---------------------|-------------------|
|1     |0  |Citigroup      |Q219508  |Citigroup            |0.9823348588687812 |
|1     |0  |Citigroup      |Q1023765 |CIT Group            |-0.3970555555555555|
|1     |1  |Bank of America|Q487907  |Bank of America      |0.9054747474747477 |
|1     |1  |Bank of America|Q50316068|Bank of America      |0.227679847085417  |

Implementation

Group by column and row indices and pick the top k candidates for each input cell to produce an output file in KG Links format.

Pick the preferred labels for candidate KG objects from the column kg_labels, which is added by the get-exact-matches command. In case of more than one preferred label for a candidate, the first label is picked.
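
A condensed pandas sketch of this procedure (illustrative; it assumes kg_id order as the alphabetical tie-breaker):

```python
import pandas as pd

def get_kg_links(df: pd.DataFrame, score_column: str,
                 label_column: str = "label", k: int = 1) -> pd.DataFrame:
    """Pick the top-k candidates per cell and join their ids, labels and
    scores with '|', producing one output row per (column, row)."""
    top_k = (df.sort_values(["column", "row", score_column, "kg_id"],
                            ascending=[True, True, False, True])
               .groupby(["column", "row"])
               .head(k)
               .copy())
    # keep only the first preferred label for each candidate
    top_k["kg_labels"] = top_k["kg_labels"].str.split("|").str[0]
    return (top_k.groupby(["column", "row"], as_index=False)
                 .agg({label_column: "first",
                       "kg_id": "|".join,
                       "kg_labels": "|".join,
                       score_column: lambda s: "|".join(map(str, s))}))
```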

<a name="command_join" />

### [`join`](#command_join)` [OPTIONS]`

The `join` command outputs the linked knowledge graph objects for an input cell. This command takes as input an Input file and a file in Ranking Score format, and outputs a file in Output format.

The candidate with the highest score is ranked highest, ties are broken alphabetically.

Options:

Examples:

# read the input file countries.csv and the ranking score file countries_features_ranked.csv and output top 2 candidates
$ tl join -f countries.csv --csv -c ranking_score countries_features_ranked.csv > countries_output.csv

# same example but with default options
$ tl join -f countries.csv --csv -c ranking_score < countries_features_ranked.csv > countries_output.csv

File Example:

# read the input file countries.csv and the ranking score file countries_features_ranked.csv and output top 2 candidates
$ tl join -f countries.csv --csv -c ranking_score countries_features_ranked.csv > countries_output.csv
$ cat countries_output.csv

country        capital_city phone_code capital_city_kg_id capital_city_kg_label capital_city_score
Hungary        Buda’pest    +49        Q1781|Q55420238    Budapest|Budapest     2.474233984|2.422469824
Czech Republic Prague       +420       Q1085|Q5969542     Prague|Prague         2.35724234|2.227483751
United Kingdom London!      +44        Q84|Q92561         London|London ON      2.75|1.914823584

Implementation

Join the input file and the ranking score file based on column and row indices to produce an output file. In case of more than one preferred label for a candidate, the first label is picked. In case k > 1, the corresponding values in each output column share the same index.

This command adds the following three columns to the input file to produce the output file: `<column>_kg_id`, `<column>_kg_label` and `<column>_score` (for example, `capital_city_kg_id`, `capital_city_kg_label` and `capital_city_score` above).
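
A minimal sketch of the join itself, assuming the Ranking Score file has already been reduced to KG Links format (one row per cell, as produced by get-kg-links):

```python
import pandas as pd

def join(input_df: pd.DataFrame, kg_links: pd.DataFrame,
         column_index: int, column_name: str) -> pd.DataFrame:
    """Attach <column_name>_kg_id / _kg_label / _score columns to the
    original input file, matching KG Links rows to input rows by index."""
    links = (kg_links[kg_links["column"] == column_index]
             .set_index("row")[["kg_id", "kg_labels", "ranking_score"]]
             .rename(columns={"kg_id": f"{column_name}_kg_id",
                              "kg_labels": f"{column_name}_kg_label",
                              "ranking_score": f"{column_name}_score"}))
    return input_df.join(links)

# e.g. join(countries, countries_kg_links, 1, "capital_city")
```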

## Evaluation Commands

Evaluation commands take as input a Ranking Score file and a Ground Truth file and output a file in the Evaluation File format. These commands help in calculating the precision and recall of the table linker (tl) pipeline.

<a name="command_ground-truth-labeler" />

### [`ground-truth-labeler`](#command_ground-truth-labeler)` [OPTIONS]`

The `ground-truth-labeler` command compares each candidate for the input cells with the ground truth value for that cell and adds an evaluation label.

Options:

File Examples:

# the ground truth file, countries_gt.csv
$ cat countries_gt.csv

column row kg_id
1      0   Q1781
1      2   Q84

# add evaluation label to the ranking score file countries_features_ranked.csv, having the column 'ranking_score', using the ground truth file countries_gt.csv
$ tl ground-truth-labeler -f countries_gt.csv < countries_features_ranked.csv > countries_evaluation.csv
$ cat countries_evaluation.csv

column row clean_labels kg_id     ranking_score evaluation_label GT_kg_id GT_kg_label
1      0   Budapest     Q1781     8.01848598     1               Q1781    Budapest
1      0   Budapest     Q16467392 4.152548805   -1               Q1781    Budapest
1      0   Budapest     Q55420238 7.65849315    -1               Q1781    Budapest
1      1   Prague       Q1085     7.00211745     0
1      1   Prague       Q1953283  4.19621823     0
1      1   Prague       Q2084234  4.029960225    0
1      1   Prague       Q5969542  5.81884368     0
1      2   London       Q84       9.02968554     1               Q84      London
1      2   London       Q92561    5.757565725   -1               Q84      London

Implementation

Join the ranking score file and the ground truth file based on column and row indices and add the following columns: `evaluation_label`, `GT_kg_id` and `GT_kg_label`.

The evaluation label value 1 means the candidate matches the knowledge graph object in the Ground Truth File. The value 0 means the cell is not present in the Ground Truth File. The value -1 means the cell is present in the Ground Truth File and the candidate is different from the corresponding knowledge graph object in the Ground Truth File.
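
A minimal pandas sketch of this labeling (the GT_kg_label lookup is omitted for brevity):

```python
import pandas as pd

def ground_truth_labeler(ranked: pd.DataFrame, gt: pd.DataFrame) -> pd.DataFrame:
    """Left-join the ground truth on (column, row), then label each
    candidate: 1 = match, -1 = mismatch, 0 = cell not in the GT file."""
    out = ranked.merge(gt.rename(columns={"kg_id": "GT_kg_id"}),
                       on=["column", "row"], how="left")
    out["evaluation_label"] = 0
    has_gt = out["GT_kg_id"].notna()
    matches = out.loc[has_gt, "kg_id"] == out.loc[has_gt, "GT_kg_id"]
    out.loc[has_gt, "evaluation_label"] = matches.map({True: 1, False: -1})
    return out
```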

## Utility Commands

<a name="command_add-color" />

### [`add-color`](#command_add-color)` [OPTIONS]`

The `add-color` command is a special command that can only run as the last step of a pipeline, or on its own, because the generated file is an xlsx file rather than a csv file. This command marks the top-k scores of the specified score columns for better visualization. It also supports a ground-truth mode, which can only run on a file produced by the add-text-embedding-feature command with the ground-truth column-vector-strategy. The rows for each candidate are then ordered descending by ground-truth score, except that the first row is always the ground truth candidate, regardless of whether it received the highest ground-truth cosine score.

Options:

# run color only
$ tl add-color ~/Desktop/test.csv -k 5 \
    -c retrieval_score_normalized evaluation_label \
    --output ~/Desktop/test_colored.xlsx


**File Example:**
```bash
# the output is the same as the input file if not sorted
   column  row  retrieval_score_normalized  evaluation_label  gt_embed_score_normalized
0       2    0                    0.999676                 1                   0.398855
1       2    0                    0.405809                -1                   0.403718
2       2    0                    0.466929                -1                   0.203675
3       2    0                    0.404006                -1                   0.255186
4       2    0                    0.500758                -1                   0.243571
5       2    0                    0.541115                -1                   0.675757
6       2    0                    0.417415                -1                   0.231361
7       2    0                    0.752838                -1                   0.540415
8       2    0                    0.413938                -1                   0.220305
9       2    0                    0.413938                -1                   0.228466
```

Implementation

Using pandas's xlsx writer, special formatting is added to selected cells.

<a name="command_plot-score-figure" />

### [`plot-score-figure`](#command_plot-score-figure)` [OPTIONS]`

The `plot-score-figure` command is a special command that can only run as the last step of a pipeline, or on its own, because the generated file is a png or html file rather than a csv file. This command evaluates the prediction results and generated scores of the table linker. It only supports plotting results produced after running ground-truth-labeler, as ground truth information is needed for evaluation. The first plot is a png image showing the top k accuracy of the specified score columns and the corresponding normalized scores. The second plot is an html page showing the scores of the specified columns on correct candidates and, if requested, on high-scoring wrong candidates. The page supports interactive operations such as hiding the scores of specific columns, zooming in, and more.

Options:

Examples:

# plot the figures for `test.csv` file on column `evaluation_label` and `retrieval_score_normalized`
# then save to desktop, also add the evaluation wrong candidates on second graph
$ tl plot-score-figure ~/Desktop/test.csv -k 1 2 5 \
-c retrieval_score_normalized evaluation_label \
--add-wrong-candidates retrieval_score_normalized \
--output ~/Desktop/output_figure

# run default plot
$ tl plot-score-figure ~/Desktop/test.csv -k 1 2 5 \
-c retrieval_score_normalized evaluation_label \
--output ~/Desktop/output_figure

File Example: The output is not a table, please refer to here (access needed).

Implementation

The output figures are plotted using the Python packages seaborn and pyecharts.

<a name="command_run-pipeline" />

### [`run-pipeline`](#command_run-pipeline)` [OPTIONS]`

The `run-pipeline` command is a batch command that enables users to run the same pipeline on a batch of files automatically, and then evaluate the results when possible.

Options:

Examples:

# run a pipeline on all files starting with `v15_68` and ending with `.csv` in the folder `iswc_challenge_data/round4/canonical/`
# clean -> get exact-match candidates -> normalize scores -> get phrase-matches -> normalize scores -> add ground truth -> get embedding scores
# Output with the tag gt-embed, score the output based on the column `embed-score`, and run 4 processes in parallel. Also, turn on debug mode.
$ tl run-pipeline \
  --tag gt-embed \
  --gpu-resources 1 \
  --parallel-count 4 \
  --score-column embed-score \
  --debug \
  --ground-truth-directory iswc_challenge_data/round4/gt \
  --ground-truth-file-pattern {}.csv \
  --pipeline 'clean -c label / get-exact-matches -c label_clean / normalize-scores -c retrieval_score \
    / get-phrase-matches -c label_clean -n 5 --filter "retrieval_score_normalized > 0.9" / normalize-scores -c retrieval_score \
    / ground-truth-labeler -f iswc_challenge_data/round4/gt/{}.csv \
    / add-text-embedding-feature --column-vector-strategy ground-truth -n 0 --run-TSNE True \
    --distance-function cosine -o embed-score' \
    iswc_challenge_data/round4/canonical/v15_68*.csv

File Example: The output will be a csv that looks like:

tag      file        precision   recall      f1
gt-embed v15_685.csv 0.473684211 0.473684211 0.473684211
gt-embed v15_686.csv 0.115384615 0.115384615 0.115384615

Implementation

This command uses Python's subprocess module to invoke the shell and execute the corresponding commands.

<a name="command_tee" />

### [`tee`](#command_tee)` [OPTIONS]`

The `tee` command saves the input to disk and echoes the input to the standard output without modification. The command can be put anywhere in a pipeline to save the input to a file. This command is a wrapper around the Linux command tee.

Options:

Examples:

# After performing the expensive operations to get candidates and compute embeddings, save the file to disk and continue the pipeline.
$ tl clean \
    / get-exact-matches -c label \
    / ground-truth-labeler -f "./xxx_gt.csv" \
    / add-text-embedding-feature --column-vector-strategy ground-truth -n 3 \
      --generate-projector-file xxx-google-projector -o embed \
    / tee --output xxx-features.csv \
    / normalize-scores \
    / metrics

<a name="command_vote-by-classifier" />

### [`vote-by-classifier`](#command_vote-by-classifier)` [OPTIONS]`

The `vote-by-classifier` command computes the prediction result of the specified voting classifier on input tables with the following features:

Options:

Examples:

# compute the voting classifier's predictions for candidates.csv using the model weighted_lr.pkl
$ tl vote-by-classifier candidates.csv \
--prob-threshold 0.995 \
--model weighted_lr.pkl \
> voted_candidates.csv

File Examples:

|column|row|label          |kg_id     |...|method            |aligned_pagerank|vote_by_classifier|
|------|---|---------------|----------|---|------------------|----------------|------------------|
|1     |0  |Citigroup      |Q219508   |...|exact-match       |3.988134e-09    |0                 |
|1     |1  |Bank of America|Q487907   |...|exact-match       |5.115590e-09    |1                 |
|1     |1  |Bank of America|Q50316068 |...|exact-match       |5.235995e-09    |1                 |
|1     |10 |BP             |Q100151423|...|fuzzy-augmented   |0.000000e+00    |0                 |
|1     |10 |BP             |Q131755   |...|fuzzy-augmented   |0.000000e+00    |1                 |

<a name="command_pgt-semantic-tf-idf" />

### [`pgt-semantic-tf-idf`](#command_pgt-semantic-tf-idf)` [OPTIONS]`

The `pgt-semantic-tf-idf` command adds two feature columns by computing a tf-idf like score based on high confidence candidates for an input column.

This command follows the procedure below:

Step 1: Get the set of high confidence candidates. High confidence candidates, for each cell, are defined as candidates which have the method exact-match, or which have the highest pagerank * retrieval_score with the method fuzzy-augmented.

Step 2: For each of the high confidence candidates, get the class-count data. This data is stored in an Elasticsearch index and is gathered during the candidate generation step.

The data consists of q-node:count pairs, where the q-node represents a class and the count is the number of instances below the class. These counts use a generalized version of is-a where occupation and position held are also considered is-a, e.g., Schwarzenegger is an actor.

Similarly, another dataset consists of p-node:count pairs, where the p-node represents a property the candidate qnode has and the count is the total number of qnodes in the corpus which have this property.

Step 3: Make a set of all the classes that appear for the high confidence candidates, and count the number of times each class occurs. For example, if two high confidence candidates are human, then Q5 will have num-occurrences = 2.

Step 4: Convert the instance counts for the set constructed in step 3 to IDF (see https://en.wikipedia.org/wiki/Tf–idf), and then multiply the IDF score of each class by the num-occurrences number from step 3. Then, normalize them so that all the IDF scores for the high confidence candidates sum to 1.0.

Step 5: For each candidate in a cell, compute x_i := 1 / (number of times the class appears across all candidates in the cell).

Step 6: For each column, compute alpha_j := Σ x_i / (number of rows in the input column).

Step 7: For each candidate, including high confidence candidates, compute the tf-idf score by summing the product of the IDF score (computed in Step 4), alpha_j, and x_i over all the classes for that candidate. If the class appears in the high confidence classes, the class IDF is multiplied by 1, otherwise by 0.
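
A condensed sketch of steps 3 and 4 (hypothetical helper; `class_counts` is assumed to map each candidate qnode to its {class: instance_count} data from Elasticsearch, and `corpus_size` to the total number of qnodes):

```python
import math
from collections import Counter

def class_idf_scores(hc_candidates: list, class_counts: dict,
                     corpus_size: int) -> dict:
    """Steps 3-4: tally how often each class occurs among the high
    confidence candidates, convert instance counts to IDF, multiply by
    the occurrence counts, and normalize the scores to sum to 1.0."""
    occurrences = Counter()   # num-occurrences per class (step 3)
    instances = {}            # number of instances below each class
    for qnode in hc_candidates:
        for cls, count in class_counts.get(qnode, {}).items():
            occurrences[cls] += 1
            instances[cls] = count
    raw = {cls: n * math.log(corpus_size / instances[cls])
           for cls, n in occurrences.items()}
    total = sum(raw.values()) or 1.0
    return {cls: score / total for cls, score in raw.items()}
```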

Options:

Examples:

$ tl pgt-semantic-tf-idf --feature-file class_count.tsv \
     --feature-name class_count \
     --pagerank-column pagerank \
     --retrieval-score-column retrieval_score \
     -o class_count_tf_idf_score \
     candidates.csv

File Example:

$ tl pgt-semantic-tf-idf --feature-file class_count.tsv \
     --feature-name class_count \
     --pagerank-column pagerank \
     --retrieval-score-column retrieval_score \
     -o class_count_tf_idf_score \
     companies_candidates.csv > candidates_pgt.csv

$ head companies_candidates.csv
column row label label_clean kg_id kg_labels method pagerank retrieval_score
1 0 Citigroup Citigroup Q219508 Citigroup exact-match 6.80E-08 16.441374
1 0 Citigroup Citigroup Q219508 Citigroup fuzzy-augmented 6.80E-08 19.778538
1 0 Citigroup Citigroup Q391243 Citigroup Center fuzzy-augmented 7.26E-09 17.1681
1 0 Citigroup Citigroup Q781961 One Court Square fuzzy-augmented 4.77E-09 17.106327
1 0 Citigroup Citigroup Q2425550 Citigroup Center fuzzy-augmented 2.84E-09 16.928787
1 0 Citigroup Citigroup Q54491 Citigroup Centre (Sydney)|Citigroup Centre fuzzy-augmented 2.84E-09 16.907793
1 0 Citigroup Citigroup Q5122510 Citigroup Global Markets Japan fuzzy-augmented 5.68E-09 16.720482
1 0 Citigroup Citigroup Q867663 Citigroup Centre fuzzy-augmented 4.19E-09 16.564976
1 0 Citigroup Citigroup Q5122507 Citigroup Tower fuzzy-augmented 2.84E-09 16.550804
$ head candidates_pgt.csv
column row label label_clean kg_id pagerank retrieval_score pgr_rts smc_score kg_labels method hc_candidate top5_smc_score
1 0 Citigroup Citigroup Q219508 6.80E-08 16.441374 1.12E-06 0.261552749 Citigroup exact-match 1 Q6881511:0.009|Q12047392:0.009|Q362482:0.009|Q679206:0.009|Q155076:0.009
1 0 Citigroup Citigroup Q219508 6.80E-08 19.778538 1.34E-06 0 Citigroup fuzzy-augmented 0
1 0 Citigroup Citigroup Q391243 7.26E-09 17.1681 1.25E-07 0 Citigroup Center fuzzy-augmented 0
1 0 Citigroup Citigroup Q781961 4.77E-09 17.106327 8.16E-08 0 One Court Square fuzzy-augmented 0
1 0 Citigroup Citigroup Q2425550 2.84E-09 16.928787 4.81E-08 0 Citigroup Center fuzzy-augmented 0
1 0 Citigroup Citigroup Q54491 2.84E-09 16.907793 4.80E-08 0 Citigroup Centre (Sydney)|Citigroup Centre fuzzy-augmented 0
1 0 Citigroup Citigroup Q5122510 5.68E-09 16.720482 9.50E-08 0 Citigroup Global Markets Japan fuzzy-augmented 0
1 0 Citigroup Citigroup Q867663 4.19E-09 16.564976 6.94E-08 0 Citigroup Centre|25 Canada Square|Citigroup Centre (Londra)|Citigroup Centre (Londres) fuzzy-augmented 0
1 0 Citigroup Citigroup Q5122507 2.84E-09 16.550804 4.70E-08 0 Citigroup Tower fuzzy-augmented 0

<a name="command_pick-hc-candidates" />

### [`pick-hc-candidates`](#command_pick-hc-candidates)` [OPTIONS]`

The `pick-hc-candidates` command picks high confidence candidates based on the algorithm described below.

SMC number of cells

The desired number of cells that the SMC algorithm will consider.

Select potential candidates

Definition: best string similarity (best_sim) := the maximum of the string similarities for each cell.

Definition: equal_sim(candidate) := the number of candidates for a cell that share the same best_sim value as the candidate.

Example: if a cell has candidates c1, c2, c3 and all have the same best_sim(ci) = 1.0, then equal_sim(ci) = 3, i ∈ {1,2,3}, because there are 3 candidates with best_sim = 1.0.
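
A minimal pandas sketch of these two definitions, matching the best_str_similarity and equal_sim columns in the file example below:

```python
import pandas as pd

def add_similarity_features(df: pd.DataFrame, sim_columns: list) -> pd.DataFrame:
    """best_str_similarity: max of the string-similarity columns per row;
    equal_sim: number of candidates in the cell sharing the cell's best score."""
    df["best_str_similarity"] = df[sim_columns].max(axis=1)
    cell_best = df.groupby(["column", "row"])["best_str_similarity"].transform("max")
    is_best = df["best_str_similarity"] == cell_best
    df["equal_sim"] = is_best.groupby([df["column"], df["row"]]).transform("sum")
    return df

# e.g. add_similarity_features(candidates,
#          ["monge_elkan", "monge_elkan_aliases", "jaro_winkler", "levenshtein"])
```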

Options:

Examples:

$ tl pick-hc-candidates -s monge_elkan,monge_elkan_aliases,jaro_winkler,levenshtein \
    -o ignore_candidate \
    --maximum-cells 50 \
    --minimum-cells 8 \
    --desired-cell-factor 0.3 \
    --string-similarity-threshold 0.95 \
    --string-similarity-threshold-2 0.87 \
    --filter-above mean \
    music_singles_candidates.csv > music_singles_hc_candidates.csv

File Example:

$ head music_singles_hc_candidates.csv
column row label label_clean kg_id kg_labels pagerank retrieval_score monge_elkan monge_elkan_aliases jaro_winkler levenshtein best_str_similarity equal_sim ignore_candidate
0 11 The King of Pop The King of Pop Q2831 Michael Jackson 5.65E-07 16.442932 0.174107143 1 0.488888889 0.066666667 1 2 0
0 11 The King of Pop The King of Pop Q2831 Michael Jackson 5.65E-07 25.133442 0.174107143 1 0.488888889 0.066666667 1 2 0
0 5 The Fab Four The Fab Four Q7732974 The Fab Four 2.84E-09 16.438158 1 0 1 1 1 2 0
0 5 The Fab Four The Fab Four Q7732974 The Fab Four 2.84E-09 30.132828 1 0 1 1 1 2 0
0 3 The King of the Blues The King of the Blues Q60786631 The King of the Blues 2.84E-09 16.437386 1 0 1 1 1 3 0
0 3 The King of the Blues The King of the Blues Q60786631 The King of the Blues 2.84E-09 27.152079 1 0 1 1 1 3 0
0 3 The King of the Blues The King of the Blues Q3197058 King of the Blues 2.84E-09 26.890848 1 0 0.799253035 0.80952381 1 3 1
0 12 Slowhand Slowhand Q48187 Eric Clapton 2.23E-07 16.442932 0.455357143 1 0.402777778 0.083333333 1 4 0
0 12 Slowhand Slowhand Q549602 Slowhand 3.82E-09 15.8904705 1 0 1 1 1 4 1

<a name="command_kth-percentile" />

### [`kth-percentile`](#command_kth-percentile)` [OPTIONS]`

The `kth-percentile` command computes the kth percentile for a given column and marks the rows above the kth percentile with 1. In addition, if the option --ignore-column is specified, the kth percentile is computed using only the rows marked as ignore = 0.
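
A minimal sketch of the computation, assuming --k-percentile 0.05 marks the top 5% of scores (the exact threshold convention is an assumption):

```python
import pandas as pd

def kth_percentile(df: pd.DataFrame, score_column: str,
                   output_column: str = "kth_percenter",
                   k_percentile: float = 0.05,
                   ignore_column: str = None) -> pd.DataFrame:
    """Mark rows whose score is at or above the (1 - k) quantile with 1."""
    considered = df if ignore_column is None else df[df[ignore_column] == 0]
    threshold = considered[score_column].quantile(1 - k_percentile)
    df[output_column] = (df[score_column] >= threshold).astype(int)
    return df
```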

Options:

Examples:

$ tl kth-percentile -c retrieval_score \
    -o kth_percenter \
    --k-percentile 0.05 \
    music_singles_candidates.csv > music_singles_hc_candidates.csv
$ tl kth-percentile -c retrieval_score \
    -o kth_percenter \
    --k-percentile 0.05 \
    --ignore-column ignore_candidate \
    music_singles_candidates.csv > music_singles_hc_candidates.csv

File Example:

$ tl kth-percentile -c retrieval_score \
    -o kth_percenter \
    --k-percentile 0.05 \
    music_singles_candidates.csv > music_singles_hc_candidates.csv

$ head music_singles_hc_candidates.csv
column row label label_clean kg_id kg_labels retrieval_score kth_percenter
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll 0 0
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q303 Elvis Presley 45.493786 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q7744584 The King of Rock \n\ Roll 35.69553 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q4319576 Nick Rock\n\Roll 35.426224 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q3437721 Rock \n\ Roll Is King 34.43733 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q4051146 The King of Rock and Roll 32.125546 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q7728602 The Daddy of Rock \n\ Roll 31.09376 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q7761239 The Rock \n\ Roll Express 30.621666 1
0 0 The King of Rock 'n' Roll The King of Rock 'n' Roll Q941904 Rock ’n’ Roll|Rock \n\ Roll 30.403358 1