usc-isi-i2 / kgtk

Knowledge Graph Toolkit
https://kgtk.readthedocs.io/en/latest/
MIT License

Incorrect apostrophe handling in label returned from endpoint #378

Open aphedges opened 3 years ago

aphedges commented 3 years ago

Describe the bug: When searching the KGTK API endpoint for milk in Wikidata, the entry Q10988133 ("cow's milk") returns the label "cow\\s milk" instead.

To Reproduce Steps to reproduce the behavior:

  1. Go to https://kgtk.isi.edu/api/Q10988133?extra_info=true&language=en.
  2. See error.

Expected behavior The label should be "cow's milk", as it appears on the site.

Additional context This bug seems to be independent of what is calling the API. I noticed it using Requests and was able to reproduce it with both curl and Firefox.

saggu commented 3 years ago

This is coming from our labels file

Q10988133-label-en-gb   Q10988133       label   'cow\'s milk'@en-gb     en-gb

I'll check with Craig on how to handle this in a generic way

saggu commented 3 years ago

@CraigMiloRogers how am I supposed to use the unstringify or destringify functions?

Here is what I tried:

from kgtk.kgtkformat import KgtkFormat
kf = KgtkFormat()
x = "Q10988133-label-en\tQ10988133\tlabel\t'cow\'s milk'@en".split('\t')[3]
print(x)
print(kf.unstringify(x))

Error:

Traceback (most recent call last):
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-2197eaee7454>", line 1, in <module>
    runfile('/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py', wdir='/Users/amandeep/Github/kgtk/kgtk/utils')
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py", line 906, in <module>
    print(kf.unstringify(x))
  File "/Users/amandeep/Github/kgtk/kgtk/kgtkformat.py", line 117, in unstringify
    return ast.literal_eval(s)
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 46, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    'cow's milk'
         ^
SyntaxError: invalid syntax

print(kf.destringify(x))

Error:

Traceback (most recent call last):
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-2197eaee7454>", line 1, in <module>
    runfile('/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py', wdir='/Users/amandeep/Github/kgtk/kgtk/utils')
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py", line 907, in <module>
    print(kf.destringify(x))
  File "/Users/amandeep/Github/kgtk/kgtk/kgtkformat.py", line 136, in destringify
    return (ast.literal_eval(s), language, language_suffix)
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 46, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    'cow's milk'
         ^
SyntaxError: invalid syntax

CraigMiloRogers commented 3 years ago

I get Error 404 when I try to visit https://kgtk.isi.edu/api/Q10988133?extra_info=true&language=en.

CraigMiloRogers commented 3 years ago

The labels file line passes validation:

kgtk cat -i issue-378.tsv
id      node1   label   node2   lang
Q10988133-label-en-gb   Q10988133       label   'cow\'s milk'@en-gb     en-gb
117% kgtk validate -i issue-378.tsv -v --allow-language-suffixes

====================================================
Validating 'issue-378.tsv'
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file issue-378.tsv
header: id      node1   label   node2   lang
input format: kgtk
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Validated 1 data lines

====================================================
Data lines read: 1
Data lines passed: 1
117% 

CraigMiloRogers commented 3 years ago

In the code example, the backslash before the embedded quote is consumed by Python as an escape for just that quote, even though the outer quote for the string is " and the inner quote is '. You need an escaped backslash in the string constant to stand for the KGTK escape backslash. Thus, cow\\\'s milk.

from kgtk.kgtkformat import KgtkFormat
kf = KgtkFormat()
x = "Q10988133-label-en\tQ10988133\tlabel\t'cow\\\'s milk'@en".split('\t')[3]
print(x)
print(kf.unstringify(x))

CraigMiloRogers commented 3 years ago

122% python3 issue-378.py
'cow\'s milk'@en
cow's milk
123%
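The escaping point above can be checked without KGTK installed. The following is a minimal sketch that relies only on `ast.literal_eval`, which the tracebacks show `unstringify` calling; the helper name `try_unquote` is made up for illustration and is not part of KGTK.

```python
import ast

# In a regular Python string literal, \' is an escape for a plain quote,
# so the backslash never reaches the KGTK value.
lost = "'cow\'s milk'"        # actual characters: 'cow's milk'
# Doubling the backslash (\\) keeps one literal backslash in the value.
kept = "'cow\\\'s milk'"      # actual characters: 'cow\'s milk'
# A raw string is an equivalent spelling of the same characters.
raw = r"'cow\'s milk'"


def try_unquote(value):
    """Hypothetical helper: return the unquoted text, or None on a syntax error."""
    try:
        return ast.literal_eval(value)
    except SyntaxError:
        return None


print(try_unquote(lost))  # None
print(try_unquote(kept))  # cow's milk
print(kept == raw)        # True
```

Writing the test value as a raw string avoids having to count backslashes at all.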

aphedges commented 2 years ago

Two questions:

  1. What is the status of this bug? It's still affecting us.
  2. Is the problem always an apostrophe (') being replaced by two backslashes (\\)? If so, we can at least work around it until the fix is made in KGTK.

saggu commented 2 years ago

@aphedges this got put on the back burner. No update yet. I'm also not sure whether this only happens for apostrophes. Have you seen it anywhere else?

aphedges commented 2 years ago

@saggu, thanks for letting me know. In the ~20 instances I found in our data, it was always apostrophes, but I didn't check all of DWD to verify that it is the only case. Here are some more examples I encountered:

Also, I just realized that the \\ is from JSON escaping the \. It's actually just a single \ in the text in place of an apostrophe.
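A quick way to confirm this is to decode the JSON and count the backslashes in the resulting string. This is a small sketch using only the standard library; the `body` value is an abbreviated copy of the response reported in this thread, not a live request.

```python
import json

# Response body as it appears on the wire (raw string, so \\ is two characters).
body = r'[{"qnode": "Q10988133", "label": ["cow\\s milk"]}]'

label = json.loads(body)[0]["label"][0]
print(label)              # cow\s milk
print(label.count("\\"))  # 1
```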

aphedges commented 2 years ago

I can still reproduce the apostrophe issue, and I discovered that it applies to quotation marks and backslashes as well:

$ curl 'https://dwd.isi.edu/api?q=Q10988133&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q10988133", "description": ["milk produced by female cattle"], "label": ["cow\\s milk"], "alias": ["cow milk", "whole milk", "milk"], "pagerank": 1.021701595451758e-06, "statements": 223, "score": 0.20568907}]
$ curl 'https://dwd.isi.edu/api?q=Q18336849&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q18336849", "description": ["(used to define the domain of the \\given name\\ property)"], "label": ["entity whose item has the given name property"], "alias": ["items with given name property"], "pagerank": 2.7525000872635554e-08, "statements": 106, "score": 0.0055411328}]
$ curl 'https://dwd.isi.edu/api?q=Q11185&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q11185", "description": ["typographical mark (glyph) used mainly in computing"], "label": ["backslash"], "alias": ["reverse slant", "backslant", "bash", "reversed virgule", "reverse slash", "slosh", "\ud83d\ude7d", "escape", "\\\\", "whack", "backwhack", "hack"], "pagerank": 1.7101026535268105e-08, "statements": 155, "score": 0.0034427806}]

Q18336849 ("entity whose item has the given name property") contains quotation marks (") in its Wikidata description, which are also replaced with \ in the API output. This means that I cannot accurately recover the original because a \ in the API output might have been either ' or " in the original.
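To make the ambiguity concrete, here is a sketch of the naive client-side workaround, which assumes every backslash was an apostrophe. The function name `guess_original` is hypothetical, and the inputs are the decoded strings from the responses above.

```python
def guess_original(text):
    """Best-effort workaround: assume every backslash stood for an apostrophe."""
    return text.replace("\\", "'")


label = "cow\\s milk"
description = "(used to define the domain of the \\given name\\ property)"

print(guess_original(label))        # cow's milk
print(guess_original(description))  # (used to define the domain of the 'given name' property)
```

The label comes back correctly, but the description is restored with apostrophes where the original had quotation marks, so the workaround is only safe for fields known to contain apostrophes.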

Q11185 ("backslash") contains a backslash \ in its aliases that is returned as \\ by the API and is represented as \\\\ in the JSON. It's presumably escaped once in KGTK's processing and another time by conversion to JSON.
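The two layers can be reproduced in isolation. In this sketch the first step is an assumption about how KGTK escapes a backslash (doubling it); only the second step, `json.dumps`, is known behaviour.

```python
import json

original = "\\"                                # the alias: one backslash character
kgtk_escaped = original.replace("\\", "\\\\")  # assumed KGTK-style escaping: two characters
wire = json.dumps(kgtk_escaped)                # JSON escapes each backslash again

print(len(original), len(kgtk_escaped))  # 1 2
print(wire)                              # "\\\\"
print(json.loads(wire) == kgtk_escaped)  # True
```

Decoding the JSON removes only the outer layer, which matches the \\ seen in the decoded alias.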