Open aphedges opened 3 years ago
This is coming from our labels file
Q10988133-label-en-gb Q10988133 label 'cow\'s milk'@en-gb en-gb
I'll check with Craig on how to handle this in a generic way
@CraigMiloRogers how am I supposed to use the unstringify or destringify functions?
Here is what I tried:
from kgtk.kgtkformat import KgtkFormat
kf = KgtkFormat()
x = "Q10988133-label-en\tQ10988133\tlabel\t'cow\'s milk'@en".split('\t')[3]
print(x)
print(kf.unstringify(x))
Error:
Traceback (most recent call last):
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-2197eaee7454>", line 1, in <module>
runfile('/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py', wdir='/Users/amandeep/Github/kgtk/kgtk/utils')
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py", line 906, in <module>
print(kf.unstringify(x))
File "/Users/amandeep/Github/kgtk/kgtk/kgtkformat.py", line 117, in unstringify
return ast.literal_eval(s)
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 46, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
'cow's milk'
^
SyntaxError: invalid syntax
print(kf.destringify(x))
Error:
Traceback (most recent call last):
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-2197eaee7454>", line 1, in <module>
runfile('/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py', wdir='/Users/amandeep/Github/kgtk/kgtk/utils')
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/amandeep/Github/kgtk/kgtk/utils/elasticsearch_manager.py", line 907, in <module>
print(kf.destringify(x))
File "/Users/amandeep/Github/kgtk/kgtk/kgtkformat.py", line 136, in destringify
return (ast.literal_eval(s), language, language_suffix)
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 46, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/Users/amandeep/anaconda3/envs/kgtk-env/lib/python3.7/ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
'cow's milk'
^
SyntaxError: invalid syntax
I get Error 404 when I try to visit https://kgtk.isi.edu/api/Q10988133?extra_info=true&language=en.
The labels file line passes validation:
kgtk cat -i issue-378.tsv
id node1 label node2 lang
Q10988133-label-en-gb Q10988133 label 'cow\'s milk'@en-gb en-gb
117% kgtk validate -i issue-378.tsv -v --allow-language-suffixes
====================================================
Validating 'issue-378.tsv'
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file issue-378.tsv
header: id node1 label node2 lang
input format: kgtk
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Validated 1 data lines
====================================================
Data lines read: 1
Data lines passed: 1
117%
In the code example, the backslash before the embedded quote escapes just the embedded quote, even though the outer quote for the string is " and the inner quote is '. You need an escaped backslash in the string constant to stand for the KGTK escape backslash. Thus, cow\\\'s milk.
from kgtk.kgtkformat import KgtkFormat
kf = KgtkFormat()
x = "Q10988133-label-en\tQ10988133\tlabel\t'cow\\\'s milk'@en".split('\t')[3]
print(x)
print(kf.unstringify(x))
122% python3 issue-378.py
'cow\'s milk'@en
cow's milk
123%
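The difference between the two string constants can be checked without KGTK at all. The following is a minimal sketch using only the standard library: it calls ast.literal_eval directly, since that is the call shown at the bottom of both tracebacks, and it leaves out the @en language suffix that destringify splits off. The helper name parses is made up for this illustration.

```python
import ast

# What the first attempt actually contained: Python turns \' into a bare ',
# so the KGTK escape backslash never reaches the string.
broken = "'cow\'s milk'"
# Escaping the backslash itself (\\) keeps it; \' still yields the quote.
fixed = "'cow\\\'s milk'"
# A raw string is an alternative spelling of the same value.
raw = r"'cow\'s milk'"


def parses(literal: str) -> bool:
    """Return True if ast.literal_eval accepts the text as a literal."""
    try:
        ast.literal_eval(literal)
        return True
    except (SyntaxError, ValueError):
        return False


print(repr(broken), parses(broken))  # the unescaped quote ends the literal early
print(repr(fixed), parses(fixed))    # the backslash survives, so it parses
print(fixed == raw)
print(ast.literal_eval(fixed))
```

When the line is read from a TSV file rather than typed as a Python constant, no extra escaping is needed, because the backslash is already a literal character in the file.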
Two questions:
') being replaced by two backslashes (\\)? If so, we can at least use it before the fix is made in KGTK.
@aphedges this got put on back burner. No update yet. Also not sure if this only happens for apostrophe, have you seen this anywhere else?
@saggu, thanks for letting me know. In the ~20 instances I found it in our data, it was always apostrophes, but I didn't check all of DWD to verify it is the only case. Here are some more examples I encountered:
description field of P3828: clothing or accessory worn on subject\s body
label field of Q45382: coup d\état
description field of Q707125: one person\s statement that they intend to harm another, or another\s property
Also, I just realized that the \\ is from JSON escaping the \. It's actually just a single \ in the text in place of an apostrophe.
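A quick way to confirm this is to decode the label with a JSON parser instead of reading the raw response text. This is a small sketch with the standard json module; the body string is a trimmed-down fragment of the API response for Q10988133 reported in this issue, not the full payload.

```python
import json

# Trimmed fragment of the API response; the raw string keeps both
# backslash characters exactly as they appear on the wire.
body = r'{"label": ["cow\\s milk"]}'

label = json.loads(body)["label"][0]

print(label)              # the decoded label, as a client would see it
print(label.count("\\"))  # number of backslash characters after decoding
```

After decoding, the label holds one backslash where the apostrophe should be, so the doubled backslash is only the JSON encoding of that single character.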
I can still reproduce the apostrophe issue, and I discovered that it applies to quotation marks and backslashes as well:
$ curl 'https://dwd.isi.edu/api?q=Q10988133&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q10988133", "description": ["milk produced by female cattle"], "label": ["cow\\s milk"], "alias": ["cow milk", "whole milk", "milk"], "pagerank": 1.021701595451758e-06, "statements": 223, "score": 0.20568907}]
$ curl 'https://dwd.isi.edu/api?q=Q18336849&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q18336849", "description": ["(used to define the domain of the \\given name\\ property)"], "label": ["entity whose item has the given name property"], "alias": ["items with given name property"], "pagerank": 2.7525000872635554e-08, "statements": 106, "score": 0.0055411328}]
$ curl 'https://dwd.isi.edu/api?q=Q11185&extra_info=true&language=en&type=exact&is_class=true'
[{"qnode": "Q11185", "description": ["typographical mark (glyph) used mainly in computing"], "label": ["backslash"], "alias": ["reverse slant", "backslant", "bash", "reversed virgule", "reverse slash", "slosh", "\ud83d\ude7d", "escape", "\\\\", "whack", "backwhack", "hack"], "pagerank": 1.7101026535268105e-08, "statements": 155, "score": 0.0034427806}]
Q18336849 ("entity whose item has the given name property") contains quotation marks (") in its description, which are also replaced with \ in Wikidata. This means that I cannot accurately recover the original because a \ in the API output might have been either ' or " in the original.
Q11185 ("backslash") contains a backslash \ in its aliases that is returned as \\ by the API and is represented as \\\\ in the JSON. It's presumably escaped once in KGTK's processing and another time by conversion to JSON.
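The two layers can be reproduced in isolation. In the sketch below, escape_backslashes is a hypothetical stand-in for whatever escaping pass happens upstream (it is not a KGTK function, and the real processing may differ); json.dumps is the standard-library JSON encoder.

```python
import json


def escape_backslashes(text: str) -> str:
    """Hypothetical stand-in for an upstream escaping pass: double every backslash."""
    return text.replace("\\", "\\\\")


alias = "\\"                                   # the original alias: one backslash
after_first_pass = escape_backslashes(alias)   # two backslashes
as_json = json.dumps(after_first_pass)         # four backslashes inside JSON quotes

print(len(alias), len(after_first_pass))
print(as_json)

# Decoding the JSON removes only the outer layer of escaping.
decoded = json.loads(as_json)
print(decoded == after_first_pass)
```

Under that assumption, a JSON client gets back the two-backslash form, which matches what the API returns for this alias.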
Describe the bug
When searching the KGTK API endpoint for milk in Wikidata, the entry Q10988133 ("cow's milk") returns the label "cow\\s milk" instead.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The label should be "cow's milk", as it appears on the site.
Additional context
This bug seems to be independent of what is calling the API. I noticed it using Requests and was able to reproduce it with both curl and Firefox.