Closed ozancaglayan closed 5 years ago
We're glad you liked it! This looks like an import problem. This query works for me, and I have the following entry in my db:
D18-2029|cer2018a:universal|Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray|Universal Sentence Encoder for English|Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|2018|@InProceedings{cer2018a:universal,
but I have FTS support.
How did you import this entry? What happens if you run
bib add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
@davvil, any guesses what might have caused this?
Is there a way to quickly see whether sqlite is compiled with FTS support or not? But in any case, FTS or no FTS should only come into play in later parts of the processing, no?
UPDATE: There was a new release of pybtex yesterday, so I downgraded it to 0.21, but this snippet from bibdb.py still fails for me. I have entry.persons but not entry.fields['author']:
    if not entry.key:
        return False
    if not entry.fields.get("author"):
        entry.fields["author"] = "UNKNOWN"
EDIT: Tried this and it seems that it has FTS support:
sqlite> WITH opts(n, opt) AS (
...> VALUES(0, NULL)
...> UNION ALL
...> SELECT n + 1,
...> sqlite_compileoption_get(n)
...> FROM opts
...> WHERE sqlite_compileoption_get(n) IS NOT NULL
...> )
...> SELECT opt
...> FROM opts
...> WHERE opt LIKE '%FTS%';
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
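For anyone who wants to run the same check from Python rather than the sqlite shell, here is a minimal sketch. It assumes only the standard-library sqlite3 module; the helper names fts_compile_options and fts5_usable are made up for illustration and are not part of bibsearch. Note that it inspects the SQLite library that Python itself is linked against, which may differ from the sqlite3 command-line binary.

```python
import sqlite3


def fts_compile_options(conn):
    """Return the compile-time options whose name mentions FTS."""
    rows = conn.execute("PRAGMA compile_options").fetchall()
    return sorted(opt for (opt,) in rows if "FTS" in opt)


def fts5_usable(conn):
    """Try to create a throwaway FTS5 table; failure means FTS5 is unavailable.

    Intended for a scratch connection, since it briefly creates a table.
    """
    try:
        conn.execute("CREATE VIRTUAL TABLE fts_probe USING fts5(body)")
    except sqlite3.OperationalError:
        return False
    conn.execute("DROP TABLE fts_probe")
    return True


if __name__ == "__main__":
    # An in-memory database is enough; nothing is written to disk.
    conn = sqlite3.connect(":memory:")
    print(fts_compile_options(conn))
    print("FTS5 usable:", fts5_usable(conn))
```

The second helper is the more direct test, since it exercises the module instead of relying on how the build options happen to be named.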
$ rm -rf .bibsearch
$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch find Yinfei
$ bibsearch print
@InProceedings{unknown2018:universal,
author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
title = "Universal Sentence Encoder for English",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "169--174",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/D18-2029",
author = "UNKNOWN",
original_key = "D18-2029"
}
I don't think this has to do with FTS, as FTS involves only search and the entry is found. What is quite suspicious is that the bibtex entry has two author fields, the second one being "UNKNOWN", which is probably what is taken for generating the key. I also suspect a problem when importing/parsing the entry. Unfortunately I am also not able to reproduce it on my system :-(
What version of pybtex are you using? Did this happen with an empty database or did you include the entry after adding others?
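To illustrate the suspicion about the key, here is a rough sketch of a key scheme of the form surname + year + ":" + first title word. This is an assumption for illustration only and not bibsearch's actual key generation; the function name sketch_key is hypothetical, and it does not model the letter suffix visible in cer2018a:universal.

```python
import re


def sketch_key(author_field, year, title):
    """Hypothetical key: first author's surname + year + ':' + first title word."""
    # BibTeX separates authors with " and "; the surname precedes the comma.
    first_author = author_field.split(" and ")[0]
    surname = first_author.split(",")[0].strip().lower()
    surname = re.sub(r"[^a-z]", "", surname)
    words = title.split()
    first_word = re.sub(r"[^a-z]", "", words[0].lower()) if words else ""
    return f"{surname}{year}:{first_word}"


if __name__ == "__main__":
    title = "Universal Sentence Encoder for English"
    print(sketch_key("Cer, Daniel and Yang, Yinfei", "2018", title))
    print(sketch_key("UNKNOWN", "2018", title))
```

Under this assumed scheme, an author field overwritten with "UNKNOWN" yields unknown2018:universal, which matches the key in the printed entry above.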
Yes, this is quite strange to see two author fields. I agree it looks like a parsing error. Do you have time and interest to debug this? And can you provide more details about your environment (OS, python version, etc)?
Oh, I didn't see the second author field above. Yes, let me dig into it a little. Sorry for repeatedly editing my previous comment; you probably did not receive separate notifications for the edits. I tried with pybtex 0.21 and 0.22 and got the same result.
No need to apologize—we're happy to have someone point out a bug and go through the work of trying to fix it. I think it should be easy to track down: either pybtex parsing is broken (which would be strange, since this entry is fairly standard), or our code is broken. I'm curious what pybtex.Entry items look like here after parsing.
I think there's a very weird thing going on. I also tried on my desktop, same issue. The problem is this: pybtex.Entry never has an author field for me, and that's why the code injects an UNKNOWN author. For me, all authors are inside entry.persons['author']. But then when the code asks for a pretty print of the entry, the authors are there. This is also how pybtex documents it, see this: https://docs.pybtex.org/api/parsing.html
>>> from pybtex.database import parse_file
>>> bib_data = parse_file('../examples/tugboat/tugboat.bib')
>>> print(bib_data.entries['Knuth:TB8-1-14'].fields['title'])
Mixing right-to-left texts with left-to-right texts
>>> for author in bib_data.entries['Knuth:TB8-1-14'].persons['author']:
...     print(str(author))
Knuth, Donald
MacKay, Pierre
This makes me wonder whether we are using two completely different pybtex versions, i.e. maybe an old fork which provided the authors within fields, and the one that gets installed (for me) through pip, which does not seem to provide this?
What happens if you try to import this version?
@InProceedings{D18-2029,
author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray},
title = {Universal Sentence Encoder for English},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
year = {2018},
publisher = {Association for Computational Linguistics},
pages = {169--174},
location = {Brussels, Belgium},
url = {http://aclweb.org/anthology/D18-2029}
}
Saved it into a local file foo.tex and then:
(base) [silver] ~ $ ipython -i `which bibsearch` -- add foo.tex
Python 3.6.6 |Anaconda custom (64-bit)| (default, Oct 9 2018, 12:34:16)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
0%| | [Elapsed: 00:00 ETA: ?]> /home/caglayan/git/bibsearch/bibsearch/bibsearch.py(330)_add_file()
329 ipdb.set_trace()
--> 330 if db.add(entry):
331 added += 1
ipdb> 'author' in entry.fields
False
ipdb> entry
Entry('inproceedings', fields=[('title', 'Universal Sentence Encoder for English'), ('booktitle', 'Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations'), ('year', '2018'), ('publisher', 'Association for Computational Linguistics'), ('pages', '169--174'), ('location', 'Brussels, Belgium'), ('url', 'http://aclweb.org/anthology/D18-2029')], persons=OrderedCaseInsensitiveDict([('author', [Person('Cer, Daniel'), Person('Yang, Yinfei'), Person('Kong, Sheng-yi'), Person('Hua, Nan'), Person('Limtiaco, Nicole'), Person('St. John, Rhomni'), Person('Constant, Noah'), Person('Guajardo-Cespedes, Mario'), Person('Yuan, Steve'), Person('Tar, Chris'), Person('Strope, Brian'), Person('Kurzweil, Ray')])]))
Can you try to run this file?
#!/usr/bin/env python
import pybtex.database
BIBTEX="""\
@InProceedings{D18-2029,
author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray},
title = {Universal Sentence Encoder for English},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
year = {2018},
publisher = {Association for Computational Linguistics},
pages = {169--174},
location = {Brussels, Belgium},
url = {http://aclweb.org/anthology/D18-2029}
}"""
if __name__ == '__main__':
    print('Pybtex version: ', pybtex.__version__)
    library = pybtex.database.parse_string(BIBTEX, 'bibtex')
    entry = library.entries['D18-2029']
    print('author in entry.fields?', 'author' in entry.fields)
    print('author in entry.persons?', 'author' in entry.persons)
Output:
Pybtex version: 0.22.0
author in entry.fields? False
author in entry.persons? True
Pybtex version: 0.21
author in entry.fields? False
author in entry.persons? True
Then how can you escape from having an UNKNOWN author field? I don't get it, or maybe I'm missing something about the way the code works.
def add(self, entry: pybtex.Entry):
    """ Returns if the entry was added or if it was a duplicate"""
    # TODO: make this a better sanity checking and perhaps report errors
    if not entry.key:
        return False
    if not entry.fields.get("author"):
        entry.fields["author"] = "UNKNOWN"
Yes, I don't understand either. I tried downgrading to pybtex 0.20.0 and even 0.19.0, but still get False on author in entry.fields, even when I change the bib file to one that was imported correctly just yesterday.
I'll have to look into this later tonight, or maybe @davvil has an idea. This is strange.
If you remove your already generated bibdb, can you still add this entry correctly with author information?
Yeah, it works perfectly.
$ mv ~/.bibsearch ~/.bibsearch.bak
$ bibsearch add d.bib # contains the entry we're playing with
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch print
@InProceedings{cer2018a:universal,
author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
title = "Universal Sentence Encoder for English",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "169--174",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/D18-2029",
original_key = "D18-2029"
}
Looking through the code of pybtex, I see that the author field is never carried along as a string; it is parsed directly into Persons. Thus, I still can't see how the code paths in bibsearch that access entry.fields['authors'] and parse it with parse_names can work.
You can try to reproduce it here: https://colab.research.google.com/gist/ozancaglayan/708fe24d2cecd67645e48943848af41f/bibsearch.ipynb
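One way the check in add() could avoid injecting UNKNOWN is to fall back to entry.persons when the author field is missing. The sketch below is only an illustration of that idea and is not the fix that was committed to bibsearch. The helper names author_string and ensure_author are hypothetical, the entry is a stand-in object rather than a real pybtex entry so the snippet runs without pybtex, and it assumes that str() of a person renders as "Last, First", as in the pybtex docs example.

```python
from types import SimpleNamespace


def author_string(entry):
    """Return a BibTeX-style author string for an entry-like object.

    Prefers entry.fields["author"] when present; otherwise joins
    entry.persons["author"] with " and ". Returns None if neither exists.
    """
    author = entry.fields.get("author")
    if author:
        return author
    people = entry.persons.get("author", [])
    if people:
        return " and ".join(str(p) for p in people)
    return None


def ensure_author(entry):
    """Fill fields["author"], using "UNKNOWN" only when no authors exist."""
    entry.fields["author"] = author_string(entry) or "UNKNOWN"
    return entry


if __name__ == "__main__":
    # Stand-in for a parsed entry: authors live in .persons, not .fields.
    entry = SimpleNamespace(
        fields={"title": "Universal Sentence Encoder for English", "year": "2018"},
        persons={"author": ["Cer, Daniel", "Yang, Yinfei"]},
    )
    print(ensure_author(entry).fields["author"])
```

With a real pybtex entry the same two attributes, fields and persons, would be read in the same way; only the stand-in object would change.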
Sorry about the delay—I'll pick this up after NAACL.
Again, sorry about the delay. I was able to reproduce the issue on another computer and I have committed a fix for it. Please try the current master on GitHub, which should address this issue. After the three of us do some testing, we should update the pip package ASAP.
I am also baffled as to why it worked before. Perhaps we were relying on a byproduct of the parsing itself.
It seems to work on my side for the specific example above.
$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
100%|██████████████████████████████████████████████| [Elapsed: 00:00 ETA: 00:00]
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch print
@InProceedings{D18-2029,
author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
title = "Universal Sentence Encoder for English",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "169--174",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/D18-2029",
original_key = "D18-2029"
}
$ bibsearch find Yinfei
1. [D18-2029] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole
Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes,
Steve Yuan, Chris Tar, Brian Strope and Ray Kurzweil. 2018.
"Universal Sentence Encoder for English". Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing:
System Demonstrations. http://aclweb.org/anthology/D18-2029
I just fixed the key generation (it took the original key before) and I also fixed an error when importing entries with unknown macros. It seems to work quite well now. @mjpost what do you think? Can you update PyPI?
Sure, can you bump the version and add to the change log? Then I'll push.
Done! We are now at version π.
Pushed to pypi.
Hello,
Thanks for this wonderful project that I discovered this morning. I'm not sure if this is related to sqlite3 having no support for FTS, but I have a problem with author names (both during search and in the returned keys):
Looking through the sqlite file, I see this: