mjpost / bibsearch

Download, manage, and search a BibTeX database.

no author information in key #41

Closed ozancaglayan closed 5 years ago

ozancaglayan commented 5 years ago

Hello,

Thanks for this wonderful project, which I discovered this morning. I'm not sure whether this is related to sqlite3 lacking FTS support, but I have a problem with author names (both during search and in the returned keys):

$ bibsearch search Yinfei
$ bibsearch search "Sentence Encoder"
1. [unknown2018:universal] . 2018. "Universal Sentence Encoder for
   English". Proceedings of the 2018 Conference on Empirical Methods
   in Natural Language Processing: System Demonstrations.
   http://aclweb.org/anthology/D18-2029

Looking through the sqlite file, I see this:

D18-2029|unknown2018:universal|UNKNOWN|Universal Sentence Encoder for English|Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|2018|@InProceedings{unknown2018:universal,         
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",      
    title = "Universal Sentence Encoder for English",
...

mjpost commented 5 years ago

We're glad you liked it! This looks like an import problem. This query works for me, and I have the following entry in my db:

D18-2029|cer2018a:universal|Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray|Universal Sentence Encoder for English|Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|2018|@InProceedings{cer2018a:universal,

but I have FTS support.

How did you import this entry? What happens if you run

bib add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib

@davvil, any guesses what might have caused this?

ozancaglayan commented 5 years ago

Is there a way to quickly see whether sqlite is compiled with FTS support? In any case, FTS or no FTS should only matter in later parts of the processing, no?

UPDATE: A new release of pybtex came out yesterday, so I downgraded to 0.21, but this snippet from bibdb.py still fails for me: I have entry.persons but not entry.fields['author'].

        if not entry.key:
            return False
        if not entry.fields.get("author"):
            entry.fields["author"] = "UNKNOWN"

EDIT: Tried this and it seems that it has FTS support:

sqlite> WITH opts(n, opt) AS (
   ...>   VALUES(0, NULL)
   ...>   UNION ALL
   ...>   SELECT n + 1,
   ...>          sqlite_compileoption_get(n)
   ...>   FROM opts
   ...>   WHERE sqlite_compileoption_get(n) IS NOT NULL
   ...> )
   ...> SELECT opt
   ...> FROM opts
   ...> WHERE opt LIKE '%FTS%';
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
$ rm -rf .bibsearch
$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch find Yinfei
$ bibsearch print
@InProceedings{unknown2018:universal,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "169--174",
    location = "Brussels, Belgium",
    url = "http://aclweb.org/anthology/D18-2029",
    author = "UNKNOWN",
    original_key = "D18-2029"
}
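
A shorter check than the recursive query above, for anyone with the same question: SQLite also reports its build flags through `PRAGMA compile_options`, and reading it from Python's standard `sqlite3` module inspects the SQLite library that Python itself links against (which may differ from the one behind the `sqlite3` command-line shell). This is only a sketch; the helper name `fts_compile_options` is made up for illustration, and a build compiled without compile-option diagnostics may report nothing even when FTS is available.

```python
import sqlite3


def fts_compile_options(conn):
    """Return the FTS-related compile options reported by this SQLite build.

    `fts_compile_options` is a hypothetical helper name, not part of
    bibsearch or of the sqlite3 module.
    """
    # Each row of PRAGMA compile_options is a 1-tuple such as ('ENABLE_FTS5',).
    rows = conn.execute("PRAGMA compile_options").fetchall()
    return sorted(opt for (opt,) in rows if "FTS" in opt)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    options = fts_compile_options(conn)
    print("SQLite library version:", sqlite3.sqlite_version)
    print("FTS options:", options if options else "none reported")
    conn.close()
```

An empty list here would suggest (but not prove) that the Python-side SQLite lacks FTS.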

davvil commented 5 years ago

I don't think this has to do with FTS, as FTS involves only search and the entry is found. What is quite suspicious is that the bibtex entry has two author fields, the second one being "UNKNOWN", which is probably what is taken for generating the key. I also suspect a problem when importing/parsing the entry. Unfortunately I am also not able to reproduce it on my system :-(
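
To illustrate why a placeholder author would leak into the key, here is a rough sketch of a key generator that produces keys shaped like the ones in this thread (`unknown2018:universal`). The function `make_key` and its format are assumptions inferred only from those visible keys, not bibsearch's actual implementation, and the disambiguating letter seen in `cer2018a:universal` is deliberately left out.

```python
import re


def make_key(author_field, year, title):
    """Build a '<lastname><year>:<first title word>' key.

    Hypothetical helper: the format is guessed from the keys shown in
    this thread and omits any suffix used to disambiguate duplicates.
    """
    # BibTeX separates authors with " and "; take the first one.
    first_author = author_field.split(" and ")[0]
    # Names are assumed to be in "Last, First" form, as in the entry above.
    last_name = first_author.split(",")[0]
    last_name = re.sub(r"[^a-z]", "", last_name.lower())
    words = re.findall(r"[a-z]+", title.lower())
    first_word = words[0] if words else "untitled"
    return f"{last_name}{year}:{first_word}"


if __name__ == "__main__":
    title = "Universal Sentence Encoder for English"
    # With the real author list the key starts with the first author's surname.
    print(make_key("Cer, Daniel and Yang, Yinfei", "2018", title))
    # With the injected placeholder the key starts with "unknown".
    print(make_key("UNKNOWN", "2018", title))
```

Under these assumptions, whichever author value reaches the key generator decides the prefix, so an injected "UNKNOWN" would explain the key in the report.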

What version of pybtex are you using? Did this happen with an empty database or did you include the entry after adding others?

mjpost commented 5 years ago

Yes, this is quite strange to see two author fields. I agree it looks like a parsing error. Do you have time and interest to debug this? And can you provide more details about your environment (OS, python version, etc)?

ozancaglayan commented 5 years ago

Oh, I didn't see the second author field above. Yes, let me dig into it a little. Sorry for repeatedly editing my previous comment; you probably did not receive separate notifications for those edits. I tried pybtex 0.21 and 0.22 and got the same result.

mjpost commented 5 years ago

No need to apologize—we're happy to have someone point out a bug and go through the work of trying to fix it. I think it should be easy to track down: either pybtex parsing is broken (which would be strange, since this entry is fairly standard), or our code is broken. I'm curious what pybtex.Entry items look like here after parsing.

ozancaglayan commented 5 years ago

I think something very weird is going on. I also tried on my desktop and hit the same issue. The problem is this: pybtex.Entry never has an author field for me, and that's why the code injects an UNKNOWN author. For me, all authors are inside entry.persons['author']. But when the code then asks for a pretty print of the entry, the author = line is there. This matches what the pybtex documentation describes; see https://docs.pybtex.org/api/parsing.html

>>> from pybtex.database import parse_file
>>> bib_data = parse_file('../examples/tugboat/tugboat.bib')
>>> print(bib_data.entries['Knuth:TB8-1-14'].fields['title'])
Mixing right-to-left texts with left-to-right texts
>>> for author in bib_data.entries['Knuth:TB8-1-14'].persons['author']:
...     print(unicode(author))
Knuth, Donald
MacKay, Pierre

This makes me wonder whether we are using two completely different versions of pybtex, i.e. maybe an old fork that provided the authors within fields, versus the one that gets installed (for me) through pip, which does not seem to provide this.

mjpost commented 5 years ago

What happens if you try to import this version?

@InProceedings{D18-2029,
  author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray},
  title = {Universal Sentence Encoder for English},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2018},
  publisher = {Association for Computational Linguistics},
  pages = {169--174},
  location = {Brussels, Belgium},
  url = {http://aclweb.org/anthology/D18-2029}
}

ozancaglayan commented 5 years ago

Saved it into a local file foo.tex and then:

(base) [silver] ~ $ ipython -i `which bibsearch` -- add foo.tex 
Python 3.6.6 |Anaconda custom (64-bit)| (default, Oct  9 2018, 12:34:16) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
  0%|                                                  | [Elapsed: 00:00 ETA: ?]> /home/caglayan/git/bibsearch/bibsearch/bibsearch.py(330)_add_file()
    329         ipdb.set_trace()
--> 330         if db.add(entry):
    331             added += 1

ipdb> 'author' in entry.fields                                                                                                                                                                                                                 
False
ipdb> entry                                                                                                                                                                                                                                    
Entry('inproceedings', fields=[('title', 'Universal Sentence Encoder for English'), ('booktitle', 'Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations'), ('year', '2018'), ('publisher', 'Association for Computational Linguistics'), ('pages', '169--174'), ('location', 'Brussels, Belgium'), ('url', 'http://aclweb.org/anthology/D18-2029')], persons=OrderedCaseInsensitiveDict([('author', [Person('Cer, Daniel'), Person('Yang, Yinfei'), Person('Kong, Sheng-yi'), Person('Hua, Nan'), Person('Limtiaco, Nicole'), Person('St. John, Rhomni'), Person('Constant, Noah'), Person('Guajardo-Cespedes, Mario'), Person('Yuan, Steve'), Person('Tar, Chris'), Person('Strope, Brian'), Person('Kurzweil, Ray')])]))

ozancaglayan commented 5 years ago

Can you try to run this file?

#!/usr/bin/env python
import pybtex.database

BIBTEX="""\
   @InProceedings{D18-2029,
      author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray},
      title = {Universal Sentence Encoder for English},
      booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
      year = {2018},
      publisher = {Association for Computational Linguistics},
      pages = {169--174},
      location = {Brussels, Belgium},
      url = {http://aclweb.org/anthology/D18-2029}
    }"""

if __name__ == '__main__':
    print('Pybtex version: ', pybtex.__version__)

    library = pybtex.database.parse_string(BIBTEX, 'bibtex')
    entry = library.entries['D18-2029']
    print('author in entry.fields?', 'author' in entry.fields)
    print('author in entry.persons?', 'author' in entry.persons)

Output:

Pybtex version:  0.22.0
author in entry.fields? False
author in entry.persons? True

mjpost commented 5 years ago

Pybtex version:  0.21
author in entry.fields? False
author in entry.persons? True

ozancaglayan commented 5 years ago

Then how do you avoid ending up with the UNKNOWN author field? I don't get it, or maybe I'm missing something about the way the code works.

    def add(self, entry: pybtex.Entry):
        """ Returns if the entry was added or if it was a duplicate"""

        # TODO: make this a better sanity checking and perhaps report errors
        if not entry.key:
            return False
        if not entry.fields.get("author"):
            entry.fields["author"] = "UNKNOWN"

mjpost commented 5 years ago

Yes, I don't understand either. I tried downgrading to pybtex 0.20.0 and even 0.19.0, but still get False on author in entry.fields, even when I change the bib file to one that was imported correctly just yesterday.

I'll have to look into this later tonight, or maybe @davvil has an idea. This is strange.

ozancaglayan commented 5 years ago

If you remove your already generated bibdb, can you still add this entry correctly with author information?

mjpost commented 5 years ago

Yeah, it works perfectly.

$ mv ~/.bibsearch ~/.bibsearch.bak
$ bibsearch add d.bib # contains the entry we're playing with
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch print
@InProceedings{cer2018a:universal,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "169--174",
    location = "Brussels, Belgium",
    url = "http://aclweb.org/anthology/D18-2029",
    original_key = "D18-2029"
}

ozancaglayan commented 5 years ago

Looking through the pybtex code, I see that the author field is never carried along as a string; it is parsed directly into Person objects. So I still can't see how the code paths in bibsearch that access entry.fields['author'] and parse it with parse_names can work.
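
One possible way around this, sketched under assumptions: fall back to entry.persons when the author field is missing, and only then to the placeholder. The helper `author_string` is hypothetical and is not claimed to be what bibsearch does or did. To keep the sketch runnable without pybtex installed, it uses `SimpleNamespace` stand-ins with plain strings in place of pybtex Person objects, relying only on the behaviour shown in the documentation snippet above, where converting a person to text yields "Last, First".

```python
from types import SimpleNamespace


def author_string(entry):
    """Return an author string for an entry, preferring real data.

    Hypothetical helper: `entry` only needs `.fields` and `.persons`
    mappings, mirroring the attributes seen on pybtex entries above.
    """
    author = entry.fields.get("author")
    if author:
        return author
    # Rebuild the BibTeX-style "A and B" string from the parsed persons.
    persons = entry.persons.get("author", [])
    if persons:
        return " and ".join(str(person) for person in persons)
    # Only now fall back to the placeholder.
    return "UNKNOWN"


if __name__ == "__main__":
    # Stand-in for a parsed entry: no 'author' in fields, authors in persons.
    entry = SimpleNamespace(
        fields={"title": "Universal Sentence Encoder for English"},
        persons={"author": ["Cer, Daniel", "Yang, Yinfei"]},
    )
    print(author_string(entry))
```

With a check like this, the placeholder would only be injected when neither fields nor persons carries any author information.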

ozancaglayan commented 5 years ago

You can try to reproduce in here: https://colab.research.google.com/gist/ozancaglayan/708fe24d2cecd67645e48943848af41f/bibsearch.ipynb

mjpost commented 5 years ago

Sorry about the delay—I'll pick this up after NAACL.

davvil commented 5 years ago

Again, sorry about the delay. I was able to reproduce the issue on another computer and I have committed a fix for it. Please try the current master on GitHub, which should address this issue. After the three of us do some testing, we should update the pip package ASAP.

I am also baffled as to why it worked before. Perhaps we were relying on a byproduct of the parsing itself.

ozancaglayan commented 5 years ago

It seems to work on my side for the specific example above.


$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
100%|██████████████████████████████████████████████| [Elapsed: 00:00 ETA: 00:00]
Added 1 entries, skipped 0 duplicates. Skipped 0 files

$ bibsearch print
@InProceedings{D18-2029,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "169--174",
    location = "Brussels, Belgium",
    url = "http://aclweb.org/anthology/D18-2029",
    original_key = "D18-2029"
}

$ bibsearch find Yinfei
1. [D18-2029] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole
   Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes,
   Steve Yuan, Chris Tar, Brian Strope and Ray Kurzweil. 2018.
   "Universal Sentence Encoder for English". Proceedings of the 2018
   Conference on Empirical Methods in Natural Language Processing:
   System Demonstrations. http://aclweb.org/anthology/D18-2029

davvil commented 5 years ago

I just fixed the key generation (it used the original key before) and I also fixed an error when importing entries with unknown macros. It seems to work quite well now. @mjpost what do you think? Can you update PyPI?

mjpost commented 5 years ago

Sure, can you bump the version and add to the change log? Then I'll push.

davvil commented 5 years ago

Done! We are now at version π.

mjpost commented 5 years ago

Pushed to PyPI.