perrette / papers

Command-line tool to manage bibliography (pdfs + bibtex)
MIT License

scholarly.scholarly not found? #28

Closed boyanpenkov closed 1 year ago

boyanpenkov commented 1 year ago

Hello folks -- @perrette first off, very very glad to see this fantastic project, and very much considering replacing my workflow (that relies on one of the particularly outdated proprietary packages you have listed on your front page) entirely with this! Thanks kindly! Thing is, I have a local library of 80 GB of PDFs that's a good set of test cases here...

When I try papers extract yanofsky_qc.pdf --scholar, which should work, I get:

ModuleNotFoundError: No module named 'scholarly.scholarly'

This is with pip install papers-cli which may be out of date...

Anybody else seeing this?

boyanpenkov commented 1 year ago

If I just run papers extract yanofsky_qc.pdf, this does return a correctly formatted bibtex entry, but it happens to be the wrong one, which is why I want to try Google Scholar here...

perrette commented 1 year ago

Hi @boyanpenkov, thanks for the feedback. This package is not far from usable, but unfortunately it does require some more work to make it actually useful. And as you point out, it seems outdated w.r.t. some dependencies. I'll see whether I can at least fix those later today.

boyanpenkov commented 1 year ago

Super -- thanks kindly; would be glad to help out here, especially since I have a significant set of test cases to check against, so please let me know if there's any snippets you'd like me to run (completely serious!).

My workflow for the last 12 years has been:

-- dump PDFs in folder, read them in emacs
-- use "proprietary solution" to get PDF metadata, rename file appropriately and cp it to "organized" folder or subfolder
-- per PDF, add Bibtex metadata to library.bib that my individual paper repos then depend on.

I started writing code to reproduce this workflow yesterday, and got as far as validating DOIs using https://github.com/MicheleCotrufo/pdf2doi and https://pypi.org/project/isbnlib/ before I realized the fuzzy matching here was the way to go, since the error rate is pretty high. I look forward to trying to reproduce this workflow using papers, and making contributions here...

perrette commented 1 year ago

I was not aware of pdf2doi. Actually it would make sense to concentrate efforts in one project to extract the proper DOI, and then re-use it in projects like this one. But well, again it needs time.

More directly though, for the scholarly issue, you simply need to install the dependency: pip install -U scholarly

boyanpenkov commented 1 year ago

> I was not aware of pdf2doi. Actually it would make sense to concentrate efforts in one project to extract the proper DOI, and then re-use it in projects like this one. But well, again it needs time.

Yep, and I think papers got started first here, and the feature-set is closer to what I'm after...

> More directly though, for the scholarly issue, you simply need to install the dependency: pip install -U scholarly

Regrettably, I have to confirm I did have scholarly installed. On my system:

Python 3.8.16 (default, Mar  2 2023, 03:21:46) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scholarly
>>> import scholarly.scholarly
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'scholarly.scholarly'
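
A session like the one above shows that the package imports while the submodule does not, which usually points at a version mismatch rather than a missing install. A minimal sketch of a check that reports both facts at once, using only the standard library; the helper name `probe_module` is hypothetical and not part of papers or scholarly:

```python
import importlib.util
from importlib import metadata


def probe_module(dist_name, module_name):
    """Return (installed_version_or_None, module_is_importable)."""
    try:
        # Version of the installed distribution, as pip sees it.
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    try:
        # For a dotted name, find_spec imports the parent package first,
        # so a missing parent raises ModuleNotFoundError.
        found = importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        found = False
    return version, found
```

Calling `probe_module("scholarly", "scholarly.scholarly")` in the affected environment would show which scholarly version is installed and whether the submodule that papers tries to import exists in it.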

perrette commented 1 year ago

Before I look into your issue in more detail, would you mind testing the current dev branch? Do you know how to do that? I think the latest version works OK (the tests run fine -- though I might not have tests for scholarly).

Download https://github.com/perrette/papers/archive/refs/heads/dev.zip and, from the extracted dir, pip install . should work OK

I'll check (and if necessary fix) in a few hours. And later update on pypi.

perrette commented 1 year ago

I confirm this was fixed here: https://github.com/perrette/papers/commit/1b661e5277832e4876d03eabda71267e74c1a709 (soon to be merged into master).

boyanpenkov commented 1 year ago

Super -- thanks kindly! I pulled your archive down, and installed it. However, the traceback now reads:

(python311) → testing renamer/stage papers extract yanofsky_qc.pdf --scholar                                                20:09:24
Traceback (most recent call last):
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/bin/papers", line 5, in <module>
    papers.bib.main()
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/bib.py", line 1388, in main
    extractcmd(o)
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/bib.py", line 1313, in extractcmd
    print(extract_pdf_metadata(o.pdf, search_doi=not o.fulltext, search_fulltext=True, scholar=o.scholar, minwords=o.word_count, max_query_words=o.word_count, image=o.image))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/extract.py", line 206, in extract_pdf_metadata
    return extract_txt_metadata(txt, search_doi, search_fulltext, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/extract.py", line 193, in extract_txt_metadata
    bibtex = fetch_bibtex_by_fulltext_scholar(query_txt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/config.py", line 224, in decorated
    res = cache[key] = fun(doi)
                       ^^^^^^^^
  File "/home/boyan/boyanshouse/miniconda3/envs/python311/lib/python3.11/site-packages/papers/extract.py", line 258, in fetch_bibtex_by_fulltext_scholar
    score = _scholar_score(txt, res.bib)
                                ^^^^^^^
AttributeError: 'dict' object has no attribute 'bib'

Please note that this is on python 3.11 now, instead of the 3.8 I was testing on this morning (chardet would not play nice with that one...).

If this is getting annoying and you can tell me what your CONTRIBUTING.md looks like, I can try to debug...
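
The AttributeError in the traceback above says that the search result is a plain dict, while the code reads a `.bib` attribute. One plausible cause (an assumption, not verified against the papers code base or a specific scholarly release) is that the installed scholarly returns dicts with a "bib" key where older releases returned objects with a `.bib` attribute. A minimal sketch of an accessor that tolerates both shapes; `get_bib` is a hypothetical helper, not an existing function in either project:

```python
from types import SimpleNamespace  # only used below to mimic an object-style result


def get_bib(result):
    """Return the bibliographic fields of a search result as a dict.

    Accepts either a dict with a "bib" key or an object with a .bib attribute.
    """
    if isinstance(result, dict):
        return result.get("bib", {})
    return getattr(result, "bib", {})


# Both result shapes yield the same bibliographic dict.
old_style = SimpleNamespace(bib={"title": "A"})
new_style = {"bib": {"title": "A"}}
assert get_bib(old_style) == get_bib(new_style)
```

With a helper like this, the failing line could pass `get_bib(res)` instead of `res.bib`, which would also make the behaviour easy to cover with a small unit test.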

perrette commented 1 year ago

Hi, unfortunately I don't have python 3.11 installed right now. I just finished implementing other long-awaited changes and pushed a version 2 to pypi. You might try that one (pip install -U papers-cli), though I don't think I did any work on scholarly, so I don't expect it will fix your issue. And I do not have structured contribution guidelines to offer at this point, sorry. Others have just cloned and made a pull request. If you have specific questions I am glad to answer.

Here we have one of two situations:

If you'd like me to take a look, you can just send me your PDF and sum up the set of commands causing the issue. Disconnecting for now. Not sure when I'll have time again...

In case you find what's wrong, it would be great to add a test, too.

perrette commented 1 year ago

In any case, papers extract --scholar somepaper.pdf definitely works for me. For now I'll just class this as not reproducible until there is news.

perrette commented 1 year ago

I updated to v2.1 with better pip/pyproject.toml distribution. Locally it also passes the tests with py311 (does not work with github CI + tox yet). I'm closing it for now. Please re-open if the issue persists.

boyanpenkov commented 1 year ago

Ok, after some more poking around, I do see that with a bunch of other pdfs, both --scholar and without --scholar work, so the issue could be specific to the subset of files I had chosen. To confirm, this is on papers-cli 2.1.1, running under the 3.11 interpreter.
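
To narrow down which files in a larger library trigger the failure, the same command can be run over a folder and the failing PDFs collected. A rough sketch, assuming the `papers` executable is on the PATH and exits with a non-zero status when it raises; the function name `find_failures` and its parameters are made up for illustration:

```python
import subprocess
from pathlib import Path


def find_failures(pdf_dir, base_cmd=("papers", "extract"), extra_args=("--scholar",)):
    """Run the command once per PDF and map failing file names to the last stderr line."""
    failures = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        proc = subprocess.run(
            [*base_cmd, str(pdf), *extra_args],
            capture_output=True,
            text=True,
        )
        if proc.returncode != 0:
            # The last stderr line of a Python traceback is the exception message.
            lines = proc.stderr.strip().splitlines()
            failures[pdf.name] = lines[-1] if lines else "(no stderr)"
    return failures
```

For example, `find_failures("stage")` would return a dict of file name to error message for the folder used above, which makes it easy to see whether the failures share one exception.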

I think I'll clone and poke around, and issue PRs as needed; re: CONTRIBUTING.md, if you end up wanting things like flake8 or black, please let me know and I'll end up cleaning them up.

Again, thanks for your responsiveness here, and I look forward to seeing what's up!