papis / papis

Powerful and highly extensible command-line based document and bibliography manager.
http://papis.readthedocs.io/en/latest/
GNU General Public License v3.0
1.43k stars 100 forks source link

Thoughts on Performance and Scalability #321

Open j-c-w opened 3 years ago

j-c-w commented 3 years ago

Hi all,

As my library size grows, I have noticed papis slowing down considerably. I have previously looked into making papis faster, but have come up short on ideas for how to fix the issue. Looking at performance profiles, the obvious way forward seems to be parallelization, but Alejandro has already tried that, and my own experiments with multiprocessing confirm what he found in (I can't find the issue number but I know I have seen it, anyway, Python multiprocessing doesn't seem to work to solve the problem).

The response times of papis are on the very slow end for me right now, I'm looking at 10+ second response times, which is getting to be too much. I have been using whoosh for some time, but it doesn't seem to skip the challenging part of this for papis, which is the file loading.

Now, I also want to point out that more performance should be possible. In my library of 737 items, running:

time papis list -a "year:2017" takes 11.99s, 11.80, 12.16

Grep performs much better (on a much harder task as it also scans PDF files!): time grep -nr '2017' ./* takes 5.47, 5.72, 6.05s

Excluding PDF files from grep: time grep -nr '2017' ./*.yaml takes 0.11s, 0.13s, 0.11s

rip-grep performs better yet: time rg 2017 0.31s, 0.28s, 0.29

So clearly, the disk speed is not the limiting factor. Perhaps someone with an understanding of Python file-handling can suggest some fixes? Or at least why it is slow and multiprocessing doesn't seem to help?

I'm happy to have a look at implementing ideas that seem sensible.

I do have some ideas to skirt this problem without addressing the real problem at hand (i.e. the slow file loading):

  1. A ripgrep/grep-based backend. Leave the files on the disk, and use ripgrep to produce a list that can be handled by papis (which can then load all the matching yaml files)
  2. A unified index file, where the content of the yaml files is joined into a single file. THis should allow for much faster searching by reducing the number of files that have to be opened to perform a search (would have to be recreated on reindexing, but should be fairly fast, on my library, cat */*.yaml > all.yaml takes approx 0.2s.)

I'm keen to hear other suggestions, and thoughts on the options I've suggested above, because this performance issue is starting to get a little bit out of hand in my library.

Jackson

alejandrogallo commented 3 years ago

Hi!

Thank you for this comment.

Quite frankly I'm a little bit puzzled by your timings, for me, doing

time papis list -a | wc -l

gets me

962

real    0m0.411s
user    0m0.376s
sys 0m0.040s

and doing a filtering

time papis list -a year:1981 | wc -l

ends in

6

real    0m0.538s
user    0m0.492s
sys     0m0.134s

I do not wish to jump into hasty conclusions, but it kind of seems like you're not using the cache

time papis -s use-cache false list -a year:1981 | wc -l

results in

6

real    0m5.094s
user    0m31.306s
sys     0m0.290s

which would be coherent with your timings (kind of). I'd like to get to the bottom of this.

If needed be, I think a (rip)grep backend would be doable and useful. However I haven't come across the need for this.

Could you tell me what

papis config use-cache

gives?

So that you understand exactly how the papis backend works, it's almost deceptively simple, in principle I create the list of all documents which gets created by reading every yaml file in every document. The parsing of the yaml documents is slow, but this is only done at indexing time. When this list is there I serialize it and store it in

~/.cache/papis/database/e700d592d0806fcab1fc70d2b6beac96-papers

which is just a pickle serialization of the list. Later, reading the documents is just done by deserializing this file, and it runs fast. That's why I can't understand your timings, it would only make sense if your use-cache is set to False, however I might be wrong and it would be interesting if other people share their experience.

Cheers!

j-c-w commented 3 years ago

Hi Alejandro,

Running

papis config use-cache results in True.

Here's my config:

[papers]
dir = ~/Documents/papers
                                                                           [settings]
default-library = papers                                                   
database-backend = whoosh
whoosh-schema-fields = ['year', 'author', 'publisher', 'journal', 'pages', 
'group', 'ref']
opentool = qpdfview --unique
opentool-txt = gvim
editor = vim
sort-field = time-added
bibtex-unicode = true

And here is the top of the profile if I run papis --profile prof list "year:2017" -a

         1161597 function calls (1152383 primitive calls) in 7.053 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    39679    0.806    0.000    0.838    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:99(forward)
   236018    0.455    0.000    0.460    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:87(peek)
     3514    0.435    0.000    1.433    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1270(scan_plain)
    61426    0.365    0.000    0.745    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:145(need_more_tokens)
    65489    0.315    0.000    0.315    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:279(stale_possible_simple_keys)
    36647    0.289    0.000    4.573    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:113(check_token)
     4878    0.221    0.000    0.866    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:273(parse_node)
     7799    0.217    0.000    0.676    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1185(scan_flow_scalar_non_spaces)
    32882    0.201    0.000    0.231    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:94(prefix)
     7942    0.199    0.000    3.666    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:156(fetch_more_tokens)
    22296    0.166    0.000    0.261    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:114(get_mark)
     7076    0.129    0.000    0.322    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1311(scan_plain_spaces)
    57547    0.119    0.000    0.119    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:264(next_possible_simple_key)
  4878/50    0.119    0.000    5.895    0.118 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:63(compose_node)
    15294    0.118    0.000    5.484    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:94(check_event)
     7942    0.117    0.000    0.283    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:752(scan_to_next_token)
     7379    0.115    0.000    0.283    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1228(scan_flow_scalar_spaces)
    85003    0.114    0.000    0.125    0.000 {built-in method builtins.isinstance}
      420    0.098    0.000    1.116    0.003 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1134(scan_flow_scalar)
    22296    0.096    0.000    0.096    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/error.py:6(__init__)
    10794    0.082    0.000    0.234    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:135(get_token)
     2028    0.075    0.000    0.175    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:545(fetch_value)
     3934    0.075    0.000    0.201    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:88(compose_scalar_node)
85761/85001    0.072    0.000    0.074    0.000 {built-in method builtins.len}
     4878    0.067    0.000    0.131    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/constructor.py:59(construct_object)
       79    0.066    0.001    0.066    0.001 {built-in method marshal.loads}
     4878    0.066    0.000    0.105    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/resolver.py:143(resolve)
   750/50    0.062    0.000    5.879    0.118 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:117(compose_mapping_node)
    58580    0.061    0.000    0.061    0.000 {method 'append' of 'list' objects}
  551/164    0.061    0.000    0.163    0.001 /nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/sre_parse.py:469(_parse)
     4370    0.059    0.000    0.112    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:295(save_possible_simple_key)
  463/400    0.058    0.000    0.094    0.000 {built-in method builtins.__build_class__}
     9701    0.054    0.000    0.109    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1416(scan_line_break)
     3514    0.043    0.000    1.575    0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:668(fetch_plain)
     1739    0.039    0.000    0.941    0.001 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:427(parse_block_mapping_key)

Basically all the time is spent in PyYaml which is where I came to my conclusions from.

To contrast, I'm very surprised by your timing results! It does indeed suggest that the configuration I have is the root of the problem here.

alejandrogallo commented 3 years ago

ok! it does not make any sense that pyyaml is being called, since it should be only called at indexing time. I tested also with whoosh and I'm unable to reproduce your timings, it's weird. But do not despair, we'll get to the bottom of this!

I have two hunches

Rights

tell me the rights of your cache directory, i.e.

ls -Ral ~/.cache/papis

what it would make sense is that papis has no writing rights in ~/.cache/papis and it's not writing the cache, so that the files never exist and it indexes every time.

Papis backend

Please try also with the papis backend,

[papers]
database-backend = papis

and report your experience.

j-c-w commented 3 years ago

ls -Ral ~/.cache/papis produces

total 4
drwxr-xr-x  3 jackson users  3 Mar  4  2020 .
drwxr-xr-x 28 jackson users 34 Jan 21 12:29 ..
drwxr-xr-x  3 jackson users  3 Mar  4  2020 database

/home/jackson/.cache/papis/database:
total 3
drwxr-xr-x 3 jackson users 3 Mar  4  2020 .
drwxr-xr-x 3 jackson users 3 Mar  4  2020 ..
drwxr-xr-x 5 jackson users 5 Feb 16 10:55 whoosh

/home/jackson/.cache/papis/database/whoosh:
total 6
drwxr-xr-x 5 jackson users 5 Feb 16 10:55 .
drwxr-xr-x 3 jackson users 3 Mar  4  2020 ..
drwxr-xr-x 2 jackson users 7 Feb 21 18:59 fba3ecf302b9ff695b11a01bb324e355-papers

/home/jackson/.cache/papis/database/whoosh/fba3ecf302b9ff695b11a01bb324e355-papers:
total 2008
drwxr-xr-x 2 jackson users       7 Feb 21 18:59 .
drwxr-xr-x 5 jackson users       5 Feb 16 10:55 ..
-rw-r--r-- 1 jackson users   68551 Feb 21 16:41 MAIN_332saomtlpw25ued.seg
-rw-r--r-- 1 jackson users    3927 Feb 21 18:59 _MAIN_48.toc
-rw-r--r-- 1 jackson users    8021 Feb 21 18:59 MAIN_5b2qho7gxppac0dw.seg
-rw-r--r-- 1 jackson users 1865720 Feb 16 10:55 MAIN_hkjolg3oapl3oa1d.seg
-rwxr-xr-x 1 jackson users       0 Feb 16 10:55 MAIN_WRITELOCK

Which seems OK to me? (Perhaps I am missing something?)

Setting database-backend = papis drastically improves performance though --- that is quite interesting. Runtimes are 1.38s, 1.33s and 1.25s; still much slower than grep, but manageable.

I'm happy to call this addressed for the time being from my perspective, although there is something funny going on with whoosh here that likely should be fixed. Provided there's nothing obviously wrong with my permissions, I'll have a dig through the whoosh calls at some point soon and see what I find.

avonmoll commented 3 years ago

I have almost the exact same experience and same relevant configuration and output as @j-c-w for all of these diagnostic steps. I think that the whoosh DB is being created but not accessed later on. The times for use-cache false are pretty similar. Below is a record of me deleting any existing cache and testing/timing a few relevant commands. The first call to papis generates the index. The next two are making same call but toggling the use-cache flag. The last 2 calls are using the papis DB.

$ rm -rf ~/.cache/papis
$ time papis list -a | wc -l
     394
    papis list -a  5.84s user 0.33s system 173% cpu 3.564 total
    wc -l  0.00s user 0.00s system 0% cpu 3.564 total
$ time papis list -a | wc -l
     394
    papis list -a  2.42s user 0.15s system 98% cpu 2.611 total
    wc -l  0.00s user 0.00s system 0% cpu 2.609 total
$ time papis -s use-cache false list -a | wc -l
     394
    papis -s use-cache false list -a  2.41s user 0.15s system 98% cpu 2.596 total
    wc -l  0.00s user 0.00s system 0% cpu 2.595 total

# change backend to papis
$ time papis list -a | wc -l
    INFO:db:cache:Indexing library, this might take a while
    WARNING:cli:list:No documents retrieved
       0
    papis list -a  3.79s user 0.40s system 306% cpu 1.368 total
    wc -l  0.00s user 0.00s system 0% cpu 1.367 total
$ time papis list -a | wc -l
    WARNING:cli:list:No documents retrieved
       0
    papis list -a  0.64s user 0.20s system 97% cpu 0.869 total
    wc -l  0.00s user 0.00s system 0% cpu 0.868 total
$ papis config use-cache
    True
alejandrogallo commented 3 years ago

@j-c-w can you paste here a debug output of a search?

PAPIS_DEBUG=1 papis -v list asdfasdfasdf

I would like to also discard the multiprocessing problem, so I'll be cleaning this part of papis this week so that we can also test the binary running serially or multicore like this. It might be also related to this. However I don't think it's not using the whoosh database since it's matching documents, and for this to happen the whoosh backend ought to at least work.

j-c-w commented 3 years ago

OK. So here's something very odd. I changed to the papis backend for a while, and then back to whoosh. I may have also deleted the database, I can't remember. Now that I'm back in whoosh, my timings are fast and manageable.

Running the same things as before time papis list -a 'year:2017' now results in times of approx 1.60s.

Looking at the profile, it's still mostly yaml, but obviously much less overhead. Perhaps there is some issue with the database being corrupted that results in re-loading?

In any case, here's the (probably much less useful) output of PAPIS_DEBUG=1 papis -v list asdfasdfasdf:

7-DEBUG:config:importing
389-DEBUG:config:Creating configuration
389-DEBUG:config:Getting config file /home/jackson/.config/papis/config
389-DEBUG:Configuration:Reading configuration from /home/jackson/.config/papis/config
392-DEBUG:utils:importing
393-DEBUG:api:importing
395-DEBUG:papis:plugin:creating manager for papis.command
400-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.add:cli')
531-DEBUG:papis:plugin:creating manager for papis.importer
531-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:Importer')
547-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:Importer')
552-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:FromCrossrefImporter')
568-DEBUG:crossref:importing
569-DEBUG:stevedore.extension:found extension EntryPoint.parse('doi = papis.crossref:Importer')
574-DEBUG:stevedore.extension:found extension EntryPoint.parse('folder = papis.commands.add:FromFolderImporter')
579-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.add:FromLibImporter')
585-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2arxivid = papis.arxiv:ArxividFromPdfImporter')
590-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2doi = papis.crossref:DoiFromPdfImporter')
595-DEBUG:stevedore.extension:found extension EntryPoint.parse('pmid = papis.pubmed:Importer')
600-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:Importer')
631-DEBUG:stevedore.extension:found extension EntryPoint.parse('addto = papis.commands.addto:cli')
636-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.commands.bibtex:cli')
642-DEBUG:papis:plugin:creating manager for papis.exporter
642-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:exporter')
648-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:exporter')
654-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:exporter')
661-DEBUG:papis:plugin:creating manager for papis.explorer
661-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.explore:add')
667-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:explorer')
673-DEBUG:stevedore.extension:found extension EntryPoint.parse('base = papis.base:explorer')
679-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:explorer')
685-DEBUG:stevedore.extension:found extension EntryPoint.parse('citations = papis.commands.explore:citations')
691-DEBUG:stevedore.extension:found extension EntryPoint.parse('cmd = papis.commands.explore:cmd')
696-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:explorer')
701-DEBUG:stevedore.extension:found extension EntryPoint.parse('dissemin = papis.dissemin:explorer')
707-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:explorer')
712-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbn = papis.isbn:explorer')
733-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbnplus = papis.isbnplus:explorer')
739-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:explorer')
744-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.explore:lib')
749-DEBUG:stevedore.extension:found extension EntryPoint.parse('pick = papis.commands.explore:pick')
754-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:explorer')
763-DEBUG:stevedore.extension:found extension EntryPoint.parse('browse = papis.commands.browse:cli')
769-DEBUG:stevedore.extension:found extension EntryPoint.parse('config = papis.commands.config:cli')
774-DEBUG:stevedore.extension:found extension EntryPoint.parse('edit = papis.commands.edit:cli')
780-DEBUG:stevedore.extension:found extension EntryPoint.parse('explore = papis.commands.explore:cli')
786-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:cli')
791-DEBUG:stevedore.extension:found extension EntryPoint.parse('git = papis.commands.git:cli')
800-DEBUG:stevedore.extension:found extension EntryPoint.parse('list = papis.commands.list:cli')
806-DEBUG:stevedore.extension:found extension EntryPoint.parse('mv = papis.commands.mv:cli')
813-DEBUG:stevedore.extension:found extension EntryPoint.parse('open = papis.commands.open:cli')
819-DEBUG:stevedore.extension:found extension EntryPoint.parse('rename = papis.commands.rename:cli')
826-DEBUG:stevedore.extension:found extension EntryPoint.parse('rm = papis.commands.rm:cli')
833-DEBUG:stevedore.extension:found extension EntryPoint.parse('run = papis.commands.run:cli')
838-DEBUG:stevedore.extension:found extension EntryPoint.parse('update = papis.commands.update:cli')
929-DEBUG:db:whoosh:Initialized index found for library
929-DEBUG:db:whoosh:Query string asdfasdfasdf
945-DEBUG:db:whoosh:<Top 0 Results for Or([Term('title', 'asdfasdfasdf'), Term('author', 'asdfasdfasdf'), Term('tags', 'asdfasdfasdf')]) runtime=0.0009825390006881207>
946-DEBUG:document:sorting 0 documents
946-WARNING:cli:list:No documents retrieved
947-DEBUG:papis:plugin:creating manager for papis.picker
947-DEBUG:stevedore.extension:found extension EntryPoint.parse('papis = papis.tui.picker:Picker')

But I note that this was not a slow run by any means (0.82s). Perhaps @avonmoll can run the same test if his library is still slow? I expect papis to slow down on me here at some point, it usually does. I'll run that again when it is going slowly.

avonmoll commented 3 years ago

I have the same output as @j-c-w, but my whoosh times are still ~2.4s when I do a title:* query, which grabs all ~400 documents. I re-ran the profiler and get output similar to @j-c-w's output, which shows almost all of the time is spent in YAML-related functions. I note that whoosh times are smaller when the query is more specific. Do I understand correctly that whoosh queries the DB and that once matches are found each match's YAML is then read from disk in order to populate the info displayed in the picker?

alexfikl commented 3 years ago

@avonmoll That seems to match my experience as well. Doing a specific query with the year gives

$ time papis list -a "year:2017" | wc -l
41
papis list -a "year:2017"  0.59s user 0.03s system 99% cpu 0.618 total,

so reasonably swift. However, doing a catch all query for the title is very slow

$ time papis list -a "title:*" | wc -l
1330
papis list -a "title:*"  9.00s user 0.08s system 99% cpu 9.092 total

Like you said, after doing the whoosh query, it starts loading yaml files in databases/whoosh.py::query to get the full info. Maybe papis can store enough information in the db to construct a dummy document and only read the full info.yaml when an exact match is selected in the picker?

alejandrogallo commented 3 years ago

Ok, I more or less have sorted it out, I want to ask you guys what do you prefer, since I do not use whoosh.

So the problem is indeed the conversion to yaml, I've made several tests and now I reproduce your issues.

When we do

papis list -a title:*

in my case then whoosh does

documents = [
    papis.document.from_folder(r.get(Database.get_id_key()))
    for r in results]

as you all have pointed out. This means that yaml will be parsed, in my case for 900 documents, which is absolutely not the way to go, since yaml parsing in python is regrettably slow.

Remember I have 950 documents in my library, so these my timings for papis list -a title:* using whoosh

Implementation Time (s)
v0.11.1 10.6 ± 0.320
papis.utils.parmap 3.545 ± 0.046
caching whoosh entry 0.6688 ± 0.0117

Caching the whoosh entry means that what will be shown in the picker has only the information that is used for indexing the whoosh documents, so the whoosh fields, which I guess is enough, and judging from the timings it is well worth it.

papis.utils.parmap is simply parsing paralelly the yaml files, which in my opinion is just bound to be very slow in HDDs and it's not the way to go to touch yaml parsing.

I let you whoosh users decide which one you prefer. Maybe I was not clear in some points so feel free to ask!

Appendix

Here you see the results of the benchmarks taken with hyperfine

alejandrogallo commented 3 years ago

However on a second note, this is also not reall a solution because then in commands like

papis export --format bibtex

then only the cached information will be exported, and doing a loading of the document for whoosh before exporting is not really general... I will try to cache the dictionary data at indexing time in whoosh just like in the papis database, and then we can retrieve deserialize it when getting the matches, anyways we now know where te problem lies.

j-c-w commented 3 years ago

OK, the poor performance has returned, so I'll run the tests you asked for Alejandro. That said, I think the idea you talk about above, with caching the dictionary data is great, and I do think that it solves the larger (non-buggy) performance problem.

Anyway:

time papis list -a 'year:2017' | wc -l Results in: 50 papis list -a 'year:2017' 7.22s user 0.29s system 99% cpu 7.515 total wc -l 0.00s user 0.00s system 0% cpu 7.515 total

So we're back to being very slow again. Running with PAPIS_DEBUG:

time PAPIS_DEBUG=1 papis -v list asdfasdfasdf Results in: 6.72s user 0.50s system 99% cpu 7.287 total (so a bit faster, but still /very/ slow for a query that matches no items).

64-DEBUG:config:importing
3288-DEBUG:config:Creating configuration
3291-DEBUG:config:Getting config file /home/jackson/.config/papis/config
3292-DEBUG:Configuration:Reading configuration from /home/jackson/.config/papis/config
3306-DEBUG:utils:importing
3311-DEBUG:api:importing
3321-DEBUG:papis:plugin:creating manager for papis.command
3343-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.add:cli')
4051-DEBUG:papis:plugin:creating manager for papis.importer
4052-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:Importer')
4117-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:Importer')
4137-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:FromCrossrefImporter')
4206-DEBUG:crossref:importing
4209-DEBUG:stevedore.extension:found extension EntryPoint.parse('doi = papis.crossref:Importer')
4230-DEBUG:stevedore.extension:found extension EntryPoint.parse('folder = papis.commands.add:FromFolderImporter')
4251-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.add:FromLibImporter')
4274-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2arxivid = papis.arxiv:ArxividFromPdfImporter')
4296-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2doi = papis.crossref:DoiFromPdfImporter')
4317-DEBUG:stevedore.extension:found extension EntryPoint.parse('pmid = papis.pubmed:Importer')
4339-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:Importer')
4484-DEBUG:stevedore.extension:found extension EntryPoint.parse('addto = papis.commands.addto:cli')
4513-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.commands.bibtex:cli')
4544-DEBUG:papis:plugin:creating manager for papis.exporter
4545-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:exporter')
4570-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:exporter')
4597-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:exporter')
4627-DEBUG:papis:plugin:creating manager for papis.explorer
4628-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.explore:add')
4653-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:explorer')
4680-DEBUG:stevedore.extension:found extension EntryPoint.parse('base = papis.base:explorer')
4708-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:explorer')
4733-DEBUG:stevedore.extension:found extension EntryPoint.parse('citations = papis.commands.explore:citations')
4759-DEBUG:stevedore.extension:found extension EntryPoint.parse('cmd = papis.commands.explore:cmd')
4784-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:explorer')
4809-DEBUG:stevedore.extension:found extension EntryPoint.parse('dissemin = papis.dissemin:explorer')
4837-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:explorer')
4862-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbn = papis.isbn:explorer')
4966-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbnplus = papis.isbnplus:explorer')
4995-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:explorer')
5022-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.explore:lib')
5049-DEBUG:stevedore.extension:found extension EntryPoint.parse('pick = papis.commands.explore:pick')
5070-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:explorer')
5103-DEBUG:stevedore.extension:found extension EntryPoint.parse('browse = papis.commands.browse:cli')
5124-DEBUG:stevedore.extension:found extension EntryPoint.parse('config = papis.commands.config:cli')
5147-DEBUG:stevedore.extension:found extension EntryPoint.parse('edit = papis.commands.edit:cli')
5169-DEBUG:stevedore.extension:found extension EntryPoint.parse('explore = papis.commands.explore:cli')
5194-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:cli')
5229-DEBUG:stevedore.extension:found extension EntryPoint.parse('git = papis.commands.git:cli')
5281-DEBUG:stevedore.extension:found extension EntryPoint.parse('list = papis.commands.list:cli')
5327-DEBUG:stevedore.extension:found extension EntryPoint.parse('mv = papis.commands.mv:cli')
5380-DEBUG:stevedore.extension:found extension EntryPoint.parse('open = papis.commands.open:cli')
5430-DEBUG:stevedore.extension:found extension EntryPoint.parse('rename = papis.commands.rename:cli')
5482-DEBUG:stevedore.extension:found extension EntryPoint.parse('rm = papis.commands.rm:cli')
5533-DEBUG:stevedore.extension:found extension EntryPoint.parse('run = papis.commands.run:cli')
5576-DEBUG:stevedore.extension:found extension EntryPoint.parse('update = papis.commands.update:cli')
6202-DEBUG:db:whoosh:Initialized index found for library
6203-DEBUG:db:whoosh:Query string asdfasdfasdf
6293-DEBUG:db:whoosh:<Top 0 Results for Or([Term('title', 'asdfasdfasdf'), Term('author', 'asdfasdfasdf'), Term('tags', 'asdfasdfasdf')]) runtime=0.005870132998097688>
6296-DEBUG:document:sorting 0 documents
6297-WARNING:cli:list:No documents retrieved
6298-DEBUG:papis:plugin:creating manager for papis.picker
6299-DEBUG:stevedore.extension:found extension EntryPoint.parse('papis = papis.tui.picker:Picker')

Anyway, I'll leave it slow for now and let me know if you want me to run any more tests.

I should add that I do often change the yaml files under papis's feet, I have a script for setting the ref field --- but these times are not from a run directly after upating the yaml files.

Jackson

alejandrogallo commented 3 years ago

Yes for me it's quite clear what is the problem, but I need some time to implement the caching cleanly, and I don't know when I'll get around to do this, I can push the immediate fix of the timings I posted, the one where only the data stored in the whoosh database will be retrieved, and maybe in two weeks time I can have the other part ready, and then we can do a v0.12 release