Open j-c-w opened 3 years ago
Hi!
Thank you for this comment.
Quite frankly I'm a little bit puzzled by your timings, for me, doing
time papis list -a | wc -l
gets me
962
real 0m0.411s
user 0m0.376s
sys 0m0.040s
and doing a filtering
time papis list -a year:1981 | wc -l
ends in
6
real 0m0.538s
user 0m0.492s
sys 0m0.134s
I do not wish to jump into hasty conclusions, but it kind of seems like you're not using the cache
time papis -s use-cache false list -a year:1981 | wc -l
results in
6
real 0m5.094s
user 0m31.306s
sys 0m0.290s
which would be coherent with your timings (kind of). I'd like to get to the bottom of this.
If needed be, I think a (rip)grep backend would be doable and useful. However I haven't come across the need for this.
Could you tell me what
papis config use-cache
gives?
So that you understand exactly how the papis backend works, it's almost deceptively simple, in principle I create the list of all documents which gets created by reading every yaml file in every document. The parsing of the yaml documents is slow, but this is only done at indexing time. When this list is there I serialize it and store it in
~/.cache/papis/database/e700d592d0806fcab1fc70d2b6beac96-papers
which is just a pickle
serialization of the list. Later, reading
the documents is just done by deserializing this file, and it runs fast.
That's why I can't understand your timings, it would only make sense
if your use-cache
is set to False
, however I might be wrong
and it would be interesting if other people share their experience.
Cheers!
Hi Alejandro,
Running
papis config use-cache
results in True
.
Here's my config:
[papers]
dir = ~/Documents/papers
[settings]
default-library = papers
database-backend = whoosh
whoosh-schema-fields = ['year', 'author', 'publisher', 'journal', 'pages',
'group', 'ref']
opentool = qpdfview --unique
opentool-txt = gvim
editor = vim
sort-field = time-added
bibtex-unicode = true
And here is the top of the profile if I run papis --profile prof list "year:2017" -a
1161597 function calls (1152383 primitive calls) in 7.053 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
39679 0.806 0.000 0.838 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:99(forward)
236018 0.455 0.000 0.460 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:87(peek)
3514 0.435 0.000 1.433 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1270(scan_plain)
61426 0.365 0.000 0.745 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:145(need_more_tokens)
65489 0.315 0.000 0.315 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:279(stale_possible_simple_keys)
36647 0.289 0.000 4.573 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:113(check_token)
4878 0.221 0.000 0.866 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:273(parse_node)
7799 0.217 0.000 0.676 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1185(scan_flow_scalar_non_spaces)
32882 0.201 0.000 0.231 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:94(prefix)
7942 0.199 0.000 3.666 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:156(fetch_more_tokens)
22296 0.166 0.000 0.261 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/reader.py:114(get_mark)
7076 0.129 0.000 0.322 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1311(scan_plain_spaces)
57547 0.119 0.000 0.119 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:264(next_possible_simple_key)
4878/50 0.119 0.000 5.895 0.118 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:63(compose_node)
15294 0.118 0.000 5.484 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:94(check_event)
7942 0.117 0.000 0.283 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:752(scan_to_next_token)
7379 0.115 0.000 0.283 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1228(scan_flow_scalar_spaces)
85003 0.114 0.000 0.125 0.000 {built-in method builtins.isinstance}
420 0.098 0.000 1.116 0.003 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1134(scan_flow_scalar)
22296 0.096 0.000 0.096 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/error.py:6(__init__)
10794 0.082 0.000 0.234 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:135(get_token)
2028 0.075 0.000 0.175 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:545(fetch_value)
3934 0.075 0.000 0.201 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:88(compose_scalar_node)
85761/85001 0.072 0.000 0.074 0.000 {built-in method builtins.len}
4878 0.067 0.000 0.131 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/constructor.py:59(construct_object)
79 0.066 0.001 0.066 0.001 {built-in method marshal.loads}
4878 0.066 0.000 0.105 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/resolver.py:143(resolve)
750/50 0.062 0.000 5.879 0.118 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/composer.py:117(compose_mapping_node)
58580 0.061 0.000 0.061 0.000 {method 'append' of 'list' objects}
551/164 0.061 0.000 0.163 0.001 /nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/sre_parse.py:469(_parse)
4370 0.059 0.000 0.112 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:295(save_possible_simple_key)
463/400 0.058 0.000 0.094 0.000 {built-in method builtins.__build_class__}
9701 0.054 0.000 0.109 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:1416(scan_line_break)
3514 0.043 0.000 1.575 0.000 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/scanner.py:668(fetch_plain)
1739 0.039 0.000 0.941 0.001 /nix/store/m1lif1jpby7zsy3bgjqp8cjv5kiflydd-python3.7-PyYAML-5.2/lib/python3.7/site-packages/yaml/parser.py:427(parse_block_mapping_key)
Basically all the time is spent in PyYaml which is where I came to my conclusions from.
To contrast, I'm very surprised by your timing results! It does indeed suggest that the configuration I have is the root of the problem here.
ok! it does not make any sense that pyyaml is being called, since it should be only called at indexing time. I tested also with whoosh and I'm unable to reproduce your timings, it's weird. But do not despair, we'll get to the bottom of this!
I have two hunches
tell me the rights of your cache directory, i.e.
ls -Ral ~/.cache/papis
what it would make sense is that papis has no writing rights in ~/.cache/papis
and it's not writing the cache, so that the files never exist and it indexes every time.
Please try also with the papis
backend,
[papers]
database-backend = papis
and report your experience.
ls -Ral ~/.cache/papis
produces
total 4
drwxr-xr-x 3 jackson users 3 Mar 4 2020 .
drwxr-xr-x 28 jackson users 34 Jan 21 12:29 ..
drwxr-xr-x 3 jackson users 3 Mar 4 2020 database
/home/jackson/.cache/papis/database:
total 3
drwxr-xr-x 3 jackson users 3 Mar 4 2020 .
drwxr-xr-x 3 jackson users 3 Mar 4 2020 ..
drwxr-xr-x 5 jackson users 5 Feb 16 10:55 whoosh
/home/jackson/.cache/papis/database/whoosh:
total 6
drwxr-xr-x 5 jackson users 5 Feb 16 10:55 .
drwxr-xr-x 3 jackson users 3 Mar 4 2020 ..
drwxr-xr-x 2 jackson users 7 Feb 21 18:59 fba3ecf302b9ff695b11a01bb324e355-papers
/home/jackson/.cache/papis/database/whoosh/fba3ecf302b9ff695b11a01bb324e355-papers:
total 2008
drwxr-xr-x 2 jackson users 7 Feb 21 18:59 .
drwxr-xr-x 5 jackson users 5 Feb 16 10:55 ..
-rw-r--r-- 1 jackson users 68551 Feb 21 16:41 MAIN_332saomtlpw25ued.seg
-rw-r--r-- 1 jackson users 3927 Feb 21 18:59 _MAIN_48.toc
-rw-r--r-- 1 jackson users 8021 Feb 21 18:59 MAIN_5b2qho7gxppac0dw.seg
-rw-r--r-- 1 jackson users 1865720 Feb 16 10:55 MAIN_hkjolg3oapl3oa1d.seg
-rwxr-xr-x 1 jackson users 0 Feb 16 10:55 MAIN_WRITELOCK
Which seems OK to me? (Perhaps I am missing something?)
Setting database-backend = papis
drastically improves performance though --- that is quite interesting. Runtimes are 1.38s, 1.33s and 1.25s; still much slower than grep, but manageable.
I'm happy to call this addressed for the time being from my perspective, although there is something funny going on with whoosh here that likely should be fixed. Provided there's nothing obviously wrong with my permissions, I'll have a dig through the whoosh calls at some point soon and see what I find.
I have almost the exact same experience and same relevant configuration and output as @j-c-w for all of these diagnostic steps. I think that the whoosh DB is being created but not accessed later on. The times for use-cache false
are pretty similar. Below is a record of me deleting any existing cache and testing/timing a few relevant commands. The first call to papis generates the index. The next two are making same call but toggling the use-cache
flag. The last 2 calls are using the papis
DB.
$ rm -rf ~/.cache/papis
$ time papis list -a | wc -l
394
papis list -a 5.84s user 0.33s system 173% cpu 3.564 total
wc -l 0.00s user 0.00s system 0% cpu 3.564 total
$ time papis list -a | wc -l
394
papis list -a 2.42s user 0.15s system 98% cpu 2.611 total
wc -l 0.00s user 0.00s system 0% cpu 2.609 total
$ time papis -s use-cache false list -a | wc -l
394
papis -s use-cache false list -a 2.41s user 0.15s system 98% cpu 2.596 total
wc -l 0.00s user 0.00s system 0% cpu 2.595 total
# change backend to papis
$ time papis list -a | wc -l
INFO:db:cache:Indexing library, this might take a while
WARNING:cli:list:No documents retrieved
0
papis list -a 3.79s user 0.40s system 306% cpu 1.368 total
wc -l 0.00s user 0.00s system 0% cpu 1.367 total
$ time papis list -a | wc -l
WARNING:cli:list:No documents retrieved
0
papis list -a 0.64s user 0.20s system 97% cpu 0.869 total
wc -l 0.00s user 0.00s system 0% cpu 0.868 total
$ papis config use-cache
True
@j-c-w can you paste here a debug output of a search?
PAPIS_DEBUG=1 papis -v list asdfasdfasdf
I would like to also discard the multiprocessing problem, so I'll be cleaning this part of papis
this week so that we can also test the binary running serially or multicore like this. It might be also related to this.
However I don't think it's not using the whoosh
database since it's matching documents, and for this to happen
the whoosh
backend ought to at least work.
OK. So here's something very odd. I changed to the papis backend for a while, and then back to whoosh. I may have also deleted the database, I can't remember. Now that I'm back in whoosh, my timings are fast and manageable.
Running the same things as before time papis list -a 'year:2017'
now results in times of approx 1.60s.
Looking at the profile, it's still mostly yaml, but obviously much less overhead. Perhaps there is some issue with the database being corrupted that results in re-loading?
In any case, here's the (probably much less useful) output of PAPIS_DEBUG=1 papis -v list asdfasdfasdf
:
7-DEBUG:config:importing
389-DEBUG:config:Creating configuration
389-DEBUG:config:Getting config file /home/jackson/.config/papis/config
389-DEBUG:Configuration:Reading configuration from /home/jackson/.config/papis/config
392-DEBUG:utils:importing
393-DEBUG:api:importing
395-DEBUG:papis:plugin:creating manager for papis.command
400-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.add:cli')
531-DEBUG:papis:plugin:creating manager for papis.importer
531-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:Importer')
547-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:Importer')
552-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:FromCrossrefImporter')
568-DEBUG:crossref:importing
569-DEBUG:stevedore.extension:found extension EntryPoint.parse('doi = papis.crossref:Importer')
574-DEBUG:stevedore.extension:found extension EntryPoint.parse('folder = papis.commands.add:FromFolderImporter')
579-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.add:FromLibImporter')
585-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2arxivid = papis.arxiv:ArxividFromPdfImporter')
590-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2doi = papis.crossref:DoiFromPdfImporter')
595-DEBUG:stevedore.extension:found extension EntryPoint.parse('pmid = papis.pubmed:Importer')
600-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:Importer')
631-DEBUG:stevedore.extension:found extension EntryPoint.parse('addto = papis.commands.addto:cli')
636-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.commands.bibtex:cli')
642-DEBUG:papis:plugin:creating manager for papis.exporter
642-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:exporter')
648-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:exporter')
654-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:exporter')
661-DEBUG:papis:plugin:creating manager for papis.explorer
661-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.explore:add')
667-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:explorer')
673-DEBUG:stevedore.extension:found extension EntryPoint.parse('base = papis.base:explorer')
679-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:explorer')
685-DEBUG:stevedore.extension:found extension EntryPoint.parse('citations = papis.commands.explore:citations')
691-DEBUG:stevedore.extension:found extension EntryPoint.parse('cmd = papis.commands.explore:cmd')
696-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:explorer')
701-DEBUG:stevedore.extension:found extension EntryPoint.parse('dissemin = papis.dissemin:explorer')
707-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:explorer')
712-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbn = papis.isbn:explorer')
733-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbnplus = papis.isbnplus:explorer')
739-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:explorer')
744-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.explore:lib')
749-DEBUG:stevedore.extension:found extension EntryPoint.parse('pick = papis.commands.explore:pick')
754-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:explorer')
763-DEBUG:stevedore.extension:found extension EntryPoint.parse('browse = papis.commands.browse:cli')
769-DEBUG:stevedore.extension:found extension EntryPoint.parse('config = papis.commands.config:cli')
774-DEBUG:stevedore.extension:found extension EntryPoint.parse('edit = papis.commands.edit:cli')
780-DEBUG:stevedore.extension:found extension EntryPoint.parse('explore = papis.commands.explore:cli')
786-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:cli')
791-DEBUG:stevedore.extension:found extension EntryPoint.parse('git = papis.commands.git:cli')
800-DEBUG:stevedore.extension:found extension EntryPoint.parse('list = papis.commands.list:cli')
806-DEBUG:stevedore.extension:found extension EntryPoint.parse('mv = papis.commands.mv:cli')
813-DEBUG:stevedore.extension:found extension EntryPoint.parse('open = papis.commands.open:cli')
819-DEBUG:stevedore.extension:found extension EntryPoint.parse('rename = papis.commands.rename:cli')
826-DEBUG:stevedore.extension:found extension EntryPoint.parse('rm = papis.commands.rm:cli')
833-DEBUG:stevedore.extension:found extension EntryPoint.parse('run = papis.commands.run:cli')
838-DEBUG:stevedore.extension:found extension EntryPoint.parse('update = papis.commands.update:cli')
929-DEBUG:db:whoosh:Initialized index found for library
929-DEBUG:db:whoosh:Query string asdfasdfasdf
945-DEBUG:db:whoosh:<Top 0 Results for Or([Term('title', 'asdfasdfasdf'), Term('author', 'asdfasdfasdf'), Term('tags', 'asdfasdfasdf')]) runtime=0.0009825390006881207>
946-DEBUG:document:sorting 0 documents
946-WARNING:cli:list:No documents retrieved
947-DEBUG:papis:plugin:creating manager for papis.picker
947-DEBUG:stevedore.extension:found extension EntryPoint.parse('papis = papis.tui.picker:Picker')
But I note that this was not a slow run by any means (0.82s). Perhaps @avonmoll can run the same test if his library is still slow? I expect papis to slow down on me here at some point, it usually does. I'll run that again when it is going slowly.
I have the same output as @j-c-w, but my whoosh times are still ~2.4s when I do a title:*
query, which grabs all ~400 documents. I re-ran the profiler and get output similar to @j-c-w's output, which shows almost all of the time is spent in YAML-related functions. I note that whoosh times are smaller when the query is more specific. Do I understand correctly that whoosh queries the DB and that once matches are found each match's YAML is then read from disk in order to populate the info displayed in the picker?
@avonmoll That seems to match my experience as well. Doing a specific query with the year gives
$ time papis list -a "year:2017" | wc -l
41
papis list -a "year:2017" 0.59s user 0.03s system 99% cpu 0.618 total,
so reasonably swift. However, doing a catch all query for the title is very slow
$ time papis list -a "title:*" | wc -l
1330
papis list -a "title:*" 9.00s user 0.08s system 99% cpu 9.092 total
Like you said, after doing the whoosh query, it starts loading yaml files in databases/whoosh.py::query to get the full info. Maybe papis can store enough information in the db to construct a dummy document and only read the full info.yaml
when an exact match is selected in the picker?
Ok, I more or less have sorted it out, I want to ask you guys what do you prefer, since I do not use whoosh.
So the problem is indeed the conversion to yaml, I've made several tests and now I reproduce your issues.
When we do
papis list -a title:*
in my case then whoosh does
documents = [
papis.document.from_folder(r.get(Database.get_id_key()))
for r in results]
as you all have pointed out. This means that yaml will be parsed, in my case for 900 documents, which is absolutely not the way to go, since yaml parsing in python is regrettably slow.
Remember I have 950 documents in my library, so
these my timings for papis list -a title:*
using whoosh
Implementation | Time (s) |
---|---|
v0.11.1 |
10.6 ± 0.320 |
papis.utils.parmap |
3.545 ± 0.046 |
caching whoosh entry | 0.6688 ± 0.0117 |
Caching the whoosh entry means that what will be shown in the picker has only the information that is used for indexing the whoosh documents, so the whoosh fields, which I guess is enough, and judging from the timings it is well worth it.
papis.utils.parmap
is simply parsing paralelly the yaml files,
which in my opinion is just bound to be very slow in
HDDs and it's not the way to go to touch yaml parsing.
I let you whoosh users decide which one you prefer. Maybe I was not clear in some points so feel free to ask!
Here you see the results of the benchmarks taken with hyperfine
v0.11.1
with the implementation
documents = [
papis.document.from_folder(r.get(Database.get_id_key()))
for r in results]
and the timings
Benchmark #1: papis list -a author:* | wc -l
Time (mean ± σ): 10.606 s ± 0.320 s [User: 10.551 s, System: 0.059 s]
Range (min … max): 10.225 s … 11.059 s 10 runs
papis.utils.parmap
with the implementation
folders = [r.get(Database.get_id_key()) for r in results]
documents = papis.utils.parmap(papis.document.from_folder, folders)
and the timings
Benchmark #1: papis list -a author:* | wc -l
Time (mean ± σ): 3.545 s ± 0.046 s [User: 18.936 s, System: 0.148 s]
Range (min … max): 3.458 s … 3.576 s 10 runs
caching the whoosh entry with the implementation
documents = [
papis.document.from_folder_data_folder_no_autoload(
r.get(Database.get_id_key()), r)
for r in results]
and the timings
Benchmark #1: papis list -a author:* | wc -l
Time (mean ± σ): 668.8 ms ± 11.7 ms [User: 632.8 ms, System: 40.7 ms]
Range (min … max): 652.1 ms … 689.3 ms 10 runs
However on a second note, this is also not reall a solution because then in commands like
papis export --format bibtex
then only the cached information will be exported, and doing a loading of the document for whoosh before exporting
is not really general... I will try to cache the dictionary data at indexing time in whoosh just like in the papis
database, and then we can retrieve deserialize it when getting the matches, anyways we now know where te problem lies.
OK, the poor performance has returned, so I'll run the tests you asked for Alejandro. That said, I think the idea you talk about above, with caching the dictionary data is great, and I do think that it solves the larger (non-buggy) performance problem.
Anyway:
time papis list -a 'year:2017' | wc -l
Results in: 50 papis list -a 'year:2017' 7.22s user 0.29s system 99% cpu 7.515 total wc -l 0.00s user 0.00s system 0% cpu 7.515 total
So we're back to being very slow again. Running with PAPIS_DEBUG:
time PAPIS_DEBUG=1 papis -v list asdfasdfasdf
Results in: 6.72s user 0.50s system 99% cpu 7.287 total (so a bit faster, but still /very/ slow for a query that matches no items).
64-DEBUG:config:importing
3288-DEBUG:config:Creating configuration
3291-DEBUG:config:Getting config file /home/jackson/.config/papis/config
3292-DEBUG:Configuration:Reading configuration from /home/jackson/.config/papis/config
3306-DEBUG:utils:importing
3311-DEBUG:api:importing
3321-DEBUG:papis:plugin:creating manager for papis.command
3343-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.add:cli')
4051-DEBUG:papis:plugin:creating manager for papis.importer
4052-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:Importer')
4117-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:Importer')
4137-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:FromCrossrefImporter')
4206-DEBUG:crossref:importing
4209-DEBUG:stevedore.extension:found extension EntryPoint.parse('doi = papis.crossref:Importer')
4230-DEBUG:stevedore.extension:found extension EntryPoint.parse('folder = papis.commands.add:FromFolderImporter')
4251-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.add:FromLibImporter')
4274-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2arxivid = papis.arxiv:ArxividFromPdfImporter')
4296-DEBUG:stevedore.extension:found extension EntryPoint.parse('pdf2doi = papis.crossref:DoiFromPdfImporter')
4317-DEBUG:stevedore.extension:found extension EntryPoint.parse('pmid = papis.pubmed:Importer')
4339-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:Importer')
4484-DEBUG:stevedore.extension:found extension EntryPoint.parse('addto = papis.commands.addto:cli')
4513-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.commands.bibtex:cli')
4544-DEBUG:papis:plugin:creating manager for papis.exporter
4545-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:exporter')
4570-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:exporter')
4597-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:exporter')
4627-DEBUG:papis:plugin:creating manager for papis.explorer
4628-DEBUG:stevedore.extension:found extension EntryPoint.parse('add = papis.commands.explore:add')
4653-DEBUG:stevedore.extension:found extension EntryPoint.parse('arxiv = papis.arxiv:explorer')
4680-DEBUG:stevedore.extension:found extension EntryPoint.parse('base = papis.base:explorer')
4708-DEBUG:stevedore.extension:found extension EntryPoint.parse('bibtex = papis.bibtex:explorer')
4733-DEBUG:stevedore.extension:found extension EntryPoint.parse('citations = papis.commands.explore:citations')
4759-DEBUG:stevedore.extension:found extension EntryPoint.parse('cmd = papis.commands.explore:cmd')
4784-DEBUG:stevedore.extension:found extension EntryPoint.parse('crossref = papis.crossref:explorer')
4809-DEBUG:stevedore.extension:found extension EntryPoint.parse('dissemin = papis.dissemin:explorer')
4837-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:explorer')
4862-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbn = papis.isbn:explorer')
4966-DEBUG:stevedore.extension:found extension EntryPoint.parse('isbnplus = papis.isbnplus:explorer')
4995-DEBUG:stevedore.extension:found extension EntryPoint.parse('json = papis.json:explorer')
5022-DEBUG:stevedore.extension:found extension EntryPoint.parse('lib = papis.commands.explore:lib')
5049-DEBUG:stevedore.extension:found extension EntryPoint.parse('pick = papis.commands.explore:pick')
5070-DEBUG:stevedore.extension:found extension EntryPoint.parse('yaml = papis.yaml:explorer')
5103-DEBUG:stevedore.extension:found extension EntryPoint.parse('browse = papis.commands.browse:cli')
5124-DEBUG:stevedore.extension:found extension EntryPoint.parse('config = papis.commands.config:cli')
5147-DEBUG:stevedore.extension:found extension EntryPoint.parse('edit = papis.commands.edit:cli')
5169-DEBUG:stevedore.extension:found extension EntryPoint.parse('explore = papis.commands.explore:cli')
5194-DEBUG:stevedore.extension:found extension EntryPoint.parse('export = papis.commands.export:cli')
5229-DEBUG:stevedore.extension:found extension EntryPoint.parse('git = papis.commands.git:cli')
5281-DEBUG:stevedore.extension:found extension EntryPoint.parse('list = papis.commands.list:cli')
5327-DEBUG:stevedore.extension:found extension EntryPoint.parse('mv = papis.commands.mv:cli')
5380-DEBUG:stevedore.extension:found extension EntryPoint.parse('open = papis.commands.open:cli')
5430-DEBUG:stevedore.extension:found extension EntryPoint.parse('rename = papis.commands.rename:cli')
5482-DEBUG:stevedore.extension:found extension EntryPoint.parse('rm = papis.commands.rm:cli')
5533-DEBUG:stevedore.extension:found extension EntryPoint.parse('run = papis.commands.run:cli')
5576-DEBUG:stevedore.extension:found extension EntryPoint.parse('update = papis.commands.update:cli')
6202-DEBUG:db:whoosh:Initialized index found for library
6203-DEBUG:db:whoosh:Query string asdfasdfasdf
6293-DEBUG:db:whoosh:<Top 0 Results for Or([Term('title', 'asdfasdfasdf'), Term('author', 'asdfasdfasdf'), Term('tags', 'asdfasdfasdf')]) runtime=0.005870132998097688>
6296-DEBUG:document:sorting 0 documents
6297-WARNING:cli:list:No documents retrieved
6298-DEBUG:papis:plugin:creating manager for papis.picker
6299-DEBUG:stevedore.extension:found extension EntryPoint.parse('papis = papis.tui.picker:Picker')
Anyway, I'll leave it slow for now and let me know if you want me to run any more tests.
I should add that I do often change the yaml files under papis's feet, I have a script for setting the ref
field --- but these times are not from a run directly after upating the yaml files.
Jackson
Yes for me it's quite clear what is the problem, but I need some time to implement the caching cleanly, and I don't know when I'll get around to do this, I can push the immediate fix of the timings I posted, the one where only the data stored in the whoosh database will be retrieved, and maybe in two weeks time I can have the other part ready, and then we can do a v0.12
release
Hi all,
As my library size grows, I have noticed papis slowing down considerably. I have previously looked into making papis faster, but have come up short on ideas for how to fix the issue. Looking at performance profiles, the obvious way forward seems to be parallelization, but Alejandro has already tried that, and my own experiments with multiprocessing confirm what he found in (I can't find the issue number but I know I have seen it, anyway, Python multiprocessing doesn't seem to work to solve the problem).
The response times of papis are on the very slow end for me right now, I'm looking at 10+ second response times, which is getting to be too much. I have been using whoosh for some time, but it doesn't seem to skip the challenging part of this for papis, which is the file loading.
Now, I also want to point out that more performance should be possible. In my library of 737 items, running:
time papis list -a "year:2017"
takes 11.99s, 11.80, 12.16Grep performs much better (on a much harder task as it also scans PDF files!):
time grep -nr '2017' ./*
takes 5.47, 5.72, 6.05sExcluding PDF files from grep:
time grep -nr '2017' ./*.yaml
takes 0.11s, 0.13s, 0.11srip-grep performs better yet:
time rg 2017
0.31s, 0.28s, 0.29So clearly, the disk speed is not the limiting factor. Perhaps someone with an understanding of Python file-handling can suggest some fixes? Or at least why it is slow and multiprocessing doesn't seem to help?
I'm happy to have a look at implementing ideas that seem sensible.
I do have some ideas to skirt this problem without addressing the real problem at hand (i.e. the slow file loading):
cat */*.yaml > all.yaml
takes approx 0.2s.)I'm keen to hear other suggestions, and thoughts on the options I've suggested above, because this performance issue is starting to get a little bit out of hand in my library.
Jackson