rsaim / supplementary

Portal to analyze and visualize results of DTU students.

Create disk caches for tabula and pdfplumber #8

Closed by rsaim 4 years ago

rsaim commented 4 years ago

I will create the caches and move them to the Dropbox disk space that @himanshuhy recently created.

I will also document here how to leverage the caches during development.

rsaim commented 4 years ago

https://github.com/grantjenks/python-diskcache looks promising.
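For reference, a minimal sketch of how diskcache could wrap the tabula call. The cache directory and wrapper name here are assumptions for illustration, not the repo's actual code:

# Minimal sketch (assumptions: the cache lives under data/caches/tabula
# and tabula-py is installed as `tabula`).
import diskcache
import tabula

cache = diskcache.Cache("data/caches/tabula")  # persistent on-disk cache

@cache.memoize()
def tabula_read_pdf(filepath, **kwargs):
    # The first call with a given set of arguments parses the PDF and
    # stores the result on disk; later identical calls return the
    # cached value without touching tabula at all.
    return tabula.read_pdf(filepath, **kwargs)

The same pattern would apply to pdfplumber, with a second Cache instance pointed at its own directory.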

rsaim commented 4 years ago

I have written the code to create the caches. Example timings from an IPython session:

In [6]: from src.python.utils import get_topdir, tabula_read_pdf

In [7]: %time pages_df = tabula_read_pdf(filepath, pages="all")
CPU times: user 271 ms, sys: 10.1 ms, total: 281 ms
Wall time: 14.1 s

In [8]: %time pages_df = tabula_read_pdf(filepath, pages="all")
CPU times: user 25.4 ms, sys: 379 µs, total: 25.7 ms
Wall time: 25.9 ms
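The first call (In [7]) is a cache miss, so tabula actually parses the PDF and takes ~14 s of wall time; the second identical call (In [8]) is served from the disk cache in ~26 ms, a roughly 500x speedup.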

Note that we need to run dropbox_updown.py every time we want to update the caches on Dropbox. The script prompts interactively, as in the log below; a sketch of the underlying upload call follows the log.

python src/python/dropbox_updown.py data data
...
Folder listing failed for /data/caches -- assumed empty: ApiError('a2aa74fe1c110c729bf256ef84284799', ListFolderError('path', LookupError('not_found', None)))
Descending into caches ...
Descend into pdfplumber? [Y/n] y
Keeping directory: pdfplumber
Descend into tabula? [Y/n] y
Keeping directory: tabula
Total elapsed time for list_folder: 0.369
Folder listing failed for /data/caches/pdfplumber -- assumed empty: ApiError('bd36b779b43873b583f64533a792bd07', ListFolderError('path', LookupError('not_found', None)))
Descending into caches/pdfplumber ...
Upload cache.db-shm? [Y/n] y
Total elapsed time for upload 32768 bytes: 1.284
uploaded as b'cache.db-shm'
Upload cache.db-wal? [Y/n] y
Total elapsed time for upload 0 bytes: 0.835
uploaded as b'cache.db-wal'
Upload cache.db? [Y/n] y
Total elapsed time for upload 32768 bytes: 1.116
uploaded as b'cache.db'
Total elapsed time for list_folder: 0.449
Folder listing failed for /data/caches/tabula -- assumed empty: ApiError('909712773f0f95de36306fdb8390a4b8', ListFolderError('path', LookupError('not_found', None)))
Descending into caches/tabula ...
Upload cache.db-shm? [Y/n] y
Total elapsed time for upload 32768 bytes: 1.141
uploaded as b'cache.db-shm'
Upload cache.db-wal? [Y/n] y
Total elapsed time for upload 20632 bytes: 1.042
uploaded as b'cache.db-wal'
Upload cache.db? [Y/n] y
Total elapsed time for upload 32768 bytes: 0.996
uploaded as b'cache.db'
Descend into c4? [Y/n] y
Keeping directory: c4
Total elapsed time for list_folder: 0.548
Folder listing failed for /data/caches/tabula/c4 -- assumed empty: ApiError('94deba2765ba00e096f38664293faa62', ListFolderError('path', LookupError('not_found', None)))
Descending into caches/tabula/c4 ...
Descend into 3d? [Y/n] y
Keeping directory: 3d
Total elapsed time for list_folder: 0.355
Folder listing failed for /data/caches/tabula/c4/3d -- assumed empty: ApiError('3a3b3b23fd36a07e48c90fb683000700', ListFolderError('path', LookupError('not_found', None)))
Descending into caches/tabula/c4/3d ...
Upload f708da9d484a64783c739e16f2d4.val? [Y/n] y
Total elapsed time for upload 173529 bytes: 1.550
uploaded as b'f708da9d484a64783c739e16f2d4.val'
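For context, here is a minimal sketch of the kind of upload a script like dropbox_updown.py performs via the official Dropbox Python SDK. The token handling, function name, and paths are assumptions for illustration, not the script's actual code:

import dropbox

def upload_file(token, local_path, remote_path):
    # Push one local file to Dropbox, overwriting any existing copy.
    dbx = dropbox.Dropbox(token)  # token: a Dropbox API access token (assumed)
    with open(local_path, "rb") as f:
        dbx.files_upload(f.read(), remote_path,
                         mode=dropbox.files.WriteMode.overwrite)

# Hypothetical usage:
# upload_file(TOKEN, "data/caches/tabula/cache.db", "/data/caches/tabula/cache.db")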
rsaim commented 4 years ago

The caches are created and pushed to Dropbox.

Parsing all the PDFs now takes just 19.2 s, down from 1-2 hours :), using 8 cores (Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz).

In [3]: from src.python.parse_results import parse_dtu_result_pdf, parse_metadata, parse_all_pdf

In [4]: %time res=parse_all_pdf(parallel=True)
...
CPU times: user 758 ms, sys: 169 ms, total: 927 ms
Wall time: 19.2 s
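For reference, a minimal sketch of what the parallel=True path could look like with multiprocessing; the glob pattern and function body are assumptions about the repo's layout, not the actual implementation:

import glob
import multiprocessing
from src.python.parse_results import parse_dtu_result_pdf

def parse_all_pdf(parallel=True):
    # Assumed location of the DTU result PDFs.
    pdf_paths = glob.glob("data/**/*.pdf", recursive=True)
    if parallel:
        # One worker per core; each parse hits the disk cache when warm,
        # which is why the full run now finishes in seconds.
        with multiprocessing.Pool() as pool:
            return pool.map(parse_dtu_result_pdf, pdf_paths)
    return [parse_dtu_result_pdf(p) for p in pdf_paths]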

Now we can focus on writing logic instead of waiting on PDF parsing.