rsaim closed this issue 4 years ago
I had a hard time removing the pdf files from the git history. FTR, I used https://rtyley.github.io/bfg-repo-cleaner/ and rewrote the whole git history, so existing checkouts no longer match the remote. Please clone the repo again. @tezas
@himanshuhy any progress here?
As tabula operations take a lot of time (~2 hrs for ~1300 pdfs), I am thinking about caching the results. I will put the cache in the cloud once we have this set up.
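A rough sketch of what such a cache could look like, keyed on a hash of the pdf bytes so a file is re-parsed only when its content changes. The names `cached_parse`, `file_digest`, and `cache_dir` are hypothetical, and `parse_fn` stands in for whatever tabula call we end up using; it is assumed to return JSON-serializable data.

```python
import hashlib
import json
from pathlib import Path


def file_digest(pdf_path):
    """Return the SHA-256 hex digest of the file's bytes."""
    h = hashlib.sha256()
    with open(pdf_path, "rb") as f:
        # Read in 1 MiB chunks so large pdfs are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_parse(pdf_path, parse_fn, cache_dir="cache"):
    """Return parse_fn(pdf_path), reusing a JSON cache entry if one exists.

    parse_fn is a placeholder for the real (slow) parser and must return
    JSON-serializable data (assumption).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (file_digest(pdf_path) + ".json")
    if cache_file.exists():
        # Cache hit: skip the expensive parse.
        return json.loads(cache_file.read_text())
    result = parse_fn(pdf_path)
    cache_file.write_text(json.dumps(result))
    return result
```

Since the cache is just a directory of JSON files, it could be uploaded to shared storage as-is once we have that.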
We would really benefit if we could have shared disk space. Should we consider something like AWS Cloud9?
Looking into it.
Google Drive - tried, but it needs a lot of setup (keys etc.)
Dropbox - going with this one right now. Find details below.
pCloud - doesn't have a Python SDK (only JS :D)
I have created a Dropbox app, Supplementary, and generated access tokens for it (I will document how to use it etc).
See https://www.dropbox.com/developers/documentation/python#tutorial for a quick overview of how to use it.
@rsaim Yeah, we can use a free-tier EC2 host to run Cloud9 for the parsing. Let me first get the code for uploading and downloading files between Dropbox and our app working.
We don't need computing power after parsing the pdf. I was thinking of having a disk path (over NFS?) so that I can dump the data, caches, etc.
I am thinking of a disk served by an NFS server. We could then mount it at a local path and read/write directly. This is how infra is set up in many firms. However, we should do it only if it's easy to set up.
@himanshuhy this looks promising. I will explore more and update here.
I have uploaded the pdf files to Dropbox. I will update the README with the usage of the Dropbox API.
In [17]: len([entry.name for entry in dbx.files_list_folder('/data/dtu_results').entries])
Out[17]: 1319
In [18]: [entry.name for entry in dbx.files_list_folder('/data/dtu_results').entries][:5]
Out[18]:
['Con_MBA2K11_209.pdf',
'DIS_BT_656_657.pdf',
'CON_BT_681.pdf',
'E15_BT_DIS_VI_VIII_428.pdf',
'O17_REV_BTPT_1_756.pdf']
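One caveat on the listing above: `files_list_folder` returns results in pages, so a single call is not guaranteed to return every entry as the folder grows. A minimal sketch that follows the pagination with `files_list_folder_continue` is below; the helper name `list_all_names` is made up for illustration, and `dbx` is assumed to be an authenticated `dropbox.Dropbox` client like the one in the session.

```python
def list_all_names(dbx, folder):
    """Return the names of every entry in a Dropbox folder.

    dbx is assumed to be an authenticated dropbox.Dropbox client.
    """
    result = dbx.files_list_folder(folder)
    names = [entry.name for entry in result.entries]
    # Keep fetching pages until Dropbox reports there are no more.
    while result.has_more:
        result = dbx.files_list_folder_continue(result.cursor)
        names.extend(entry.name for entry in result.entries)
    return names
```

Usage would be something like `len(list_all_names(dbx, '/data/dtu_results'))`.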
All the pdfs, parsed data, and caches have been uploaded to Dropbox.
Thanks to @himanshuhy
Let's move the files to S3 (or some other similar storage). This repo should contain only code. Please specify the steps taken and the way to access/download the pdfs.
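Until the move happens, access to the files currently on Dropbox could look roughly like the sketch below. The helper names `upload_file` and `download_file` are hypothetical, `dbx` is assumed to be a `dropbox.Dropbox` client created from the app's access token, and the paths in the usage comment are only examples. `files_upload` is used with its default write mode, so re-uploading a changed file to an existing path may be rejected as a conflict.

```python
import os


def upload_file(dbx, local_path, dropbox_path):
    """Upload one local file to dropbox_path (default write mode)."""
    with open(local_path, "rb") as f:
        # files_upload takes the whole payload in one request, which
        # suits individual pdfs but not very large files.
        dbx.files_upload(f.read(), dropbox_path)


def download_file(dbx, dropbox_path, local_dir):
    """Download one Dropbox file into local_dir and return the local path."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(dropbox_path))
    dbx.files_download_to_file(local_path, dropbox_path)
    return local_path


# Example usage (token is a placeholder, not a real credential):
#   import dropbox
#   dbx = dropbox.Dropbox("<ACCESS_TOKEN>")
#   download_file(dbx, "/data/dtu_results/CON_BT_681.pdf", "data/dtu_results")
```

The same two helpers would map fairly directly onto an S3 client later, which keeps the repo itself free of data files.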