Store static files like pdfs, parsed data, caches to a shared disk

rsaim / supplementary

Portal to analyze and visualize results of DTU students.


Store static files like pdfs, parsed data, caches to a shared disk #2

Closed rsaim closed 4 years ago

rsaim commented 4 years ago

Let's move the files to S3 (or some other similar storage). This repo should contain only code. Please specify the steps taken and the way to access/download the pdfs.

rsaim commented 4 years ago

I had a hard time removing the PDF files from the git history. FTR, I used https://rtyley.github.io/bfg-repo-cleaner/ and rewrote the whole git history. Please re-clone your checkouts. @tezas

rsaim commented 4 years ago

@himanshuhy any progress here?

rsaim commented 4 years ago

Tabula operations take a lot of time (~2 hrs for ~1300 PDFs), so I am thinking about caching the results. I will put the cache in the cloud once we have this set up.
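A minimal sketch of the kind of on-disk cache described above, not the code used in this repo. It assumes the slow parsing step is wrapped in a callable (`parse_fn` is a hypothetical stand-in for the tabula call) that returns JSON-serializable data, and it keys each cache entry on the SHA-256 of the PDF's bytes so a changed file is parsed again. The function name and cache layout are illustrative.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(pdf_path, parse_fn, cache_dir):
    """Return parse_fn(pdf_path), reusing a JSON cache keyed by the file's SHA-256."""
    pdf_path = Path(pdf_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the file contents so a renamed file still hits the cache
    # and a modified file misses it.
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    # Slow step: parse_fn is a hypothetical wrapper around the real parser.
    result = parse_fn(pdf_path)
    cache_file.write_text(json.dumps(result))
    return result
```

Because the cache is a plain directory of JSON files, it could be uploaded to shared storage as is.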

We would really benefit from shared disk space. Should we consider something like AWS Cloud9?

himanshuhy commented 4 years ago

Looking into it.

Google Drive - tried, but it needs a lot of setup (keys etc.)

DropBox -> Going with this one right now. Find details below.

pCloud - doesn't have a Python SDK (only JS :D)

I have created a Dropbox app, Supplementary, and generated access tokens for it (I will document how to use it, etc.).

See https://www.dropbox.com/developers/documentation/python#tutorial for a quick check on how to use it.
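For reference, uploading and downloading with the Dropbox Python SDK could look roughly like the sketch below. It is not the code from this repo: the environment variable name `DROPBOX_ACCESS_TOKEN` and the helper names are assumptions, and `files_upload` sends the whole file in one request, so it only suits files of modest size such as these PDFs.

```python
import os
from pathlib import Path


def make_client():
    """Build a client from a token in the environment (variable name is an assumption)."""
    import dropbox  # third-party package: pip install dropbox
    return dropbox.Dropbox(os.environ["DROPBOX_ACCESS_TOKEN"])


def upload_file(dbx, local_path, remote_path):
    """Upload one local file to remote_path, e.g. '/data/dtu_results/x.pdf'."""
    data = Path(local_path).read_bytes()
    return dbx.files_upload(data, remote_path)


def download_file(dbx, remote_path, local_path):
    """Download remote_path and write its bytes to local_path."""
    # files_download returns a (metadata, HTTP response) pair.
    metadata, response = dbx.files_download(remote_path)
    Path(local_path).write_bytes(response.content)
    return metadata
```

Typical usage would be `dbx = make_client()` followed by `upload_file(dbx, "x.pdf", "/data/dtu_results/x.pdf")`. Passing the client in as an argument keeps the token out of the helpers.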

himanshuhy commented 4 years ago

@rsaim Yeah, we can use a free-tier EC2 host to run Cloud9 to perform the parsing. Let me first get the code to upload and download files from Dropbox into our app.

rsaim commented 4 years ago

We don't need computing power after parsing the pdf. I was thinking of having a disk path (over NFS?) so that I can dump the data, caches, etc.

I am thinking of a disk served by an NFS server. We could then mount a local path to refer to the disk and directly read/write. This is how infra is set up in many firms. However, we should do it only if it's easy to set up.

rsaim commented 4 years ago

@himanshuhy this looks promising. I will explore more and update here.

rsaim commented 4 years ago

I have uploaded the PDF files to Dropbox. I will update the README with the usage of the Dropbox API.

In [17]: len([entry.name for entry in dbx.files_list_folder('/data/dtu_results').entries])
Out[17]: 1319

In [18]: [entry.name for entry in dbx.files_list_folder('/data/dtu_results').entries][:5]
Out[18]:
['Con_MBA2K11_209.pdf',
 'DIS_BT_656_657.pdf',
 'CON_BT_681.pdf',
 'E15_BT_DIS_VI_VIII_428.pdf',
 'O17_REV_BTPT_1_756.pdf']
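A single `files_list_folder` call returned all 1319 entries here, but the Dropbox API can page its results. A hedged sketch of a listing helper that follows the cursor until the folder is exhausted is shown below; the helper name `list_all_names` is illustrative and `dbx` is assumed to be an authenticated `dropbox.Dropbox` client.

```python
def list_all_names(dbx, folder):
    """Return the names of all entries in folder, following pagination."""
    result = dbx.files_list_folder(folder)
    names = [entry.name for entry in result.entries]

    # has_more signals that another page is available for this cursor.
    while result.has_more:
        result = dbx.files_list_folder_continue(result.cursor)
        names.extend(entry.name for entry in result.entries)

    return names
```

With this helper, `len(list_all_names(dbx, '/data/dtu_results'))` would give the total count however the results are paged.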

rsaim commented 4 years ago

All the PDFs, parsed data, and caches are now uploaded to Dropbox.

Thanks to @himanshuhy