quanteda / quanteda.corpora

A collection of corpora for quanteda

Find a data repository for corpora #3

Open koheiw opened 6 years ago

koheiw commented 6 years ago

The corpora are currently stored in my Dropbox folder. It might be better to have a web server (or a dedicated Dropbox account).

koheiw commented 6 years ago

A web server for this purpose should come with a CDN. Candidates are:

GitHub has a storage service (https://git-lfs.github.com/), but I'm not sure how it works.

koheiw commented 6 years ago

quanteda.org's Google Drive might suffice.

kbenoit commented 6 years ago

Good point - feel free to try to make that work.

koheiw commented 6 years ago

It seems to work if we modify the link a bit:

quanteda.corpora::download(url = 'https://drive.google.com/uc?export=download&id=1VepIW420aAwIPxg4_Kj4Fi6jB-yllbGS')
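
The link modification could also be scripted. Below is a minimal sketch, assuming the sharing link has either the `/file/d/<id>/` or the `?id=<id>` form. `gdrive_direct_url()` is a hypothetical helper, not a function in quanteda.corpora; only the `uc?export=download&id=` target format is taken from the call above.

```r
# Hypothetical helper (not part of quanteda.corpora): turn a Google Drive
# sharing link into a direct-download URL of the form used above.
gdrive_direct_url <- function(share_url) {
  # Assumed link shapes: ".../file/d/<id>/..." and "...?id=<id>".
  patterns <- c("/file/d/([A-Za-z0-9_-]+)", "[?&]id=([A-Za-z0-9_-]+)")
  for (p in patterns) {
    # regexec() returns the full match followed by the captured file id.
    m <- regmatches(share_url, regexec(p, share_url))[[1]]
    if (length(m) == 2) {
      return(paste0("https://drive.google.com/uc?export=download&id=", m[2]))
    }
  }
  stop("could not find a file id in: ", share_url)
}

# The result can be passed as the `url` argument shown above.
direct <- gdrive_direct_url(
  "https://drive.google.com/file/d/1VepIW420aAwIPxg4_Kj4Fi6jB-yllbGS/view?usp=sharing"
)
print(direct)
```

Keeping the conversion in one place means only the sharing link has to be recorded for each corpus.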

koheiw commented 5 years ago

Zenodo seems to be a good open repository for corpora. Downloading from Dropbox is faster, but Zenodo is free and assigns DOIs to corpora.

> system.time(
+ download.file(url = "https://zenodo.org/record/1010076/files/GlycosideHydrolase_BLASTP.tar.gz?download=1", 
+               destfile = tempfile())
+ )
trying URL 'https://zenodo.org/record/1010076/files/GlycosideHydrolase_BLASTP.tar.gz?download=1'
Content type 'application/octet-stream' length 38068097 bytes (36.3 MB)
==================================================
downloaded 36.3 MB

   user  system elapsed 
  0.749   0.514 100.665 
> 
> system.time(
+   download.file(url = "https://www.dropbox.com/s/631wdkr21cwh0ez/GlycosideHydrolase_BLASTP.tar.gz?dl=1", 
+                 destfile = tempfile())
+ )
trying URL 'https://www.dropbox.com/s/631wdkr21cwh0ez/GlycosideHydrolase_BLASTP.tar.gz?dl=1'
Content type 'application/binary' length 38068097 bytes (36.3 MB)
==================================================
downloaded 36.3 MB

   user  system elapsed 
  0.924   0.574  41.288 
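
To put the two timings on the same scale, here is a quick throughput calculation from the reported file size and elapsed times. It is only arithmetic on the single runs shown above, not a benchmark, and the variable names are illustrative.

```r
# File size and elapsed times copied from the two download.file() runs above.
size_bytes <- 38068097
elapsed <- c(zenodo = 100.665, dropbox = 41.288)

# Throughput in MiB per second for each host.
mib_per_sec <- size_bytes / 1024^2 / elapsed
print(round(mib_per_sec, 2))

# How many times longer the Zenodo download took.
slowdown <- elapsed[["zenodo"]] / elapsed[["dropbox"]]
print(round(slowdown, 2))
```

Repeating each download several times would be needed before reading much into the difference.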

kbenoit commented 5 years ago

Nice, and as we discussed, the DOI feature is great too. However, since .rda files are already compressed, and since Zenodo is serving these as .zip files, this of course works too (and could work from Zenodo, without download.file):

load(url("https://kenbenoit.net/files/testcorpus.rda"))
testcorpus
## Corpus consisting of 58 documents and 3 docvars.
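
A small wrapper could make it easier to see what a remote .rda file contains without overwriting objects in the workspace. This is only a sketch: `load_rda_from_url()` is a hypothetical helper name, and it relies on nothing beyond base R `url()`, `load()` and `mget()`.

```r
# Hypothetical helper: load an .rda file from a URL into a separate
# environment and return its objects as a named list.
load_rda_from_url <- function(rda_url) {
  con <- url(rda_url)
  # load() opens the connection; close it when the function exits.
  on.exit(close(con))
  env <- new.env()
  # load() returns the names of the objects it created in `env`.
  loaded <- load(con, envir = env)
  mget(loaded, envir = env)
}

# Usage with the URL from above (needs network access):
# objs <- load_rda_from_url("https://kenbenoit.net/files/testcorpus.rda")
# names(objs)
```

Returning a list rather than loading into the global environment avoids name clashes when several corpora are loaded in one session.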