tomay / rebioma

Automatically exported from code.google.com/p/rebioma

Large file uploads #336

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
File size is currently effectively limited to about 5 MB, or roughly 10k DwC records. This is too restrictive to be scalable: users cannot be expected to divide up their datasets, especially once dataset metadata gets incorporated into the system, at which point datasets need to remain intact. It would therefore be useful to implement an incremental loading mechanism that can be used both in the current file upload via the user interface and in a managed CSV harvest from registered providers.

The incremental upload should be able to upload and process files of arbitrary size. If the file is large, the user should be notified at their registered email address when processing is complete, and the message should include the upload result information that one now sees in the user interface upon successful upload.

Original issue reported on code.google.com by gtuco.bt...@gmail.com on 6 Apr 2009 at 4:06
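
A minimal sketch of the receiving side of such an incremental (chunked) upload. Rebioma itself is a Java application (the JVM comes up later in this thread), so this Python snippet, using the Flask micro-framework, is only an illustration under assumed endpoint paths and parameters, not the actual rebioma API: chunks are appended to a per-upload temporary file, and a final "complete" call is where background processing and the email notification described above would be triggered.

```python
# Illustrative only: an HTTP endpoint that accepts a file in chunks, appends
# them to a per-upload temporary file, and (on completion) would queue the
# assembled file for validation/ingest and the e-mail notification.
import os
import tempfile

from flask import Flask, request  # third-party micro web framework

app = Flask(__name__)
UPLOAD_DIR = tempfile.mkdtemp(prefix="rebioma-upload-")


@app.route("/upload/<upload_id>/chunk", methods=["POST"])
def receive_chunk(upload_id):
    # Append the raw chunk body to the file being assembled for this upload.
    path = os.path.join(UPLOAD_DIR, upload_id + ".csv")
    with open(path, "ab") as f:
        f.write(request.get_data())
    return "chunk accepted\n", 200


@app.route("/upload/<upload_id>/complete", methods=["POST"])
def complete_upload(upload_id):
    # In a real system this would enqueue processing and later e-mail the
    # registered user the same result summary the UI shows today.
    path = os.path.join(UPLOAD_DIR, upload_id + ".csv")
    return "queued %d bytes for processing\n" % os.path.getsize(path), 202
```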

GoogleCodeExporter commented 9 years ago
I would suggest a simple command-line tool that users can download to bulk-load large files. The tool would essentially divide and conquer a large local CSV file by posting chunks of it to a rebioma web service. The tool would require rebioma user credentials, a chunk size, and a URL endpoint. This method is more scalable, since downloading large files from the cloud will be even more restrictive.

Original comment by eightyst...@gmail.com on 6 Apr 2009 at 4:53
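
A minimal sketch of the command-line bulk loader described above, assuming a hypothetical upload endpoint and HTTP basic-auth credentials; the real service URL, authentication scheme, and chunk format would have to match whatever rebioma actually exposes. A provider could then automate the same command with cron, as suggested later in the thread.

```python
#!/usr/bin/env python
"""Sketch of a bulk-loading client: split a large DwC CSV into chunks and
POST each chunk to a rebioma upload service (endpoint and fields assumed)."""
import argparse
import csv
import io

import requests  # third-party HTTP client


def post_chunk(url, user, password, header, rows):
    """Send one chunk (header + rows) as a CSV request body."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return requests.post(url, data=buf.getvalue().encode("utf-8"),
                         auth=(user, password),
                         headers={"Content-Type": "text/csv"})


def main():
    p = argparse.ArgumentParser(description="Chunked CSV bulk loader (sketch)")
    p.add_argument("csvfile")
    p.add_argument("--url", required=True, help="upload service endpoint")
    p.add_argument("--user", required=True)
    p.add_argument("--password", required=True)
    p.add_argument("--chunk-size", type=int, default=5000,
                   help="records per POST")
    args = p.parse_args()

    with open(args.csvfile, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = []
        for row in reader:
            rows.append(row)
            if len(rows) >= args.chunk_size:
                post_chunk(args.url, args.user, args.password, header, rows)
                rows = []
        if rows:  # trailing partial chunk
            post_chunk(args.url, args.user, args.password, header, rows)


if __name__ == "__main__":
    main()
```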

GoogleCodeExporter commented 9 years ago
With the new Darwin Core DatasetID term we could resolve any relationships between files and maintain their coherence once loaded, should the user choose to do so. Your recommendation, however, skirts the issue of provider registration and again places the burden on the provider to do something special in order to participate. THAT is not scalable. Providers need to be able to say "Here I am, come and get my data" without having a special hoop to jump through for every initiative (portal) in which they want to participate. This has to be a consideration.

Original comment by gtuco.bt...@gmail.com on 6 Apr 2009 at 5:17

GoogleCodeExporter commented 9 years ago
Good point. So providers use the CsvProvider software to generate CSV dumps, which can then be retrieved using a web service, right? Suppose CsvProvider could be modified to optionally provide CSV download via pagination.

Original comment by eightyst...@gmail.com on 6 Apr 2009 at 5:25
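
A sketch of how a harvester might page through such a CSV download, assuming hypothetical offset/limit query parameters were added to CsvProvider (the parameter names are invented):

```python
# Illustrative only: pull a provider's records page by page until an empty
# page signals the end of the dataset.
import csv
import io

import requests  # third-party HTTP client


def harvest(base_url, page_size=5000):
    """Yield DwC records from a paginated CSV endpoint."""
    offset = 0
    while True:
        resp = requests.get(base_url, params={"offset": offset,
                                              "limit": page_size})
        resp.raise_for_status()
        rows = list(csv.DictReader(io.StringIO(resp.text)))
        if not rows:
            break  # past the last page
        for row in rows:
            yield row
        offset += page_size
```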

GoogleCodeExporter commented 9 years ago
Yes. And an additional service could also be built as a middle-man for providers who don't have a software installation but instead just have a file accessible via a URL. This middle-man software could grab the whole file and perform the chunked upload.

Original comment by gtuco.bt...@gmail.com on 6 Apr 2009 at 5:34
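
A sketch of that middle-man idea: stream the provider's file from its URL and re-post it in chunks to the upload service. The URLs and chunk size are placeholders, and in practice this would reuse the same chunked-upload client sketched earlier in the thread.

```python
# Illustrative only: relay a remote CSV to the upload service in chunks
# without ever holding the whole file in memory.
import requests  # third-party HTTP client


def relay(source_url, upload_url, chunk_bytes=1024 * 1024):
    """Stream the remote file and forward it one chunk at a time."""
    with requests.get(source_url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_bytes):
            requests.post(upload_url, data=chunk,
                          headers={"Content-Type": "application/octet-stream"})
```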

GoogleCodeExporter commented 9 years ago
The CsvProvider modification helps providers but doesn't help single users, and the middle-man service requires maintaining a server somewhere outside the cloud. Backing up a bit, I still think the best solution is a command-line bulk-loading tool: it helps providers, who can simply automate it with a cron job; it helps users, who can run it from their own machines; and it avoids maintaining servers outside the cloud.

Original comment by eightyst...@gmail.com on 6 Apr 2009 at 5:42

GoogleCodeExporter commented 9 years ago
This needs additional input from Aaron and John.

Original comment by tom.alln...@gmail.com on 17 Feb 2011 at 1:24

GoogleCodeExporter commented 9 years ago
Some recent feedback from Aaron:

Uploading large files to the Rebioma server (say, larger than 50 MB) over HTTP is going to be problematic. I still think the right solution is a command-line bulk-loading tool. Another approach is to support FTP uploads to the server, with a task queue that can process files and email people when processing is done.

An interim solution (pending time and funding to develop a more complete fix) is to add some text to the upload process warning users not to upload files larger than 50 MB, or to break these up to avoid issues. Users could also be told that they may work directly with us (contact: rebiomawebportal [at] gmail [dot] com) to get large files onto the system.

Original comment by tom.alln...@gmail.com on 8 Mar 2011 at 8:14
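
A sketch of the FTP-plus-task-queue approach: a worker polls a drop directory that users upload to via FTP, processes each file, and emails a result summary. The directory layout, SMTP host, sender address, and the way the uploader's address is looked up are all assumptions.

```python
# Illustrative only: poll an FTP drop directory, "process" each file, and
# notify the uploader by e-mail.
import os
import smtplib
import time
from email.message import EmailMessage

DROP_DIR = "/var/ftp/rebioma/incoming"   # assumed FTP drop box
DONE_DIR = "/var/ftp/rebioma/processed"  # assumed archive directory
SMTP_HOST = "localhost"                  # assumed mail relay


def count_records(path):
    # Placeholder for real validation/ingest; here we just count data rows.
    with open(path) as f:
        return sum(1 for _ in f) - 1  # minus the header row


def notify(address, summary):
    msg = EmailMessage()
    msg["Subject"] = "Rebioma upload result"
    msg["From"] = "rebioma-noreply@example.org"  # placeholder sender
    msg["To"] = address
    msg.set_content(summary)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


def main():
    while True:
        for name in os.listdir(DROP_DIR):
            path = os.path.join(DROP_DIR, name)
            summary = "Processed %s: %d records" % (name, count_records(path))
            # Assumes the uploader's address is known from their account or a
            # sidecar metadata file; hard-coded here for illustration.
            notify("uploader@example.org", summary)
            os.rename(path, os.path.join(DONE_DIR, name))
        time.sleep(60)  # poll once a minute
```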

GoogleCodeExporter commented 9 years ago
A recent observation: we are able to upload about 20k records at a time, with about 10 fields. A command-line or FTP tool would still be a useful add-on.

Original comment by tom.alln...@gmail.com on 26 Jul 2011 at 4:48

GoogleCodeExporter commented 9 years ago
An even simpler solution would be a command-line tool that runs on the server. Project administrators could upload files via FTP, then run the command-line tool to ingest the data. Paired with issue #406, ownership of these records could be changed, or even assigned to any user, with this same tool.

Original comment by tom.alln...@gmail.com on 8 Aug 2011 at 11:36
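
A sketch of what such a server-side ingest command might look like, with an ownership option reflecting the pairing with issue #406 mentioned above; the ingest logic itself is a placeholder, not existing rebioma code.

```python
# Illustrative only: an administrator-run ingest command for a file that has
# already arrived on the server (e.g. via FTP).
import argparse


def ingest(path, owner_email):
    # Placeholder: parse the CSV, validate records, write them to the
    # database, and set the given user as owner of each record.
    print("ingesting %s with owner %s" % (path, owner_email))


def main():
    p = argparse.ArgumentParser(description="Server-side bulk ingest (sketch)")
    p.add_argument("csvfile", help="file already uploaded via FTP")
    p.add_argument("--owner", required=True,
                   help="email of the rebioma user to own the records")
    args = p.parse_args()
    ingest(args.csvfile, args.owner)


if __name__ == "__main__":
    main()
```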

GoogleCodeExporter commented 9 years ago
Wilfried is working on this; it is almost implemented for files under 25 MB. For larger files there is a JVM limitation, and users will have to break up files larger than 25 MB.

Original comment by tom.alln...@gmail.com on 29 Oct 2012 at 9:54

GoogleCodeExporter commented 9 years ago
Wilfried fixed this issue!

Original comment by nirina.t...@gmail.com on 23 Jan 2013 at 12:35