informatics-isi-edu / chaise

An adaptive user interface for the Deriva platform.
https://www.isi.edu/isr/
Apache License 2.0

Resume file upload #2379

Open jrchudy opened 7 months ago

jrchudy commented 7 months ago

Using recordedit to upload files doesn't properly resume an upload if the connection to the server was lost or the window was refreshed. For instance, if a user is uploading a 200 MB file and only half of the file gets uploaded before an interruption, the user has to restart the upload process.

We should properly "resume" the file upload if a partial file exists on the server already. This will be handled in multiple steps:

  1. when the connection to the server is interrupted but page is NOT reloaded
  2. when the page is reloaded
  3. resuming in a different tab/window

step 1 - resume on connection interruption

For resuming a file upload process that was interrupted in recordedit, the following should be done:

  1. As a file is selected and the first chunk has completed uploading, start to track the last contiguous chunk that was uploaded. As each new chunk is uploaded, the last chunk index is updated if needed. Also keep track of the jobUrl to ensure it is the same “path” that is being uploaded to when the upload process is attempting to resume.
    • lastChunkIdx - the index of the last chunk that was successfully uploaded
    • jobUrl - the hatrac namespace with the upload job appended to the end
    • fileSize - the size of the file initially uploaded to help ensure the resumed file is the same as the original
    • uploadVersion - the final name for an upload job after the job is marked as complete
    • the key in the map is intended to ensure each upload that is being resumed is for the same file (checksums match) being uploaded to the same column and recordedit form index
  2. After an error occurs (loss of internet connection for example) and the user tries to upload again (clicks submit after resolving connection issue), the file checksum is calculated for the “new” upload
  3. If that checksum, the associated column name, and recordedit form index all match one of our stored values, check the following before marking the current UploadFileObject as a partial upload:
    • the new url is contained in the jobUrl we tracked
    • the lastChunkIdx indicates that at least one chunk has already been uploaded
    • the new file's size is the same as the fileSize we tracked
    • the uploadVersion is not set yet
  4. While checking if the file exists using the generated url (/hatrac/path/to/file.txt without ;upload/somehash), if we get a 409 response assume the namespace already exists and the jobUrl (/hatrac/path/to/file.txt;upload/somehash) is used for the upload instead of creating a new upload job
  5. When starting the upload job, the lastChunkIdx is used to indicate which chunk to start uploading from so the job is properly resumed and we don’t upload any duplicate chunks

    Map for storing information about incomplete upload jobs:

    {
      `${file.checksum}_${column.name}_${recordedit_form_index}`: {
        lastChunkIdx: n,
        jobUrl: '/hatrac/path/to/file.txt;upload/somehash',
        fileSize: n,
        uploadVersion: '/hatrac/path/to/file.txt:version'
      }
    }
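The tracking map and the resume check from step 3 can be sketched as follows. This is a minimal sketch of the behavior described above; `IncompleteUploadInfo`, `trackingKey`, and `shouldResume` are hypothetical names, not the actual chaise implementation.

```typescript
// Hypothetical shape of one entry in the tracking map described above.
interface IncompleteUploadInfo {
  lastChunkIdx: number;          // index of the last contiguous chunk uploaded (-1 if none)
  jobUrl: string;                // hatrac namespace with ';upload/<hash>' appended
  fileSize: number;              // size of the originally selected file, in bytes
  uploadVersion: string | null;  // final name; set only after the job is marked complete
}

const incompleteJobs: Record<string, IncompleteUploadInfo> = {};

// The key combines checksum, column name, and form index so a resumed upload
// must be the same file going to the same column and recordedit form.
function trackingKey(checksum: string, columnName: string, formIndex: number): string {
  return `${checksum}_${columnName}_${formIndex}`;
}

// Step 3: only mark the current UploadFileObject as a partial upload
// if all four conditions hold.
function shouldResume(key: string, newUrl: string, newFileSize: number): boolean {
  const job = incompleteJobs[key];
  if (!job) return false;
  return (
    job.jobUrl.indexOf(newUrl) !== -1 && // new url is contained in the tracked jobUrl
    job.lastChunkIdx >= 0 &&             // some chunks have already been uploaded
    job.fileSize === newFileSize &&      // same size as the original file
    job.uploadVersion === null           // job was never marked complete
  );
}
```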

step 2 - when the page is reloaded

Other changes to accomplish this across reloads include:

More information that should be stored:
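One possible approach for surviving a reload (an assumption on my part, not a settled design) is to serialize the tracking map to browser storage after each successfully uploaded chunk and rehydrate it on page load. The `KVStore` interface, `STORAGE_KEY`, and helper names below are hypothetical; in the browser, `window.localStorage` would be passed as the store.

```typescript
// Minimal key/value interface so the logic is testable outside a browser;
// window.localStorage satisfies it. All names here are hypothetical.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface StoredUploadInfo {
  lastChunkIdx: number;
  jobUrl: string;
  fileSize: number;
  uploadVersion: string | null;
}

const STORAGE_KEY = 'chaise-incomplete-uploads'; // hypothetical storage key

// Persist the whole tracking map after each successfully uploaded chunk.
function saveJobs(store: KVStore, jobs: Record<string, StoredUploadInfo>): void {
  store.setItem(STORAGE_KEY, JSON.stringify(jobs));
}

// On page load, rehydrate the map; returns an empty map if nothing was stored.
function loadJobs(store: KVStore): Record<string, StoredUploadInfo> {
  const raw = store.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as Record<string, StoredUploadInfo>) : {};
}
```

Note that the File object itself cannot be persisted across a reload, so the user would still reselect the file and the checksum would be recomputed before a stored entry can be matched, as in step 1.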

step 3 - resuming in a different tab/window

Other changes to accomplish this across multiple tabs/windows:
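If two tabs share the persisted map (e.g. via the browser's `storage` event or a BroadcastChannel, neither of which is mandated here), their entries for the same key need to be reconciled. A hypothetical merge that keeps whichever tab made more progress:

```typescript
interface UploadProgress {
  lastChunkIdx: number;
  jobUrl: string;
  fileSize: number;
  uploadVersion: string | null;
}

// Merge two tabs' views of the tracking map. For a key present in both,
// keep the entry that has uploaded more chunks; a completed job
// (uploadVersion set) always wins over an in-progress one.
function mergeJobMaps(
  a: Record<string, UploadProgress>,
  b: Record<string, UploadProgress>
): Record<string, UploadProgress> {
  const merged: Record<string, UploadProgress> = { ...a };
  for (const key of Object.keys(b)) {
    const theirs = b[key];
    const mine = merged[key];
    if (!mine) {
      merged[key] = theirs;
    } else if (theirs.uploadVersion !== null && mine.uploadVersion === null) {
      merged[key] = theirs;
    } else if (mine.uploadVersion === null && theirs.lastChunkIdx > mine.lastChunkIdx) {
      merged[key] = theirs;
    }
  }
  return merged;
}
```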

jrchudy commented 7 months ago

Looking at the hatrac REST-API doc, I see this line:

> Note, there is no support for determining which chunks have or have not been uploaded as such tracking is not a requirement placed on Hatrac implementations.

jrchudy commented 4 months ago

Issue #1837 is related to this issue. #1837 restarts the file upload job when a user logs back in after their session expires. The work here would improve that failure scenario but won't "fix" that issue: ideally #1837 is addressed by not refreshing the page after re-login, while this issue still helps because other events can refresh the page and force a restart.

jrchudy commented 1 week ago

Step 1 from the main message above has been merged. Moving this issue to "Scheduled" for implementing steps 2 and 3.