icipe-official / vectoratlas-software-code

MIT License

Handle Data Upload #487

Open gituman opened 2 months ago

gituman commented 2 months ago

Overview

  1. Adjust the data page and add a new dropdown prompting the user to upload a CSV/XLSX file, which will then be sent to blob storage.
  2. Fields to be specified during upload: preferred email address for communication, a checkbox indicating whether a DOI should be generated, the existing DOI if one was provided, and a short description of the dataset.
  3. Once the data is uploaded to blob storage, an email is sent to both the uploader and the reviewer.
  4. There must be an interface where the uploader can track the progress of their uploaded data.
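The upload fields in point 2 could be captured as a typed payload. A minimal sketch in TypeScript (field and function names are illustrative assumptions, not taken from the actual codebase):

```typescript
// Hypothetical shape of the metadata captured alongside the CSV/XLSX upload.
// Field names are illustrative, not taken from the Vector Atlas codebase.
interface UploadMetadata {
  contactEmail: string;   // preferred email address for communication
  generateDoi: boolean;   // checkbox: should a DOI be minted?
  providedDoi?: string;   // DOI supplied by the uploader, if any
  description: string;    // short description of the dataset
}

// Minimal validation before the file is sent to blob storage.
function validateUploadMetadata(meta: UploadMetadata): string[] {
  const errors: string[] = [];
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(meta.contactEmail)) {
    errors.push("contactEmail must be a valid email address");
  }
  if (meta.generateDoi && meta.providedDoi) {
    errors.push("cannot both request DOI generation and provide an existing DOI");
  }
  if (!meta.description.trim()) {
    errors.push("description is required");
  }
  return errors;
}
```

Validating this payload server-side before touching blob storage keeps malformed requests from creating orphaned files.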

Request table: logs users' data upload requests and their follow-up status

API endpoints
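The issue leaves the endpoint list unspecified. Purely as an illustration of the surface the Overview implies (upload, tracking, review status updates), a plausible REST sketch might look like the following; none of these paths or verbs come from the issue:

```typescript
// Hypothetical REST endpoints for the upload workflow; all names are assumptions.
interface Endpoint {
  method: "GET" | "POST" | "PATCH";
  path: string;
  purpose: string;
}

const uploadEndpoints: Endpoint[] = [
  { method: "POST",  path: "/uploaded-dataset",            purpose: "upload a csv/xlsx plus metadata to blob storage" },
  { method: "GET",   path: "/uploaded-dataset",            purpose: "list the current user's uploads and their statuses" },
  { method: "GET",   path: "/uploaded-dataset/:id",        purpose: "fetch one upload for the progress-tracking interface" },
  { method: "PATCH", path: "/uploaded-dataset/:id/status", purpose: "reviewer updates the review status" },
];
```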

stevenyaga commented 2 months ago

Data upload workflow

1. Upload

2. Data alignment/approval

3. Data correction

4. Communication

Implications:

stevenyaga commented 2 months ago

During our weekly stand-up on 2024-09-20, the following was agreed:

  1. An uploader logs into the frontend and navigates to the upload dataset page.
  2. The uploader attaches a dataset and provides a short description of the data.
  3. The system notifies all reviewers that a new dataset has been uploaded.
  4. A reviewer downloads the uploaded dataset from the system and self-assigns to review it.
  5. If the reviewer identifies issues with the uploaded dataset, they email the uploader asking for corrections.
  6. The uploader then re-uploads a corrected dataset, and the reviewer picks it up from the backend to continue the review.
  7. After the reviewer has completed the review and judged the dataset to be valid, the reviewer notifies their manager to approve the dataset; this email notification is sent automatically.
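The steps above imply a simple status lifecycle. A sketch of the allowed transitions, based on my reading of steps 1-7 and the status values listed later in this thread (the actual rules may differ):

```typescript
// Status values from the Uploaded Dataset model described later in this thread.
type DatasetStatus = "Pending" | "Under Review" | "Approved" | "Rejected";

// Transitions as I read steps 1-7: an upload starts Pending, a reviewer
// self-assigning moves it to Under Review, and review ends in Approved or
// Rejected. A corrected re-upload re-enters review.
const allowedTransitions: Record<DatasetStatus, DatasetStatus[]> = {
  "Pending": ["Under Review"],
  "Under Review": ["Approved", "Rejected"],
  "Rejected": ["Under Review"],
  "Approved": [],
};

function canTransition(from: DatasetStatus, to: DatasetStatus): boolean {
  return allowedTransitions[from].includes(to);
}
```

Encoding the transitions in one table makes the backend reject out-of-order status updates (e.g. approving a dataset nobody has reviewed) in a single place.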

Notes:

stevenyaga commented 2 months ago

Implementation

Four tables (models) will be required to support the data upload and approval workflow:

1. Uploaded Dataset model

Model to hold uploaded files along with other metadata. Uploaded CSV files will be stored on disk (blob storage). Table fields are:

  1. last_upload_date. Date when the upload was done.
  2. last_status_update_date. Date when the status of the uploaded dataset was last modified.
  3. title. Title of the dataset; this will be used when minting the DOI.
  4. description. Brief description of the dataset that may be of interest to a reviewer.
  5. uploaded_file_name. Name of the file that was uploaded; we will use this name to retrieve the file from disk.
  6. converted_file_name. Name of the file after conversion into the VA template; we will use this name to retrieve the file from disk during the ingestion stage.
  7. provided_doi. DOI provided at the time of upload, if any.
  8. status. Status of the uploaded dataset. Possible statuses are Pending, Under Review, Approved, and Rejected.
2. Uploaded Dataset Log model

Model to hold the different activities that can be performed against an uploaded dataset, e.g. upload, re-upload, email communication, approval, or rejection. Table fields are:

  1. action_type. Type of action that was performed. Possible values are Upload, Download, Communication, Approve, and Reject.
  2. action_date. Date when the action occurred.
  3. action_details. Details of the action.
  4. dataset. Uploaded dataset against which the log is kept.
  5. action_taker. User who performed the action.
3. DOI Source model

Model to store information that may result in minting or storing a DOI. The fields are:

  1. source_type. Where the DOI request originated. Possible values are Download and Upload.
  2. download_meta_data. Metadata of the downloaded dataset for which we intend to mint a DOI.
  3. approval_status. Approval status of the DOI request.
  4. title. Title of the DOI source; used at the point of generating the actual DOI.
  5. author_name. Name of the author/originator/requester.
  6. author_email. Email of the author/originator/requester. This is mandatory, as we will use this email to communicate with the author while the DOI goes through the approval process.
  7. uploaded_dataset. Uploaded Dataset foreign key. Applicable where source_type is Upload.
  8. approved_dataset. Approved Dataset foreign key. Applicable where source_type is Upload.
4. Communication Log model

Model to hold communication against an entity, including an uploaded dataset. This model gives us a generic means of recording all forms of communication in the system. Model fields are:

  1. communication_date. Date of communication.
  2. channel_type. Channel of communication, e.g. Email.
  3. recipients. Recipients of the communication.
  4. message_type. Type or subject of the message being communicated.
  5. message. Message to be communicated.
  6. sent_status. Sent status of the message. Possible values are Pending, Sent, and Failed.
  7. sent_date. Date the message was sent.
  8. reference_entity_type. Type of entity that triggered this communication.
  9. reference_entity_name. Name or ID of the entity that triggered this communication.
  10. error_type. Type of error that occurred while sending the message.
  11. error_description. Details of the error that occurred while sending the message.
  12. arguments. Arguments or extra data passed when sending the message.
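Taken together, the four models above might be sketched as plain TypeScript types. Field names follow this thread; the surrogate `id` keys, the camel-casing, and the helper are my additions, and the real implementation would presumably define these as ORM entities with proper relations rather than bare interfaces:

```typescript
// Plain-type sketch of the four models described above (an assumption, not
// the actual entity definitions).
type DatasetStatus = "Pending" | "Under Review" | "Approved" | "Rejected";
type ActionType = "Upload" | "Download" | "Communication" | "Approve" | "Reject";
type SentStatus = "Pending" | "Sent" | "Failed";

interface UploadedDataset {
  id: string;                    // surrogate key, assumed
  lastUploadDate: Date;
  lastStatusUpdateDate: Date;
  title: string;                 // used when minting a DOI
  description: string;
  uploadedFileName: string;      // key for retrieving the raw file from blob storage
  convertedFileName?: string;    // file converted to the VA template, used at ingestion
  providedDoi?: string;
  status: DatasetStatus;
}

interface UploadedDatasetLog {
  actionType: ActionType;
  actionDate: Date;
  actionDetails: string;
  datasetId: string;             // FK to UploadedDataset
  actionTaker: string;
}

interface DoiSource {
  sourceType: "Download" | "Upload";
  downloadMetaData?: string;
  approvalStatus: string;
  title: string;
  authorName: string;
  authorEmail: string;           // mandatory: used for approval-process emails
  uploadedDatasetId?: string;    // applicable where sourceType is "Upload"
  approvedDatasetId?: string;
}

interface CommunicationLog {
  communicationDate: Date;
  channelType: "Email";
  recipients: string[];
  messageType: string;
  message: string;
  sentStatus: SentStatus;
  sentDate?: Date;
  referenceEntityType?: string;
  referenceEntityName?: string;
  errorType?: string;
  errorDescription?: string;
  arguments?: Record<string, unknown>;
}

// Tiny helper showing the assumed defaults on creation: a fresh upload
// starts as Pending with both dates set to now.
function newUploadedDataset(
  fields: Omit<UploadedDataset, "status" | "lastUploadDate" | "lastStatusUpdateDate">
): UploadedDataset {
  const now = new Date();
  return { ...fields, status: "Pending", lastUploadDate: now, lastStatusUpdateDate: now };
}
```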
stevenyaga commented 1 month ago