plantinformatics / pretzel

Javascript full-stack framework for Big Data visualisation and analysis
GNU General Public License v3.0
42 stars 13 forks source link

support for .xlsx upload #233

Closed Don-Isdale closed 3 months ago

Don-Isdale commented 3 years ago

Introduction

Accepting .xlsx files via upload will take Pretzel closer to the working environment of the users, and reduce friction for them.

This task will result in 1 or more spreadsheet templates which are designed to contain the information required for a Pretzel dataset upload. Example spreadsheet templates : genetic / linkage map, QTLs, SNP list.

The node.js server will call out to bash scripts which are already prototyped, which convert .xlsx -> CSV -> Pretzel JSON

The template will contain a worksheet with general instructions on use and guidelines for data formats and naming conventions. The column headers will guide the users to meet the format requirements.

It is important that format errors are reported back to the user and include information such as

The spreadsheet can contain multiple datasets, each in a single worksheet. (a future feature might enable 1 block per worksheet, or per-spreadsheet file using api/Dataset/blockFeaturesAdd to make it easier for users to handle larger datasets).

The first worksheet can contain metadata, possibly in a 2D table to contain the metadata for each worksheet/dataset. There may also be some metadata which is common to all the datasets in a file. A separate worksheet, possibly last, will contain guidance / documentation for the users. Data-checking functions can be built into the spreadsheet.

A common cause of difficulty for users is duplicate dataset names caused by repeated upload. This function should probably delete the existing dataset if the spreadsheet is uploaded again (using the API function which also deletes blocks and features). The API parameters could include a flag to enable this, which could be a checkbox in the upload GUI; users may sometimes want to not overwrite work they have done previously. - (for discussion) This will significantly streamline the process for experienced users also, because finding the correct dataset to delete in 'All Datasets' takes extra time before each upload, which may be repeated several times as the data is refined.

The existing file upload should suffice for .xlsx in the first implementation. A good option to follow that would be addition of drag & drop to the file upload, for a further improvement in UX flow. There are a number of options; a quick scan suggests github.com/adopted-ember-addons/ember-file-upload as a first choice. More options at emberObserver : file-upload and drag-and-drop.

In a later stage, an export function can also be added to output a Pretzel dataset in this Pretzel standard spreadsheet format. This could be used for data exchange between users of different Pretzel instances; JSON format could also be used, but the spreadsheet format is more familiar and useful for most people. This is probably easiest to do via an API, using similar tools (ssconvert, perl/jq).

These earlier notes, following, will be developed in discussion to refine the requirements and select the initial feature set.


(notes from 2021Mar20 :)

-   add some spreadsheet templates for GMs etc :
file name : LinkageMap_....xlsx     (optional : project, date, Variety1xVariety2 or Variety1_Variety2_Variety3_...)
first worksheet : metadata
worksheet : Variety1 x Variety2     (The worksheet name is used for the Dataset name)
meta    name    value (try json in MS Excel)
name    chr pos
Marker  Chromosome  Position
Chromosome  Marker  Position
No outside " required; there may be spaces within names. The script will strip off outer quotes and spaces.
There may be punctuation in Marker and Chromosome names.

Marker : names should match names used in other datasets, so that they can be aligned.
Chromosome : leading 'chr' is not required and will be stripped off.
Position : will be interpreted as a number.

The script assumes that Markers / SNPs for a single chromosome are grouped together, not intermingled.
The order in which chromosomes are given is the order they will be displayed in Pretzel in the Dataset Explorer (left panel).
Within a chromosome,  Markers / SNPs may be unsorted; they will be displayed on the Pretzel axis in Position order.

Example of using the script

Overview of MVP

Options subsequent to MVP


Initial outline of sub-tasks

Later options


Issues



Current Items

branch : feature/qtlUpload

1ad462f7 : use worksheetname for dataset if parentName given in Metadata instead of column

from testing 2021Sep14 8:30pm reported in slack :


after 2.11.0

7bee8922 : QTL spreadsheet upload : avoid extra output to stdout which was obstructing the error message display

Don-Isdale commented 3 years ago

Spreadsheet Template Format

Worksheet name identifies the type and name of the dataset which it contains. The spreadsheet file names are arbitrary, not limited or interpreted by the Pretzel application.


Dataset worksheets :

Vertical bar | is used as the separator between the worksheet dataset type label and the dataset name. The dataset name appears after the label, e.g. 'Map| Red x Blue' (outside spaces will be trimmed)

There are 2 distinct worksheet types for SNP (which has a 1 bp position) and Alignment (which has Start / End), e.g. the alignment of probes (for SNPs) to a reference assembly.

Non-dataset worksheets :

These may contain comments : # comment (in column 1) The comments in the provided templates will guide the user through populating the spreadsheet with data, explaining the format and field types. URLs in the comments will refer to further explanation in the user guide, and to the folder in the github repository where the user may download templates to start a new file.


Example worksheet 'Metadata' :

# Add columns for each dataset in Workbook

Field Alignment\ EST_SNP
commonName Lentil
parentName Lens_culinaris_2.0
platform SNP_OPA
shortName SNP_OPA

Example worksheet 'Chromosome Renaming'

From To
Lcu.2RBY.Chr1 Lc1
Lcu.2RBY.Chr2 Lc2
Lcu.2RBY.Chr3 Lc3
Lcu.2RBY.Chr4 Lc4
Lcu.2RBY.Chr5 Lc5
Lcu.2RBY.Chr6 Lc6
Lcu.2RBY.Chr7 Lc7

Example worksheet 'Chromosomes to Omit'

Lcu.2RBY.unitig

Column names of the dataset worksheet types :

Marker Chromosome Position
Name Chromosome Position
Name Chromosome Start End
Chromosome Start End

Additional columns other than these fields : values are placed in Feature .values.fieldName, or for Genome : Block.meta.fieldName


Additional worksheets which don't match the names defined above are ignored - this enables users to keep additional information and data preparation worksheets in the same file.

Don-Isdale commented 3 years ago

implementing the template format described in the above comment :

Don-Isdale commented 3 years ago

issues reported in testing :

gabrielkg commented 3 years ago

Testing QTL upload on dev v2.10.0+83031857:

Adding Start position, the file was uploaded successfully. Then attempting to load the dataset:

gabrielkg commented 2 years ago

Testing QTL upload on dev v2.10.0+9ba3e3f6:

Issues noted above are fixed.

It was noted that if the QTLs in the Excel sheet are ordered by eg: trait, and different traits appear on the same chromosome, the result is multiple blocks created with the same scope.