Closed Don-Isdale closed 3 months ago
The worksheet name identifies the type and name of the dataset which it contains. The spreadsheet file names are arbitrary; they are not restricted or interpreted by the Pretzel application.
The vertical bar | is used as the separator between the worksheet dataset type label and the dataset name.
The dataset name appears after the label, e.g. 'Map| Red x Blue' (leading and trailing spaces are trimmed).
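The naming rule above can be sketched in a few lines of Python. This is an illustration only, not the Pretzel implementation: the function name parse_worksheet_name is hypothetical, and returning None for a name without a separator is an assumption about how such worksheets might be handled.

```python
def parse_worksheet_name(sheet_name):
    """Split 'Type| Dataset name' into (type_label, dataset_name).

    Returns None when the name contains no '|' separator
    (assumption: the caller decides what to do with such worksheets).
    """
    if "|" not in sheet_name:
        return None
    # Split on the first '|' only, so the dataset name may itself contain '|'.
    type_label, dataset_name = sheet_name.split("|", 1)
    # Spaces around the label and the name are trimmed.
    return type_label.strip(), dataset_name.strip()


print(parse_worksheet_name("Map| Red x Blue"))  # ('Map', 'Red x Blue')
```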
SNP and Alignment are 2 distinct worksheet types: SNP has a single 1 bp position, whereas Alignment has Start / End, e.g. the alignment of probes (for SNPs) to a reference assembly.
Worksheets may contain comments, marked by # in column 1, e.g. # comment. The comments in the provided templates will guide the user through populating the spreadsheet with data, explaining the format and field types. URLs in the comments will refer to further explanation in the user guide, and to the folder in the GitHub repository where the user may download templates to start a new file.
# Add columns for each dataset in Workbook
Field | Alignment\| EST_SNP | |
---|---|---|
commonName | Lentil | |
parentName | Lens_culinaris_2.0 | |
platform | SNP_OPA | |
shortName | SNP_OPA | |
From | To |
---|---|
Lcu.2RBY.Chr1 | Lc1 |
Lcu.2RBY.Chr2 | Lc2 |
Lcu.2RBY.Chr3 | Lc3 |
Lcu.2RBY.Chr4 | Lc4 |
Lcu.2RBY.Chr5 | Lc5 |
Lcu.2RBY.Chr6 | Lc6 |
Lcu.2RBY.Chr7 | Lc7 |
Lcu.2RBY.unitig | |
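A minimal sketch of how a From / To renaming table like the one above might be applied, in Python. It is not the Pretzel code. Two points are assumptions made for illustration: a row with an empty To cell leaves the name unchanged (the real behaviour may differ), and the original name is kept under a hypothetical key, originalName, in the block's meta.

```python
# From -> To pairs taken from the example table; an empty string means no To value.
RENAMING = {
    "Lcu.2RBY.Chr1": "Lc1",
    "Lcu.2RBY.Chr2": "Lc2",
    "Lcu.2RBY.unitig": "",
}


def rename_block(chromosome, renaming=RENAMING):
    """Return a block-like dict with the renamed scope.

    Assumption: names that are absent from the table, or have an empty To,
    are kept unchanged. 'originalName' is a hypothetical meta key.
    """
    new_name = renaming.get(chromosome, "")
    if not new_name:
        return {"scope": chromosome, "meta": {}}
    return {"scope": new_name, "meta": {"originalName": chromosome}}


print(rename_block("Lcu.2RBY.Chr1"))
```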
Marker | Chromosome | Position |
---|---|---|

Name | Chromosome | Position |
---|---|---|

Name | Chromosome | Start | End |
---|---|---|---|

Chromosome | Start | End |
---|---|---|
Columns other than the fields listed above are treated as additional fields: their values are placed in Feature.values.fieldName, or, for Genome worksheets, in Block.meta.fieldName.
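As a rough illustration of this rule, the Python sketch below separates one worksheet row into its defined fields and a values dict for everything else. The function name split_row, the set of core fields and the sample column Trait are assumptions chosen for the example; the actual conversion scripts may work differently.

```python
# Core fields for an Alignment-style worksheet (taken from the header rows above).
CORE_FIELDS = {"Name", "Chromosome", "Start", "End"}


def split_row(row, core_fields=CORE_FIELDS):
    """Split a row (column name -> cell value) into core fields and extra values.

    The extra values correspond to what would be stored under Feature.values.
    """
    core = {k: v for k, v in row.items() if k in core_fields}
    values = {k: v for k, v in row.items() if k not in core_fields}
    return core, values


row = {"Name": "probe1", "Chromosome": "Lc1", "Start": "100", "End": "150",
       "Trait": "example"}  # 'Trait' is a made-up additional column
core, values = split_row(row)
print(values)  # {'Trait': 'example'}
```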
Additional worksheets which don't match the names defined above are ignored - this enables users to keep additional information and data preparation worksheets in the same file.
implementing the template format described in the above comment :
Testing QTL upload on dev v2.10.0+83031857:
Adding Start position, the file was uploaded successfully. Then attempting to load the dataset:
Testing QTL upload on dev v2.10.0+9ba3e3f6:
Issues noted above are fixed.
It was noted that if the QTLs in the Excel sheet are ordered by some other column, e.g. trait, and different traits appear on the same chromosome, the result is multiple blocks created with the same scope.
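The effect can be reproduced, and avoided, with a small Python sketch. It keys blocks by scope so that row order no longer matters. This only illustrates the idea and is not the change made in Pretzel; the column names Name, Chromosome and Trait and the sample rows are made up.

```python
def group_into_blocks(rows):
    """Create one block per chromosome (scope), regardless of row order."""
    blocks = {}
    for row in rows:
        # setdefault creates the feature list the first time a scope is seen,
        # so a chromosome that reappears later is added to the same block.
        blocks.setdefault(row["Chromosome"], []).append(row["Name"])
    return [{"scope": scope, "features": names} for scope, names in blocks.items()]


# Rows ordered by trait: Lc1 appears under both traits.
rows = [
    {"Name": "qtl1", "Chromosome": "Lc1", "Trait": "A"},
    {"Name": "qtl2", "Chromosome": "Lc2", "Trait": "A"},
    {"Name": "qtl3", "Chromosome": "Lc1", "Trait": "B"},
]
for block in group_into_blocks(rows):
    print(block)
```

A converter that instead starts a new block whenever the chromosome value changes from one row to the next would produce three blocks here, two of them with scope Lc1.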
Introduction
Accepting .xlsx files via upload will take Pretzel closer to the working environment of the users, and reduce friction for them.
This task will result in 1 or more spreadsheet templates which are designed to contain the information required for a Pretzel dataset upload. Example spreadsheet templates : genetic / linkage map, QTLs, SNP list.
The Node.js server will call out to bash scripts, which are already prototyped, to convert .xlsx -> CSV -> Pretzel JSON.
The template will contain a worksheet with general instructions on use and guidelines for data formats and naming conventions. The column headers will guide the users to meet the format requirements.
It is important that format errors are reported back to the user and include information such as
The spreadsheet can contain multiple datasets, each in a single worksheet. (A future feature might enable 1 block per worksheet, or per spreadsheet file, using api/Dataset/blockFeaturesAdd, to make it easier for users to handle larger datasets.)
The first worksheet can contain metadata, possibly in a 2D table to contain the metadata for each worksheet/dataset. There may also be some metadata which is common to all the datasets in a file. A separate worksheet, possibly last, will contain guidance / documentation for the users. Data-checking functions can be built into the spreadsheet.
A common cause of difficulty for users is duplicate dataset names caused by repeated upload. This function should probably delete the existing dataset if the spreadsheet is uploaded again (using the API function which also deletes blocks and features). The API parameters could include a flag to enable this, which could be a checkbox in the upload GUI; users may sometimes want to avoid overwriting work they have done previously (for discussion). This will also significantly streamline the process for experienced users, because finding the correct dataset to delete in 'All Datasets' takes extra time before each upload, which may be repeated several times as the data is refined.
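The proposed flag could behave roughly as in the Python sketch below. Everything here is hypothetical: the store is a plain dict standing in for the database, and upload_dataset, DatasetExistsError and the replace parameter are illustrative names, not part of the Pretzel API.

```python
class DatasetExistsError(Exception):
    """Raised when a dataset name is already present and replace is not set."""


def upload_dataset(store, dataset, replace=False):
    """Add a dataset to the store, optionally replacing an existing one.

    'store' maps dataset name -> dataset; deleting the entry stands in for
    the API call which also deletes the dataset's blocks and features.
    """
    name = dataset["name"]
    if name in store:
        if not replace:
            raise DatasetExistsError(f"dataset '{name}' already exists")
        del store[name]
    store[name] = dataset


store = {}
upload_dataset(store, {"name": "Red x Blue", "blocks": []})
# A repeated upload succeeds only when the user has ticked the checkbox.
upload_dataset(store, {"name": "Red x Blue", "blocks": ["Lc1"]}, replace=True)
```

Leaving replace off by default keeps previous work safe unless the user explicitly asks to overwrite it.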
The existing file upload should suffice for .xlsx in the first implementation. A good option to follow that would be addition of drag & drop to the file upload, for a further improvement in UX flow. There are a number of options; a quick scan suggests github.com/adopted-ember-addons/ember-file-upload as a first choice. More options at emberObserver : file-upload and drag-and-drop.
In a later stage, an export function can also be added to output a Pretzel dataset in this Pretzel standard spreadsheet format. This could be used for data exchange between users of different Pretzel instances; JSON format could also be used, but the spreadsheet format is more familiar and useful for most people. This is probably easiest to do via an API, using similar tools (ssconvert, perl/jq).
The earlier notes which follow will be developed in discussion to refine the requirements and select the initial feature set.
(notes from 2021Mar20 :)
Overview of MVP
Options subsequent to MVP
Initial outline of sub-tasks
Later options
drag & drop
[x] add support for SNP List (da49ba4e)
add support for QTL, Genome ( 756f60e6 : add QTL import; this is a first pass, will evolve.)
more error-checking
add .zip support, also .gz.
[x] different metadata for worksheets within a file (cfb8eaf3, also noted in later comment)
[x] [/4H] record the original Chromosome name in block.meta when applying 'Chromosome Renaming' (5e83c058)
Issues
deleteRecord
onCurrent Items
branch : feature/qtlUpload
1ad462f7 : use worksheetname for dataset if parentName given in Metadata instead of column
[x] [1-2H/1H] support parent name in Metadata worksheet as an alternative to parent column in QTL worksheet 1ad462f7 : QTL spreadsheet upload : support parentName in Metadata worksheet as an alternative to parentName column in QTL worksheet;
[x] [1H/0H] handle absence of 'Chromosome Renaming' worksheet. The case of this sheet being empty was handled in 82259290, and that has also addressed this item.
from testing 2021Sep14 8:30pm reported in slack :
[x] [1-2H/0.5H] QTL name if parentName column : dataset-parentName. 7c1ccfec
[x] [/1H] 5e1e2ca8: QTL spreadsheet upload : handle and report block.scope not matched in dataset.parent
[x] [/1H] a01e4712: spreadsheet upload : remove non-ascii chars on the outside of values
[x] [1-2H] report error in GUI if Metadata is not given for QTL worksheet
[x] [/5H] QTL spreadsheet upload : sort by parentName (if defined) then chr column. 445eab2a
[x] [/1H] QTL spreadsheet upload : when matching parent of QTL, exclude datasets which are copied from another server. 3a82877f (bc7d4a4a)
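The item above about removing non-ascii characters on the outside of values can be illustrated with a short Python sketch. It is an independent example, not the code from a01e4712, and the function name strip_outer_non_ascii is hypothetical. Only leading and trailing non-ASCII characters are removed; those inside the value are kept.

```python
def strip_outer_non_ascii(value):
    """Remove non-ASCII characters from both ends of a cell value."""
    start, end = 0, len(value)
    # Advance past leading non-ASCII characters (e.g. a no-break space).
    while start < end and not value[start].isascii():
        start += 1
    # Step back over trailing non-ASCII characters (e.g. a zero-width space).
    while end > start and not value[end - 1].isascii():
        end -= 1
    return value[start:end]


print(strip_outer_non_ascii("\u00a0Lc1\u200b"))          # Lc1
print(strip_outer_non_ascii("\u201cM\u00fcller 2020\u201d"))  # Müller 2020
```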
after 2.11.0
[x] [1-2H/1H] if the parentName is not in the db, report error to frontend GUI. ab401449
[x] [1-2H/3H] the citation is truncated in the Metadata worksheet (the citation column of the dataset worksheet is OK). 067719ab: QTL spreadsheet upload : don't split on comma within quoted cell value in the Metadata worksheet
[ ] [4-8H] upload : convert to ASCII any Unicode punctuation which has an ASCII equivalent. This is not limited to spreadsheet upload; it can apply to CSV, table and JSON also. Ref: https://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations
[1-2H/2.5H] the sort by parentName failed when it was in a column further right, so there is a column id issue, probably caused by a comma or punctuation in the headers. 0e10d224 : handle comma in columns prior to the parentName column
7bee8922 : QTL spreadsheet upload : avoid extra output to stdout which was obstructing the error message display
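Both comma-related items above (the truncated citation, and the column id shifting when an earlier header contains a comma) come from splitting CSV lines on every comma. The Python sketch below contrasts a naive split with a quote-aware parser from the standard csv module. It is an illustration of the problem only; the fixes in Pretzel were made in its own conversion scripts. The citation text and the header 'Trait, unit' are made-up sample data.

```python
import csv
import io

# Metadata worksheet exported to CSV; the citation cell contains commas.
METADATA_CSV = (
    'Field,QTL| Example\n'
    'citation,"Author, A. and Author, B. (2020)"\n'
)
# Dataset worksheet header with a comma inside a quoted column name.
HEADER_LINE = 'Name,"Trait, unit",parentName'


def read_metadata(csv_text):
    """Parse Field/value rows with a quote-aware CSV reader."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if row}


def column_index(header_line, column_name):
    """Find a column's index, treating quoted commas as part of the cell."""
    header = next(csv.reader([header_line]))
    return header.index(column_name)


print(read_metadata(METADATA_CSV)["citation"])            # full citation
print(METADATA_CSV.splitlines()[1].split(",")[1])         # truncated: "Author
print(column_index(HEADER_LINE, "parentName"))            # 2
print(HEADER_LINE.split(",").index("parentName"))         # 3 (shifted by the comma)
```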