Specifications for directory naming, file naming, file contents in the LIS datastore
Any of the file-containing directories can contain a README file and a CHANGES file.
Every file-containing directory (also known as a "collection") in the LIS datastore should contain a README file in YAML format.
Filename: README.[collection].yml
Examples:
The basic README structure (acceptable field names, strings vs. lists vs. dates) can be validated against the JSON schema definition readme.schema.json using the following command:
ajv -s readme.schema.json -d README.[collection].yml --all-errors --coerce-types=array --remove-additional=all --changes
This schema must be kept up to date along with the sample template README.collection.yml when any changes are made to the README spec.
READMEs must be YAML-compliant, which means they pass the test on http://www.yamllint.com/ or using the yamllint
command-line utility. Here are some, but not all, requirements for a valid LIS README:
identifier at the top repeats the name of the collection, i.e. the name of the containing directory.
synopsis should be short, 100 characters or less.
genotype is a YAML array, but use a single "strain1 x strain2" value for bi-parental crosses.
publication_doi (and any other DOI) is a DOI, not a URL (e.g. 10.1534/g3.118.200521).
publication_doi is REQUIRED. If the data were generated by LIS, use the default LIS publication:
publication_doi: 10.1093/nar/gkv1159
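The requirements listed above can be sketched as a small check on a README that has already been parsed into a Python dict. This is only an illustration, not part of the LIS tooling: the function name check_readme, the DOI pattern, and the collection name in the usage lines are assumptions, and it covers only the requirements named in this section.

```python
import re

# Assumed pattern for a bare DOI such as 10.1534/g3.118.200521 (not an official rule).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def check_readme(readme, collection):
    """Return a list of problems found in a parsed README mapping.

    `collection` is the name of the containing directory.
    """
    problems = []
    if readme.get("identifier") != collection:
        problems.append("identifier must repeat the collection (directory) name")
    synopsis = str(readme.get("synopsis") or "")
    if len(synopsis) > 100:
        problems.append("synopsis is longer than 100 characters")
    if not isinstance(readme.get("genotype"), list):
        problems.append("genotype must be a YAML array")
    if "publication_doi" not in readme:
        problems.append("publication_doi is REQUIRED")
    # Any field ending in _doi must hold a bare DOI, not a URL.
    for key, value in readme.items():
        if key.endswith("_doi") and not DOI_PATTERN.match(str(value)):
            problems.append(f"{key} must be a bare DOI, not a URL")
    return problems


# Hypothetical README contents, for illustration only.
example = {
    "identifier": "Wm82.gnm2.DTC4",
    "synopsis": "Example genome assembly collection",
    "genotype": ["Wm82"],
    "publication_doi": "10.1093/nar/gkv1159",
}
print(check_readme(example, "Wm82.gnm2.DTC4"))
```

A check like this complements schema validation: the schema covers field names and types, while the directory-name and DOI-versus-URL rules need the extra context shown here.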
A directory may contain a MANIFEST.collection.correspondence.yml file which lists the current filenames and prior filenames:
---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz
... and also a MANIFEST.collection.descriptions.yml file which briefly describes the files:
---
# filename in this repository: description
glyma.Wm82.gnm2.DTC4.hardmasked.fna.gz: "Genome assembly: masked with 'N's"
glyma.Wm82.gnm2.DTC4.softmasked.fna.gz: "Genome assembly: masked with lowercase"
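As a rough illustration of how the two MANIFEST files might be read and compared, the sketch below parses the flat "filename: value" layout shown above without a YAML library. The function names are hypothetical, the parser handles only this one-line-per-file layout (a real YAML parser would be more robust), and the idea that every file in the correspondence manifest should also have a description is an assumption, not a stated rule.

```python
def parse_flat_manifest(text):
    """Parse 'filename: value' lines, skipping '---', blank lines and '#' comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "---" or line.startswith("#"):
            continue
        # Filenames contain no colon, so the first colon separates key and value.
        filename, _, value = line.partition(":")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        entries[filename.strip()] = value
    return entries


def missing_descriptions(correspondence, descriptions):
    """Return filenames listed in the correspondence manifest but not described."""
    return sorted(set(correspondence) - set(descriptions))


correspondence_text = """---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz
"""
renamed = parse_flat_manifest(correspondence_text)
for current, previous in renamed.items():
    print(previous, "->", current)
```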
A directory may contain a CHANGES.collection.txt file which lists file transformations and changes. For example:
file transformations:
seqlen.awk vigan.Gyeongwon.a3.v1.cds.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest
fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.cds.fa | sort > tmp.fa.1ln
join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5c.cds_primaryTranscript.fna
seqlen.awk vigan.Gyeongwon.a3.v1.peptide.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest
fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.peptide.fa | sort > tmp.fa.1ln
join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5f.protein_primaryTranscript.faa
changes:
2018-03-03 Added MANIFEST files
2018-09-15 Changed fastas to include full prefixing (s/vigan/vigan.Gyeongwon.gnm3.ann1/)
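The file transformations in the example above appear to keep the longest transcript per gene to build the primaryTranscript files. The helper scripts (seqlen.awk, top_line.awk, fasta_to_zero_lines.awk) are not shown here, so the following is only a sketch of that idea under stated assumptions: the function names are hypothetical, the gene ID is taken to be the transcript ID minus its final ".N" suffix, and ties keep the first record seen.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_per_gene(records):
    """Return the longest (identifier, sequence) record per gene, sorted by gene ID.

    Assumes transcript IDs look like GENE.N, where N is the transcript number.
    """
    best = {}
    for name, seq in records:
        gene = name.rsplit(".", 1)[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return [best[gene] for gene in sorted(best)]


# Made-up identifiers and sequences, for illustration only.
example = """>geneA.g1.1
ATGAAA
>geneA.g1.2
ATGAAACCC
>geneB.g2.1
ATG
"""
for name, seq in longest_per_gene(read_fasta(example.splitlines())):
    print(">" + name)
    print(seq)
```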