legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Any of the file-containing directories can contain a README file and a CHANGES file.

README YAML files

Every file-containing directory, AKA "collection", in the LIS datastore should contain a README file in YAML format.

Filename: README.[collection].yml
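As a small illustration of this naming rule, the sketch below derives the README name expected for a collection directory and checks whether that file is present. It assumes that `[collection]` is the name of the directory itself. The helper names `expected_readme_name` and `has_readme` are hypothetical and not part of any LIS tooling.

```python
from pathlib import Path


def expected_readme_name(collection_dir):
    """Return the README filename expected for a collection directory.

    Assumption: [collection] is the directory's own name.
    """
    return f"README.{Path(collection_dir).name}.yml"


def has_readme(collection_dir):
    """Return True if the directory contains its expected README file."""
    directory = Path(collection_dir)
    return (directory / expected_readme_name(directory)).is_file()
```

A check like this could be run over every collection directory to list those that still lack a README.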

Examples:

Validation

The basic README structure (acceptable field names, strings vs. lists vs. dates) can be validated using the following command:

ajv -s readme.schema.json -d README.[collection].yml --all-errors --coerce-types=array --remove-additional=all --changes

using the JSON schema definition readme.schema.json.
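To validate many READMEs at once, the same command can be wrapped in a short script. The sketch below is one possible approach, not an official LIS tool: it assumes the `ajv` command-line tool is installed and on the PATH, that it exits with a non-zero status when validation fails, and that READMEs can be found by the pattern `README.*.yml`. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Flags copied from the ajv command shown above.
AJV_FLAGS = [
    "--all-errors",
    "--coerce-types=array",
    "--remove-additional=all",
    "--changes",
]


def build_ajv_command(readme_path, schema_path="readme.schema.json"):
    """Build the argument list for validating one README with ajv."""
    return ["ajv", "-s", str(schema_path), "-d", str(readme_path), *AJV_FLAGS]


def find_readmes(root):
    """Find README.*.yml files anywhere below root, in sorted order."""
    return sorted(Path(root).rglob("README.*.yml"))


def validate_all(root, schema_path="readme.schema.json"):
    """Run ajv on every README under root and return the paths that failed.

    Assumption: ajv signals a validation failure with a non-zero exit code.
    """
    failed = []
    for readme in find_readmes(root):
        result = subprocess.run(build_ajv_command(readme, schema_path))
        if result.returncode != 0:
            failed.append(readme)
    return failed
```

Calling `validate_all(".")` from the top of a datastore checkout would then return the READMEs that need attention.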

This schema must be kept up to date along with the sample template README.collection.yml when any changes are made to the README spec.

Content requirements

READMEs must be valid YAML: they should pass the check at http://www.yamllint.com/ or the yamllint command-line utility. Here are some, but not all, requirements for a valid LIS README:

Gotchas

MANIFEST files

A directory may contain a MANIFEST.collection.correspondence.yml file which maps each current filename to its previous name(s):

---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz

... and also a MANIFEST.collection.descriptions.yml file which briefly describes the files:

---
# filename in this repository: description
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: "Genome assembly: masked with 'N's"
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: "Genome assembly: masked with lowercase"
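Both manifests are keyed by filename, so one useful check is that they list the same files. The sketch below illustrates that check under two assumptions: each manifest is a flat `filename: value` mapping, and the two files are expected to cover the same filenames. The parser is deliberately minimal and is not a YAML parser. The function names are hypothetical.

```python
def parse_flat_manifest(text):
    """Parse a flat 'filename: value' manifest into a dict.

    Minimal by design: skips blank lines, '---' and '#' comment lines,
    and splits each remaining line at the first ': '.
    """
    entries = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "---" or stripped.startswith("#"):
            continue
        filename, _, value = stripped.partition(": ")
        # Drop surrounding double quotes, if the value was quoted.
        entries[filename] = value.strip().strip('"')
    return entries


def manifest_mismatches(correspondence_text, descriptions_text):
    """Return filenames listed in only one of the two manifests, sorted."""
    names_a = set(parse_flat_manifest(correspondence_text))
    names_b = set(parse_flat_manifest(descriptions_text))
    return sorted(names_a ^ names_b)
```

An empty result means the two manifests agree on which files they describe.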

CHANGES files

A directory may contain a CHANGES.collection.txt file which lists file transformations and changes. For example:

file transformations:

seqlen.awk vigan.Gyeongwon.a3.v1.cds.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.cds.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5c.cds_primaryTranscript.fna

seqlen.awk vigan.Gyeongwon.a3.v1.peptide.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.peptide.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5f.protein_primaryTranscript.faa

changes:

2018-03-03 Added MANIFEST files
2018-09-15 Changed fastas to include full prefixing (s/vigan/vigan.Gyeongwon.gnm3.ann1/)
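The file transformations in the CHANGES example above appear to keep one primary (longest) transcript per gene: sequence lengths are computed, IDs are split into a gene part and a transcript number, and the longest entry per gene is joined back to its sequence. The helper scripts `seqlen.awk`, `top_line.awk` and `fasta_to_zero_lines.awk` are not shown here, so that reading is an assumption. The Python sketch below illustrates the same idea. It is not the procedure used to build the datastore files, and the function names are hypothetical.

```python
import re

# Modeled on the perl substitution above: <gene>.<transcript number>,
# where the gene part is two word-character fields joined by a dot.
ID_PATTERN = re.compile(r"^(\w+\.\w+)\.(\d+)$")


def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def longest_per_gene(records):
    """Keep the longest transcript for each gene, sorted by record ID.

    Ties keep the record seen first; the shell pipeline's tie-breaking
    depends on sort and may differ.
    """
    best = {}
    for record_id, sequence in records:
        match = ID_PATTERN.match(record_id)
        if match is None:
            continue  # ID is not of the form gene.transcript; skipped here
        gene = match.group(1)
        if gene not in best or len(sequence) > len(best[gene][1]):
            best[gene] = (record_id, sequence)
    return sorted(best.values())
```

With an open FASTA file `handle`, `longest_per_gene(read_fasta(handle))` returns the retained records, which can then be written out under the new filename.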