Open manulera opened 1 year ago
This is the contents of the misc file with links to their download location from the website. https://docs.google.com/spreadsheets/d/1bMgK_AGCejZn6Y11BS4da0Dv8Z3LfppbcJxYRy6AQZs/edit#gid=0
Perhaps we should move these all as subdirectories of the main directory in case we eventually point people here instead of the "latest" directory
We also need to deal with the files in "exports". These would be classed as "Files_for_external_resources".
As a small step, https://pombase.org/latest_release now redirects to the latest release. Currently this is manually configured in the Apache config file and will need to be updated when there's a new release. It can be automated, but I need to do more reading of the Apache documentation. :-)
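One way the "which release is latest" part could be automated, independent of how Apache is eventually configured: a minimal sketch, assuming release directories are always named like pombase-2023-08-02. The function names and the idea of generating the redirect target from a directory listing are hypothetical, not how the server currently works.

```python
import re
from datetime import date

# Assumption: release directories are named "pombase-YYYY-MM-DD".
RELEASE_RE = re.compile(r"^pombase-(\d{4})-(\d{2})-(\d{2})$")


def latest_release(names):
    """Return the release directory name with the most recent date, or None."""
    dated = []
    for name in names:
        match = RELEASE_RE.match(name)
        if not match:
            continue  # skip anything that is not a release directory
        year, month, day = map(int, match.groups())
        try:
            dated.append((date(year, month, day), name))
        except ValueError:
            continue  # skip names with an impossible date, e.g. month 13
    return max(dated)[1] if dated else None


def redirect_target(names, base="/releases/"):
    """Build the path that /latest_release would point at (hypothetical helper)."""
    latest = latest_release(names)
    return f"{base}{latest}/" if latest else None


print(redirect_target(["pombase-2023-08-02", "pombase-2024-06-01", "README.txt"]))
# prints: /releases/pombase-2024-06-01/
```

The output of something like this could be written into a config fragment at release time, so the redirect would not need editing by hand.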
I've made a temporary directory on pombase.org with the new file structure so that we can track progress: "/public_releases"
Currently it only has the files and directories from the spreadsheet.
There are no links to it on the website so hopefully no one will find it. :-)
The URL is https://pombase.org/releases/ with "releases" replaced by "public_releases".
When we're happy and everything's done we can remove/rename the old /releases directory and move /public_releases over.
We need to look at this issue again after the rearranging is done: https://github.com/pombase/pombase-chado/issues/1060
Hi @kimrutherford. I had a look at the directory structure. Some changes / additions I think make sense.
genome_data:
- Move feature_coordinate_files into this directory.
- embl_files. Include .contig embl files there.
- fasta, gff, identical to those in https://pombase.org/releases/pombase-2023-08-02/
- gtf, including the file produced by https://github.com/manulera/pombase_gtf_file (should be run nightly using docker)
annotation_datasets:
- Create orthologs directory (not sure what needs to be included).
- Create a file external_data_versions.md, with several sections:
- ontologies dir, like the one in the current release, but with that I don't think it's necessary.
Keep me in the loop for this one! This should be part of the announcement mentioned in https://github.com/pombase/website/issues/2042
what do we still need to do here?
Very little is done yet. I've only had a think about how to handle this. I haven't actually done anything yet.
From pombase/pombase-chado#720:
Add a README describing the files and directories to each new release directory.
@kimrutherford this would be a good one to address, so that we can announce the "gene structure history" v
Summary
Create new directory structure with currently named files
Keep the old files but remove all references to them
I've finally got back to this.
Create orthologs directory (not sure what needs to be included).
@ValWood: Which files from pombe-embl/orthologs/ should be included in the release directories?
I've implemented most of the suggestions from @manulera except for the gtf directory and this: "Create a file external_data_versions.md, with several sections:"
I'm still working on those.
An example of the current progress is here: https://www.pombase.org/public_releases/pombase-2024-06-01/ (It's not linked to from the website)
I'd like to rename files to be consistent:
- .gz vs not .gz
We should add a README in every directory describing the files.
Example of current file structure:
PomBase_release_notes_2024-06-01.txt
annotation_datasets
annotation_datasets/high_confidence_physical_interactions
annotation_datasets/high_confidence_physical_interactions/pombase-go-substrates.tsv.gz
annotation_datasets/high_confidence_physical_interactions/pombase-go-physical-interactions.tsv.gz
annotation_datasets/gene_expression_table.tsv
annotation_datasets/pombase-modifications.tsv.gz
annotation_datasets/Complex_annotation.tsv
annotation_datasets/gene_ontology
annotation_datasets/gene_ontology/gene_product_information_taxonid_4896.tsv
annotation_datasets/gene_ontology/pombase_style_gaf.tsv
annotation_datasets/gene_ontology/go_style_gaf.tsv
annotation_datasets/gene_ontology/gene_product_annotation_data_taxonid_4896.tsv
annotation_datasets/disease_association.tsv
annotation_datasets/protein_modifications
annotation_datasets/phenotypes_and_genotypes
annotation_datasets/phenotypes_and_genotypes/pombase-phenotype-annotation.eco.phaf.gz
annotation_datasets/phenotypes_and_genotypes/gene_viability.tsv
annotation_datasets/phenotypes_and_genotypes/all_alleles.tsv
annotation_datasets/phenotypes_and_genotypes/pombase-phenotype-annotation.phaf.gz
auto_generated_protein_feature_files
auto_generated_protein_feature_files/disordered_regions.tsv
auto_generated_protein_feature_files/PeptideStats.tsv
auto_generated_protein_feature_files/transmembrane_domain_coords_and_seqs.tsv
auto_generated_protein_feature_files/aa_composition.tsv
auto_generated_protein_feature_files/ProteinFeatures.tsv
chado_database
chado_database/pombase-chado-2024-06-01.sql.gz
chado_database/PomBase_chado_database_README.txt
exports_for_external_resources
exports_for_external_resources/rnacentral.json
exports_for_external_resources/allele_summaries.json
exports_for_external_resources/apicuron_data.json
exports_for_external_resources/publications_with_annotations.txt
genome_data
genome_data/fasta
genome_data/fasta/feature_sequences
genome_data/fasta/feature_sequences/five_prime_utrs.fa.gz
genome_data/fasta/feature_sequences/peptide.fa.gz
genome_data/fasta/feature_sequences/cds.fa.gz
genome_data/fasta/feature_sequences/cds+introns.fa.gz
genome_data/fasta/feature_sequences/cds+introns+utrs.fa.gz
genome_data/fasta/feature_sequences/three_prime_utrs.fa.gz
genome_data/fasta/feature_sequences/introns_within_cds.fa.gz
genome_data/fasta/chromosomes
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_mitochondrial_chromosome.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_I.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_III.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_II.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chr_II_telomeric_gap.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_mating_type_region.fa.gz
genome_data/feature_coordinate_files
genome_data/feature_coordinate_files/chromosome_2.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.cds.coords.tsv
genome_data/feature_coordinate_files/chromosome_2.cds.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.exon.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.gene.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.cds.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.cds.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.cds.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.gene.coords.tsv
genome_data/feature_coordinate_files/chromosome_2.gene.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.gene.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.cds.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.gene.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.gene.coords.tsv
genome_data/embl_files
genome_data/embl_files/mating_type_region.contig
genome_data/embl_files/chromosome3.contig
genome_data/embl_files/chromosome2.contig
genome_data/embl_files/pMIT.contig
genome_data/embl_files/chromosome1.contig
genome_data/embl_files/telomeric.contig
genome_data/gff
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_unstranded.gff3
genome_data/gff/Schizosaccharomyces_pombe_mating_type_region.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_II.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_I.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_reverse_strand.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_III.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes.gff3
genome_data/gff/Schizosaccharomyces_pombe_chr_II_telomeric_gap.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_forward_strand.gff3
genome_data/gff/Schizosaccharomyces_pombe_mitochondrial_chromosome.gff3
miscellaneous
miscellaneous/increased_sensitivity_to_chemical.tsv.gz
miscellaneous/increased_resistance_to_chemical.tsv.gz
names_and_identifiers
names_and_identifiers/sysID2product.rna.tsv
names_and_identifiers/gene_IDs_names_products.tsv
names_and_identifiers/sysID2product.tsv
names_and_identifiers/gene_IDs_names.tsv
names_and_identifiers/pseudogeneIDs.tsv
names_and_identifiers/PomBase_names_and_identifiers_README.txt
orthologs
orthologs/conserved_one_to_one.txt
orthologs/compara_orths.tsv
orthologs/conserved_multi.txt
slim_terms
slim_terms/pombe_mondo_slim_ids_and_names.tsv
slim_terms/PomBase_slim_terms_README.txt
slim_terms/cc_goslim_pombe_ids_and_names.tsv
slim_terms/bp_goslim_pombe_ids_and_names.tsv
slim_terms/fypo_slim_ids_and_names.tsv
slim_terms/mf_goslim_pombe_ids_and_names.tsv
top of the list to discuss tomorrow
I've made a spreadsheet for comments: https://docs.google.com/spreadsheets/d/1ZEGjgMgfPMjH42fqKfhZ_RwiQL0KDY1w3QsyDydA_zQ/edit?gid=0#gid=0
Please add your suggestions for renaming directories.
Once the directory structure is looking good I'll make a similar spreadsheet for the file names.
These files are currently in a directory called "miscellaneous":
Can someone suggest a better location?
The main release notes / README.txt is here if you'd like to add anything: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/release-README.txt
That gets copied to: https://www.pombase.org/public_releases/pombase-2024-06-01/PomBase_release_notes_2024-06-01.txt
I've started adding READMEs in the sub-directories:
https://github.com/pombase/pombase-scripts/tree/main/release_readme_files
I've made sure all files are uncompressed and have underscores instead of dashes in file names.
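For anyone checking old paths against the new convention, the renaming rule described above can be sketched as below. This is only an illustration of the rule (drop a trailing .gz, replace dashes with underscores in the file name); the function name is hypothetical and it does not decompress or rename anything on disk.

```python
from pathlib import PurePosixPath


def normalised_name(path):
    """Return the path with ".gz" dropped and dashes replaced in the file name.

    Only the final path component is changed; directory names are left alone.
    """
    p = PurePosixPath(path)
    name = p.name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]  # file is assumed to be stored uncompressed
    name = name.replace("-", "_")
    return str(p.with_name(name))


print(normalised_name("annotation_datasets/pombase-modifications.tsv.gz"))
# prints: annotation_datasets/pombase_modifications.tsv
```

Note that a rule this simple would also turn the dashes in dates into underscores (for example in pombase-chado-2024-06-01.sql.gz), which may or may not be wanted.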
I've added more to the main README: https://www.pombase.org/public_releases/pombase-2024-06-01/PomBase_release_notes_2024-06-01.txt
But please add or edit if you have the motivation: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/release-README.txt (Pascal, you can edit on GitHub directly with the pencil icon at the top-right)
I've added a README file for each directory. Most are empty: https://github.com/pombase/pombase-scripts/tree/main/release_readme_files These can be edited on GitHub too. Please add to the READMEs when you think of things. Even notes and bullet-points are useful.
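To keep track of which READMEs still need content, something like the following could be run over a release directory. It is a rough sketch that assumes each top-level directory holds a file named PomBase_<directory>_README.txt, as in chado_database/PomBase_chado_database_README.txt; the function names are made up for illustration and only top-level directories are checked.

```python
from pathlib import Path


def expected_readme_name(directory_name):
    """README name assumed from the pattern seen in the release listing."""
    return f"PomBase_{directory_name}_README.txt"


def readme_status(release_root):
    """Map each top-level directory name to "missing", "empty" or "ok"."""
    status = {}
    directories = sorted(p for p in Path(release_root).iterdir() if p.is_dir())
    for directory in directories:
        readme = directory / expected_readme_name(directory.name)
        if not readme.is_file():
            status[directory.name] = "missing"
        elif readme.stat().st_size == 0:
            status[directory.name] = "empty"
        else:
            status[directory.name] = "ok"
    return status
```

Calling readme_status() on a local copy of a release directory would list the directories whose README is still missing or empty.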
Which files from pombe-embl/orthologs/ should be included in the release directories?
These are the current file names. Are they OK? Are there any ortholog data files missing? https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/
This is the README: https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/PomBase_curated_orthologs_README.txt
If you fancy doing any editing it's here on GitHub: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/curated_orthologs-README.txt
Did we make a decision about which GAF file to keep? (Or both?)
The files are: go_style_gaf.tsv and pombase_style_gaf.tsv
The PomBase style GAF has these differences:
Which one (or both) should we pick?
Both files could do with better names.
We also have extended_pombase_style_gaf.tsv, which has a product column and a term name column to make it easier to use. See: https://github.com/pombase/pombase-chado/issues/1152#issuecomment-2010281643
Should we include that too?
(It also needs a better file name since it's not GAF format)
Let's use the official style. We can document the differences between the official extensions and the pombe display names in the README. The ND are useful in the file, although we don't display them in pombase for query reasons. We should include the qualifier column anyway.
Let's keep the one with product and term names. It might be useful. Perhaps call it GO_gaf_plus_term_and_product_label.tsv; it's a bit long but it is explicit.
Let's use the official style.
OK, thanks. That will simplify the README a bit.
We should include the qualifier column anyway
We include it in the pombase_style file; it's either blank or "NOT".
Let's keep the one with product and term names. It might be useful.
I agree. Most of the world won't care if the file is GAF formatted or not.
We include it in the pombase_style file; it's either blank or "NOT".
or contributes_to?
contributes_to
True! "colocalizes_with" is on a few annotations too.
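A quick way to see which qualifier values actually occur is to count them. The sketch below assumes a GAF-style tab-separated file where the qualifier is the fourth column, comment lines start with "!", and multiple qualifiers are joined with "|". Whether the pombase_style file follows that layout exactly is an assumption, and the sample rows are invented placeholders, not real annotations.

```python
from collections import Counter


def qualifier_counts(lines):
    """Count qualifier values in column 4 of GAF-style tab-separated lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:
            continue  # not enough columns to have a qualifier
        for qualifier in columns[3].split("|"):
            counts[qualifier or "(blank)"] += 1
    return counts


# Invented example rows, truncated after the GO ID column.
sample = [
    "!gaf-version: 2.1\n",
    "PomBase\tGENE1\tabc1\t\tGO:0000001\n",
    "PomBase\tGENE2\tabc2\tNOT\tGO:0000002\n",
    "PomBase\tGENE3\tabc3\tcontributes_to\tGO:0000003\n",
    "PomBase\tGENE4\tabc4\tNOT|colocalizes_with\tGO:0000004\n",
]
print(dict(qualifier_counts(sample)))
# prints: {'(blank)': 1, 'NOT': 2, 'contributes_to': 1, 'colocalizes_with': 1}
```

To run it on a real file, pass an open file handle, e.g. qualifier_counts(open("go_style_gaf.tsv")).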
The ortholog files are all present and correct and the names are fine. I'll review/edit the READMEs at the end.
Started from discussion on https://github.com/pombase/website/issues/2042
Summary of what we discussed today
Cons of current system:
data/annotations and data/releases/*).
Pros of current system:
What we want to include in a release (to be extended):
TODO
Extra
Do we want to mirror our future releases in Zenodo? It might be relatively easy.
Structure of release
What we want to include in a release (subject to change)