pombase / website

PomBase website v2

Dataset versioning / ftp releases #2058

Open manulera opened 1 year ago

manulera commented 1 year ago

Started from discussion on https://github.com/pombase/website/issues/2042

Summary of what we discussed today

TODO

Extra

Do we want to mirror our future releases in Zenodo? It might be relatively easy.

Structure of release

What we want to include in a release (subject to change)

ValWood commented 1 year ago

This is the contents of the misc file with links to their download location from the website. https://docs.google.com/spreadsheets/d/1bMgK_AGCejZn6Y11BS4da0Dv8Z3LfppbcJxYRy6AQZs/edit#gid=0

Perhaps we should move these all as subdirectories of the main directory in case we eventually point people here instead of the "latest" directory

We also need to deal with the files in "exports". These would be classed as "Files_for_external_resources".

ValWood commented 1 year ago

kimrutherford commented 11 months ago

As a small step, https://pombase.org/latest_release now redirects to the latest release. Currently this is manually configured in the Apache config file and will need to be updated whenever there's a new release. It can be automated but I need to do more reading of the Apache documentation. :-)
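
One part of automating the redirect is working out which release directory is the newest. The following is only a minimal sketch of that step, not the actual PomBase setup. It assumes release directories are named like "pombase-2024-06-01" (the example that appears later in this thread); the function name `latest_release` is hypothetical, and how the result would be fed into the Apache configuration is left open.

```python
import re
from datetime import date

# Assumed naming pattern, based on the "pombase-2024-06-01" example in this thread.
RELEASE_RE = re.compile(r"^pombase-(\d{4})-(\d{2})-(\d{2})$")


def latest_release(names):
    """Return the release directory name with the most recent date, or None."""
    dated = []
    for name in names:
        match = RELEASE_RE.match(name)
        if not match:
            continue  # ignore anything that does not look like a release directory
        year, month, day = (int(part) for part in match.groups())
        try:
            dated.append((date(year, month, day), name))
        except ValueError:
            continue  # skip names whose date part is not a real calendar date
    if not dated:
        return None
    # Tuples compare by date first, so max() picks the newest release.
    return max(dated)[1]
```

Passing the names from a directory listing (for example the result of `os.listdir` on the releases directory) would give the name that the redirect should point to.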

kimrutherford commented 11 months ago

I've made a temporary directory on pombase.org with the new file structure so that we can track progress: "/public_releases"

Currently it only has the files and directories from the spreadsheet.

There are no links to it on the website so hopefully no one will find it. :-)

The URL is https://pombase.org/releases/ with "releases" replaced by "public_releases".

When we're happy and everything's done we can remove/rename the old /releases directory and move /public_releases over.

kimrutherford commented 11 months ago

We need to look at this issue again after the rearranging is done: https://github.com/pombase/pombase-chado/issues/1060

manulera commented 10 months ago

Hi @kimrutherford. I had a look at the directory structure. Some changes / additions I think make sense.

manulera commented 9 months ago

Keep me in the loop for this one! This should be part of the announcement mentioned in https://github.com/pombase/website/issues/2042

kimrutherford commented 7 months ago

what do we still need to do here?

Very little is done yet. I've only had a think about how to handle this. I haven't actually done anything yet.

kimrutherford commented 7 months ago

From pombase/pombase-chado#720:

Add a README describing the files and directories to each new release directory.

ValWood commented 3 months ago

@kimrutherford this would be a good one to address, so that we can announce the "gene structure history" v

ValWood commented 1 week ago

Summary:

Create a new directory structure with the currently named files.
Keep the old files but remove all references to them.

kimrutherford commented 2 days ago

I've finally got back to this.

Create orthologs directory (not sure what needs to be included).

@ValWood: Which files from pombe-embl/orthologs/ should be included in the release directories?

I've implemented most of the suggestions from @manulera except for the gft directory and this: "Create a file external_data_versions.md, with several sections:" I'm still working on those.

An example of the current progress is here: https://www.pombase.org/public_releases/pombase-2024-06-01/ (It's not linked to from the website)

I'd like to rename files to be consistent:

We should add a README in every directory describing the files.

Example of current file structure:

PomBase_release_notes_2024-06-01.txt
annotation_datasets
annotation_datasets/high_confidence_physical_interactions
annotation_datasets/high_confidence_physical_interactions/pombase-go-substrates.tsv.gz
annotation_datasets/high_confidence_physical_interactions/pombase-go-physical-interactions.tsv.gz
annotation_datasets/gene_expression_table.tsv
annotation_datasets/pombase-modifications.tsv.gz
annotation_datasets/Complex_annotation.tsv
annotation_datasets/gene_ontology
annotation_datasets/gene_ontology/gene_product_information_taxonid_4896.tsv
annotation_datasets/gene_ontology/pombase_style_gaf.tsv
annotation_datasets/gene_ontology/go_style_gaf.tsv
annotation_datasets/gene_ontology/gene_product_annotation_data_taxonid_4896.tsv
annotation_datasets/disease_association.tsv
annotation_datasets/protein_modifications
annotation_datasets/phenotypes_and_genotypes
annotation_datasets/phenotypes_and_genotypes/pombase-phenotype-annotation.eco.phaf.gz
annotation_datasets/phenotypes_and_genotypes/gene_viability.tsv
annotation_datasets/phenotypes_and_genotypes/all_alleles.tsv
annotation_datasets/phenotypes_and_genotypes/pombase-phenotype-annotation.phaf.gz
auto_generated_protein_feature_files
auto_generated_protein_feature_files/disordered_regions.tsv
auto_generated_protein_feature_files/PeptideStats.tsv
auto_generated_protein_feature_files/transmembrane_domain_coords_and_seqs.tsv
auto_generated_protein_feature_files/aa_composition.tsv
auto_generated_protein_feature_files/ProteinFeatures.tsv
chado_database
chado_database/pombase-chado-2024-06-01.sql.gz
chado_database/PomBase_chado_database_README.txt
exports_for_external_resources
exports_for_external_resources/rnacentral.json
exports_for_external_resources/allele_summaries.json
exports_for_external_resources/apicuron_data.json
exports_for_external_resources/publications_with_annotations.txt
genome_data
genome_data/fasta
genome_data/fasta/feature_sequences
genome_data/fasta/feature_sequences/five_prime_utrs.fa.gz
genome_data/fasta/feature_sequences/peptide.fa.gz
genome_data/fasta/feature_sequences/cds.fa.gz
genome_data/fasta/feature_sequences/cds+introns.fa.gz
genome_data/fasta/feature_sequences/cds+introns+utrs.fa.gz
genome_data/fasta/feature_sequences/three_prime_utrs.fa.gz
genome_data/fasta/feature_sequences/introns_within_cds.fa.gz
genome_data/fasta/chromosomes
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_mitochondrial_chromosome.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_I.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_III.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chromosome_II.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_chr_II_telomeric_gap.fa.gz
genome_data/fasta/chromosomes/Schizosaccharomyces_pombe_mating_type_region.fa.gz
genome_data/feature_coordinate_files
genome_data/feature_coordinate_files/chromosome_2.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.cds.coords.tsv
genome_data/feature_coordinate_files/chromosome_2.cds.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.exon.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.gene.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.cds.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.cds.coords.tsv
genome_data/feature_coordinate_files/chr_II_telomeric_gap.cds.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.gene.coords.tsv
genome_data/feature_coordinate_files/chromosome_2.gene.coords.tsv
genome_data/feature_coordinate_files/mating_type_region.gene.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.cds.coords.tsv
genome_data/feature_coordinate_files/mitochondrial.gene.coords.tsv
genome_data/feature_coordinate_files/chromosome_1.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.exon.coords.tsv
genome_data/feature_coordinate_files/chromosome_3.gene.coords.tsv
genome_data/embl_files
genome_data/embl_files/mating_type_region.contig
genome_data/embl_files/chromosome3.contig
genome_data/embl_files/chromosome2.contig
genome_data/embl_files/pMIT.contig
genome_data/embl_files/chromosome1.contig
genome_data/embl_files/telomeric.contig
genome_data/gff
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_unstranded.gff3
genome_data/gff/Schizosaccharomyces_pombe_mating_type_region.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_II.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_I.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_reverse_strand.gff3
genome_data/gff/Schizosaccharomyces_pombe_chromosome_III.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes.gff3
genome_data/gff/Schizosaccharomyces_pombe_chr_II_telomeric_gap.gff3
genome_data/gff/Schizosaccharomyces_pombe_all_chromosomes_forward_strand.gff3
genome_data/gff/Schizosaccharomyces_pombe_mitochondrial_chromosome.gff3
miscellaneous
miscellaneous/increased_sensitivity_to_chemical.tsv.gz
miscellaneous/increased_resistance_to_chemical.tsv.gz
names_and_identifiers
names_and_identifiers/sysID2product.rna.tsv
names_and_identifiers/gene_IDs_names_products.tsv
names_and_identifiers/sysID2product.tsv
names_and_identifiers/gene_IDs_names.tsv
names_and_identifiers/pseudogeneIDs.tsv
names_and_identifiers/PomBase_names_and_identifiers_README.txt
orthologs
orthologs/conserved_one_to_one.txt
orthologs/compara_orths.tsv
orthologs/conserved_multi.txt
slim_terms
slim_terms/pombe_mondo_slim_ids_and_names.tsv
slim_terms/PomBase_slim_terms_README.txt
slim_terms/cc_goslim_pombe_ids_and_names.tsv
slim_terms/bp_goslim_pombe_ids_and_names.tsv
slim_terms/fypo_slim_ids_and_names.tsv
slim_terms/mf_goslim_pombe_ids_and_names.tsv

ValWood commented 2 days ago

top of the list to discuss tomorrow

kimrutherford commented 2 days ago

I've made a spreadsheet for comments: https://docs.google.com/spreadsheets/d/1ZEGjgMgfPMjH42fqKfhZ_RwiQL0KDY1w3QsyDydA_zQ/edit?gid=0#gid=0

Please add your suggestions for renaming directories.

Once the directory structure is looking good I'll make a similar spreadsheet for the file names.

kimrutherford commented 2 days ago

These files are currently in a directory called "miscellaneous":

Can someone suggest a better location?

kimrutherford commented 2 days ago

The main release notes / README.txt is here if you'd like to add anything: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/release-README.txt

That gets copied to: https://www.pombase.org/public_releases/pombase-2024-06-01/PomBase_release_notes_2024-06-01.txt

kimrutherford commented 1 day ago

I've started adding READMEs in the sub-directories:

https://github.com/pombase/pombase-scripts/tree/main/release_readme_files

kimrutherford commented 1 day ago

I've made sure all files are uncompressed and have underscores instead of dashes in file names.
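
For illustration, the naming rule described above (no ".gz" suffix, underscores instead of dashes) can be sketched as a small function. This is an assumption about the rule rather than the script actually used, the function name `normalised_name` is hypothetical, and the names finally chosen on the server may differ.

```python
def normalised_name(file_name):
    """Drop a trailing .gz suffix and replace dashes with underscores."""
    if file_name.endswith(".gz"):
        file_name = file_name[: -len(".gz")]
    return file_name.replace("-", "_")
```

Under this rule a name such as "pombase-go-substrates.tsv.gz" from the earlier listing would become "pombase_go_substrates.tsv", while names that already follow the convention are returned unchanged.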

I've added more to the main README: https://www.pombase.org/public_releases/pombase-2024-06-01/PomBase_release_notes_2024-06-01.txt

But please add or edit if you have the motivation: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/release-README.txt (Pascal, you can edit on GitHub directly with the pencil icon at the top-right)

I've added a README file for each directory. Most are empty: https://github.com/pombase/pombase-scripts/tree/main/release_readme_files These can be edited on GitHub too. Please add to the READMEs when you think of things. Even notes and bullet-points are useful.
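
A quick way to check that every directory in a release has a README is sketched below. It is only an illustration: the function name `directories_missing_readme` is hypothetical, and it assumes that a README is any file with "README" in its name, as in the names used in this thread (for example "PomBase_slim_terms_README.txt").

```python
import os


def directories_missing_readme(root):
    """Return sorted relative paths of directories under root with no README file."""
    missing = []
    for dir_path, _dir_names, file_names in os.walk(root):
        # A directory passes if any file directly inside it has README in its name.
        if not any("README" in name for name in file_names):
            missing.append(os.path.relpath(dir_path, root))
    return sorted(missing)
```

The release root itself is reported as "." if it has no README, so a top-level file named only as release notes would need its own rule. Directories that contain only sub-directories are reported too.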

kimrutherford commented 1 day ago

Which files from pombe-embl/orthologs/ should be included in the release directories?

These are the current file names. Are they OK? Are there any ortholog data files missing? https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/

This is the README: https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/PomBase_curated_orthologs_README.txt

If you fancy doing any editing it's here on GitHub: https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/curated_orthologs-README.txt

kimrutherford commented 1 day ago

Did we make a decision about which GAF file to keep? (Or both?)

The files are: go_style_gaf.tsv and pombase_style_gaf.tsv

The PomBase style GAF has these differences:

Which one (or both) should we pick?

Both files could do with better names.

We also have extended_pombase_style_gaf.tsv which has a product column and a term name column to make it easier to use. See: https://github.com/pombase/pombase-chado/issues/1152#issuecomment-2010281643

Should we include that too?

(It also needs a better file name since it's not GAF format)

ValWood commented 22 hours ago

Let's use the official style. We can document the differences between the official extensions and the pombe display names in the README. The ND annotations are useful in the file, although we don't display them in PomBase for query reasons. We should include the qualifier column anyway.

ValWood commented 22 hours ago

Let's keep the one with product and term names. It might be useful. Perhaps call it GO_gaf_plus_term_and_product_label.tsv. It's a bit long but it is explicit.

kimrutherford commented 22 hours ago

Let's use the official style.

OK, thanks. That will simplify the README a bit.

We should include the qualifier column anyway

We include it in the pombase_style file; it's either blank or "NOT".

Let's keep the one with product and term names. It might be useful.

I agree. Most of the world won't care if the file is GAF formatted or not.

ValWood commented 22 hours ago

We include it in the pombase_style file; it's either blank or "NOT".

or contributes_to?

kimrutherford commented 21 hours ago

contributes_to

True! "colocalizes_with" is on a few annotations too.
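
To see which qualifier values actually occur, the column can be tallied directly. The sketch below assumes the file follows the standard GAF layout: tab-separated columns, comment lines starting with "!", and the qualifier in column 4. Whether pombase_style_gaf.tsv uses exactly that column order is an assumption here, and the function name `qualifier_counts` is hypothetical.

```python
from collections import Counter


def qualifier_counts(lines):
    """Count the values found in the qualifier column (column 4) of GAF-style lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:
            continue  # not enough columns to contain a qualifier
        counts[columns[3]] += 1  # an empty string means a blank qualifier
    return counts
```

Calling it on an open file handle returns a Counter keyed by the qualifier text, so blank, "NOT", "contributes_to" and "colocalizes_with" would each get their own total.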

ValWood commented 21 hours ago

The ortholog files are all present and correct and the names are fine. I'll review/edit the READMEs at the end.