Final touches to changelog before announcing

manulera commented 1 year ago

Hi @kimrutherford @ValWood,

I was drafting the email to announce the changes, but I realised a few things that perhaps we should address:

[x] A lot of "new genes" lack references, both protein and RNA. What I was doing before to assign PMIDs was to use those that are manually annotated in tsv files (see wiki), but that leaves most new genes without a PMID. I am thinking it would make more sense, for those that are empty, to simply add all the PMIDs in the db_xref qualifiers (better too much info than none?)
[x] Removed and merged genes also mostly lack references. I don't think there is an easy way to automate this. The protein ones are a bit better (https://www.pombase.org/status/removed-protein-coding-genes), but the RNA ones don't have many.
[ ] In the coordinate changes page, maybe it's good to also add the "genome snapshot" column like in gene pages "Gene structure history". In this case, instead of writing "File" for all of them maybe we could write svn:revision-number or ftp:ftp-folder, e.g. svn:2342, ftp:20110204.
- [ ] Also add a link to the full list on top of the page since this only includes the main features.
[ ] Genome sequence: We have the sequence updates page. This page and the pending changes page do not capture a change in the pMIT contig that happened in 2020 from svn:5949 to svn:6244. This change has substitutions, and some insertion or deletion so that the number of bases changes. A full list of genome sequence changes is generated by the pipeline and can be found here
[ ] I think we should also take the occasion to give guidelines on how to refer to genome versions, both for annotations and genome sequence.

We could discuss this next week if we have a call.

ValWood commented 1 year ago

For these,

SPBC1198.04c.1	zas1	2022-08-05
SPBC1198.04c.2	zas1	2022-08-05
SPAC22A12.08c.2	crd1	2021-04-23
SPAC22A12.08c.1	crd1	2021-04-23
SPBC119.04.2	mei3	2021-04-20
SPBC17A3.07.1	pgr1	2021-04-20
SPBC17A3.07.2	pgr1	2021-04-20
SPBC119.04.1	mei3	2021-04-20
SPCC1620.02.1	wtf23	2020-09-14
SPCC162.04c.1	wtf13	2020-09-14
SPCC548.03c.1	wtf4	2020-09-14
SPCC1620.02.2	wtf23	2020-09-14
SPCC548.03c.2	wtf4	2020-09-14
SPCC162.04c.2	wtf13	2020-09-14
SPCC1906.03.2	wtf19	2020-09-11
SPCC1906.03.1	wtf19	2020-09-11

These ones aren't really new genes (I think they were all annotated in the original submission); these changes represent the annotation of alternative transcripts.

ValWood commented 1 year ago

There are some references in the warnings, for example https://www.pombase.org/term/PBO:0091685 SPBC1685.17, SPCC1840.13, SPCC622.01c, SPCC622.02, SPCC622.03c, SPCC622.07

but I am surprised that these have no "history" on the gene pages because I think they existed, then were deprecated, and then added back

ValWood commented 1 year ago

Similarly, https://www.pombase.org/gene/SPBC21B10.14 has a "warning, new gene" annotation, which contains the reference.

there were quite a lot from this publication.

I would populate with the references from the "warnings" first, then see what is left. I can try to track down the remainder. Most of those that do not have a specific reference I will have added myself, so we can add a "PomBase curators" reference for those.

There should also probably be a history menu item for new genes.

ValWood commented 1 year ago

Quite a few of the merged genes are the removal of alternative transcripts. I did this because they did not fit our criteria of having a different sequence or function (i.e. the translation was the same, and they only represented alternative TSS or polyadenylation sites). So although they are 'removed', the gene still exists. Maybe these need to be flagged differently? The reference for these would be "PomBase curators".

ValWood commented 1 year ago

All of the RNA merges would be "PomBase curators". None of these were published (I was just merging identical ncRNAs from different sources).

manulera commented 1 year ago

but I am surprised that these have no "history" on the gene pages because I think they existed, then were deprecated, and then added back

I had a look at the history, and it looks OK to me. The history displayed is only that of the main feature (CDS for protein coding, RNA for others). If there have been changes to the UTRs, these would not be shown. Maybe this has to be made clear.

ValWood commented 1 year ago

I don't see a "history" section on this page though? (for example) https://www.pombase.org/gene/SPBC1685.17

manulera commented 1 year ago

I don't see a "history" section on this page though? (for example)

Checking, I only see a change to the 5'UTR (removed in 2023).

The history displayed is only that of the main feature (CDS for protein coding, RNA for others). If there have been changes to the UTRs, these would not be shown. Maybe this has to be made clear.

manulera commented 1 year ago

these changes represent the annotation of alternative transcripts

I have fixed this with https://github.com/pombase/genome_changelog/commit/32180798ad5b04c5ca34d753e7206173811f211e

The diff is comprehensive, if you want to double-check

ValWood commented 1 year ago

I don't see a "history" section on this page though? (for example) https://www.pombase.org/gene/SPBC1685.17

this was an "added gene" https://www.pombase.org/status/new-protein

the problem might be that this existed, was removed, and added back...I expected to see this in the history...can discuss.

manulera commented 1 year ago

the problem might be that this existed, was removed, and added back...I expected to see this in the history...can discuss.

Hi @ValWood, the history section is meant to display changes to the main sequence feature of a gene (CDS in this case). If the CDS has never changed (that's the case for SPBC1685.17) the section is not shown. In this particular gene, only the 5'UTR was changed.
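As a rough illustration of the display rule described above (this is not the actual website code), the check could be sketched in Python as follows. The `Change` record, its field names and `show_history_section` are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One recorded change to a feature of a gene (hypothetical structure)."""
    systematic_id: str
    feature_type: str  # e.g. "CDS", "5'UTR", "ncRNA"
    revision: str      # placeholder label for the genome snapshot


def show_history_section(changes, main_feature_type="CDS"):
    """Return True only if the main feature itself has changed.

    Changes to other features (for example UTRs) are ignored, which is
    why a gene with only a 5'UTR change gets no history section.
    """
    return any(c.feature_type == main_feature_type for c in changes)


# A gene whose only recorded change is to the 5'UTR: no history section.
utr_only = [Change("SPBC1685.17", "5'UTR", "example-revision")]
print(show_history_section(utr_only))
```

For non-coding genes the same function would be called with the RNA feature type as `main_feature_type`.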

kimrutherford commented 1 year ago

I am thinking it would make more sense that in those that are empty we simply add all the PMIDs in the db_xref qualifiers

That might be helpful. I had a look at a sample of new genes that don't currently have references on the new protein coding genes page. There were some that had db_xrefs and they looked like they were relevant. So if it's easy to implement then I think it's a good plan.

I also noticed that there are quite a few new genes that have an annotation like this:

FT                   /controlled_curation="term=warning, new gene;
FT                   db_xref=PMID:21511999; date=20110318"

They are displayed on the gene pages in the "Warnings" section. Example: https://www.pombase.org/gene/SPBC2F12.17

We have a page of them but it's quite hidden: https://www.pombase.org/term/PBO:0000082

I know it's extra work, but do you think we could include the details from these "new gene" warnings on the new gene lists?
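To make the idea concrete, here is a minimal Python sketch of pulling the fields out of such a qualifier. It assumes the qualifier value has already been read from the contig file as a single string with the line wrapping joined by spaces; `parse_controlled_curation` is a hypothetical helper, not a function from the pipeline.

```python
def parse_controlled_curation(value: str) -> dict:
    """Split a controlled_curation value into its key=value fields.

    Fields are separated by ";" and each field is split on the first "="
    only, so commas or "=" inside the term text are left untouched.
    """
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:  # skip empty or malformed fragments without "="
            fields[key.strip()] = " ".join(val.split())
    return fields


example = "term=warning, new gene; db_xref=PMID:21511999; date=20110318"
fields = parse_controlled_curation(example)
print(fields["term"], fields["db_xref"], fields["date"])
```

With the fields separated like this, the `db_xref` and `date` of each "warning, new gene" entry could be copied into the corresponding row of the new gene lists.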

ValWood commented 1 year ago

Yes, a good start to fill in the missing references would be to use the db_xref in new gene or gene structure updated

manulera commented 1 year ago

Ok, I have managed to do that, using controlled_curation of type warning or name description. Otherwise I use db_xref, looking for the reference in the main feature first (CDS or RNA feature), and in the UTRs otherwise (there were some cases). That fills in references for most new genes, so that's fixed.

https://github.com/pombase/genome_changelog/blob/b9a413df0f75c5c953d03a1e31538ced1c3a328d/genome_functions.py#L300-L302
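The fallback order described above could be sketched roughly as follows. This is not the code at the linked lines: the feature representation (dicts with `type` and `qualifiers`), the function names, and the choice to read controlled_curation from the main feature only are assumptions made for illustration.

```python
import re

PMID_RE = re.compile(r"PMID:\d+")
# Curation terms that are allowed to supply a reference (assumed prefixes).
CURATION_TERMS = ("warning", "name description")


def pmids_from_curation(feature):
    """PMIDs found in controlled_curation values of the accepted term types."""
    found = []
    for value in feature["qualifiers"].get("controlled_curation", []):
        first_field = value.split(";")[0].strip()
        if first_field.startswith("term="):
            term = first_field[len("term="):].strip()
            if term.startswith(CURATION_TERMS):
                found.extend(PMID_RE.findall(value))
    return found


def pmids_from_db_xref(feature):
    """PMIDs listed directly as db_xref qualifiers."""
    xrefs = feature["qualifiers"].get("db_xref", [])
    return [x for x in xrefs if PMID_RE.fullmatch(x)]


def pick_references(features, main_type="CDS"):
    """Try each source in order and return the first non-empty set of PMIDs."""
    main = [f for f in features if f["type"] == main_type]
    utrs = [f for f in features if "UTR" in f["type"]]
    attempts = (
        (pmids_from_curation, main),  # 1. curation on the main feature
        (pmids_from_db_xref, main),   # 2. db_xref on the main feature
        (pmids_from_db_xref, utrs),   # 3. db_xref on the UTRs
    )
    for extractor, group in attempts:
        pmids = [p for f in group for p in extractor(f)]
        if pmids:
            return sorted(set(pmids))
    return []
```

Genes for which all three attempts come back empty are the ones left for manual tracking.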

manulera commented 1 year ago

I know it's extra work, but do you think we could include the details from these "new gene" warnings on the new gene lists?

That's what I have done. I realise other warnings could be used to fill missing references in the history of gene structures, such as the gene structure updated controlled curation. Any other ones I should think about?

manulera commented 1 year ago

I have also added "PomBase curators" to all merges that lack references.

manulera commented 1 year ago

Hello both, I had a look and the only assemblies with identifiers that I could find were:

The latest assembly (GCA_000002945.2 in ENA and ASM294v2)
- Chromosome 1: CU329670.1
- Chromosome 2: CU329671.1
- Chromosome 3: CU329672.1
- Mitochondria: X54421.1
The previous assembly (GCA_000002945.1 in ENA and ASM294v1)
- Chromosome 1: CU329670.1
- Chromosome 2: CU329671.1
- Chromosome 3: CU329672.1

The Chromosomes are identical between them and the same as the ones in svn, as you can see by their identifiers (I double-checked their sequences directly in the assembly). The only difference between these two assemblies is the addition of the mitochondria.
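One simple way to do this kind of sequence check is to compare checksums of the normalised sequences. The sketch below only illustrates that idea and is not necessarily how the comparison was done here; the function names are hypothetical and it assumes each FASTA text holds a single record.

```python
import hashlib


def fasta_sequence(fasta_text: str) -> str:
    """Return the sequence of a single-record FASTA text, upper-cased."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()


def sequence_checksum(fasta_text: str) -> str:
    """SHA-256 of the sequence, independent of header, case and line width."""
    return hashlib.sha256(fasta_sequence(fasta_text).encode("ascii")).hexdigest()


def same_sequence(fasta_a: str, fasta_b: str) -> bool:
    """True if both FASTA texts contain exactly the same bases."""
    return sequence_checksum(fasta_a) == sequence_checksum(fasta_b)
```

Because headers, case and line wrapping are ignored, a chromosome exported from svn and the same chromosome downloaded from ENA would compare as equal only if every base matches.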

When compared to our current genome:

Now we have this assembly for the mitochondria: https://www.ncbi.nlm.nih.gov/nuccore/MK618072.1
We have this mating type region sequence: https://www.ncbi.nlm.nih.gov/nuccore/FP565355

I cannot find any identifier prior to that in NCBI or ENA for a full chromosome assembly, I can only find entries for individual cosmids: https://www.ebi.ac.uk/ena/browser/view/AL009197.1

So I think it's not going to be possible to add many identifiers to the genome version table, unless I am missing something. Did you deposit the full chromosome sequences somewhere else?

ValWood commented 1 year ago

Actually that makes sense.

https://www.pombase.org/status/sequencing-updates says:

Jan 2007: GeneDB moves from Contigs to Chromosomes. Contigs merged into chromosomes: the 4 sequence gaps represented by 100 Ns (Note: The chromosomes previously made available from the ftp site had 1000 Ns in the gaps)

It sounds as though we had been assembling chromosomes, but still submitting the individual cosmids (at this point we were probably waiting for final sequences to do the chromosome assemblies, but this never happened).

Every sequence change listed after this point is described as pending.

Although I was wrong about the mitochondrial sequence. I didn't realize that was part of the assembly. This means we can try to submit it and see what happens.

manulera commented 1 year ago

To discuss next week:

I have updated the auto-generated genome sequence updates table. It now includes the accession numbers for the PomBase genome revisions whose sequence matches the sequence deposited under that accession number.
Include / not include link in long list (we discussed this but I don't remember)
Announcement and versioning
https://www.pombase.org/status/sequencing-updates and the pending one. These should maybe be cleaned up a little, and each should state whether the changes are present in the current assembly or not.

manulera commented 1 year ago

Hi @ValWood:

Changes that are not in the broad list:

Aug 29 2008 Chromosome 2 cosmid c21D10 affecting SPBC21D10.06c/Map4

PENDING 2008-08-29

Map4 in the reference genome contains an array of 5 repeats. In PMID:16857197, 9 repeats are reported. This number is the correct number of repeats and will be updated via an insertion into the contig sequence shortly. Pers. comm. Henar Valdivieso. Reported 2005-06-15

Aug 29 2008 Chromosome 1 cosmid c27D7 misassembly causing duplication of SPAC27D7.09c (SPAC27D7.10c)

PENDING 2008-08-29

An apparent repeat region on chromosome 1 coordinates 4526059..4529095 (cosmid c27D7) is caused by a misassembly and will be removed from the genomic sequence shortly. The CDS feature SPAC27D7.10c within this region is an exact duplication of SPAC27D7.09c and will be merged with this CDS. Pers. comm. Klavs Hansen. Reported 2004-09-01

ValWood commented 1 year ago

c27D7 has a large insert @ 4526059..4529095 caused by misassembly. Sequence here: c27D7-insert.txt

Map4 PMID:16857197 reports 4 missing repeats @ AA 797:
SWVTETVTSGSVEFTTTIATPVGSTAGTVLVDIPTP
SWVTETVTSGSVEFTTTIATPVGTTAGTVVVDIPTP
SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP
SWVTETVTSGSVGFTTTIATPVGTTAGTVLIDVPTP

I'm not so sure about this one. Our finishers at Sanger were really good and sequenced repetitive regions multiple times with multiple different technologies from both strands. It is more likely that a small sequencing unit would get this wrong than Sanger. For now, unless this same error crops up in other sequenced genomes, I will just add a comment to describe this proposed discrepancy. We can leave it out of the pending alteration list.

ValWood commented 1 year ago

I added this to map4 instead (no need to report) /controlled_curation="term=warning, 9 [SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP-consensus] repeats reported (5 copies in reference); db_xref=PMID:16857197; date=20230704"
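For anyone wanting to see how close the four reported repeats are to the consensus in the warning, a small Python sketch is below. The sequences are copied from the comments above; the `mismatches` helper is hypothetical and simply counts differing positions between two strings of equal length.

```python
CONSENSUS = "SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP"

# The four repeats reported as missing in PMID:16857197 (from the comment above).
REPORTED_REPEATS = [
    "SWVTETVTSGSVEFTTTIATPVGSTAGTVLVDIPTP",
    "SWVTETVTSGSVEFTTTIATPVGTTAGTVVVDIPTP",
    "SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP",
    "SWVTETVTSGSVGFTTTIATPVGTTAGTVLIDVPTP",
]


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


for repeat in REPORTED_REPEATS:
    print(repeat, mismatches(repeat, CONSENSUS))
```

The same helper could be slid along the Map4 protein sequence to count how many windows fall within a chosen mismatch threshold, which is one way to check the copy number in other sequenced genomes.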

manulera commented 1 year ago

For @manulera

[ ] validate existing dataset.

ValWood commented 8 months ago

What still needs to happen before we announce the changelog? (We didn't announce this, did we?)

manulera commented 8 months ago

We didn't announce it yet. We wanted to have #2058 to use the occasion to tell people how to refer to a particular genome/dataset version.

pombase / website

Final touches to changelog before announcing #2042
