Open manulera opened 1 year ago
For these,
| Transcript ID | Gene name | Date |
|---|---|---|
| SPBC1198.04c.1 | zas1 | 2022-08-05 |
| SPBC1198.04c.2 | zas1 | 2022-08-05 |
| SPAC22A12.08c.2 | crd1 | 2021-04-23 |
| SPAC22A12.08c.1 | crd1 | 2021-04-23 |
| SPBC119.04.2 | mei3 | 2021-04-20 |
| SPBC17A3.07.1 | pgr1 | 2021-04-20 |
| SPBC17A3.07.2 | pgr1 | 2021-04-20 |
| SPBC119.04.1 | mei3 | 2021-04-20 |
| SPCC1620.02.1 | wtf23 | 2020-09-14 |
| SPCC162.04c.1 | wtf13 | 2020-09-14 |
| SPCC548.03c.1 | wtf4 | 2020-09-14 |
| SPCC1620.02.2 | wtf23 | 2020-09-14 |
| SPCC548.03c.2 | wtf4 | 2020-09-14 |
| SPCC162.04c.2 | wtf13 | 2020-09-14 |
| SPCC1906.03.2 | wtf19 | 2020-09-11 |
| SPCC1906.03.1 | wtf19 | 2020-09-11 |
These aren't really new genes (I think they were all annotated in the original submission); these changes represent the annotation of alternative transcripts.
There are some references in the warnings, for example https://www.pombase.org/term/PBO:0091685 SPBC1685.17, SPCC1840.13, SPCC622.01c, SPCC622.02, SPCC622.03c, SPCC622.07
but I am surprised that these have no "history" on the gene pages because I think they existed, then were deprecated, and then added back
Similarly, https://www.pombase.org/gene/SPBC21B10.14 has a "warning, new gene" annotation which contains the reference.
There were quite a lot from this publication.
I would populate with the references from the "warnings" first, then see what is left. I can try to track down the remainder. Most that do not have a specific reference I will have added, so we can create a PomBase curators reference for those.
There should also probably be a history menu item for new genes.
Quite a few of the merged genes are the removal of alternative transcripts. I did this because they did not fit our criteria of having different sequence or function (i.e. the translation was the same, and they only represented alternative TSS or polyadenylation sites). So although they are 'removed', the gene still exists. Maybe these need to be flagged differently? The reference for these would be "PomBase curators".
All of the RNA merges would be PomBase curators. None of these were published (I was just merging identical ncRNAs from different sources).
> but I am surprised that these have no "history" on the gene pages because I think they existed, then were deprecated, and then added back

I had a look at the history, and it looks OK to me. The history displayed is only that of the main feature (CDS for protein coding, RNA for others). If there have been changes to the UTRs this would not be shown. This has to be made clear maybe.
I don't see a "history" section on this page though? (for example) https://www.pombase.org/gene/SPBC1685.17

> I don't see a "history" section on this page though? (for example)

Checking, I only see a change to the 5'UTR (removed in 2023).
> The history displayed is only that of the main feature (CDS for protein coding, RNA for others). If there have been changes to the UTRs this would not be shown. This has to be made clear maybe.

> these changes represent the annotation of alternative transcripts

I have fixed this with https://github.com/pombase/genome_changelog/commit/32180798ad5b04c5ca34d753e7206173811f211e
The diff is comprehensive, if you want to double-check
> I don't see a "history" section on this page though? (for example) https://www.pombase.org/gene/SPBC1685.17

this was an "added gene" https://www.pombase.org/status/new-protein
the problem might be that this existed, was removed, and added back...I expected to see this in the history...can discuss.
> the problem might be that this existed, was removed, and added back...I expected to see this in the history...can discuss.

Hi @ValWood, the history section is meant to display changes to the main sequence feature of a gene (CDS in this case). If the CDS has never changed (that's the case for SPBC1685.17) the section is not shown. In this particular gene, only the 5'UTR was changed.
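To make the display rule above concrete, here is a minimal sketch in Python. The `Change` record, the `is_main_feature` helper and the "feature type ends in RNA" check are assumptions made for illustration only; they are not taken from the genome_changelog code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    """Hypothetical record of one change to a sequence feature."""
    date: str
    feature_type: str  # e.g. "CDS", "5'UTR", "ncRNA"
    description: str


def is_main_feature(feature_type: str, protein_coding: bool) -> bool:
    # Assumption: the main feature is the CDS for protein-coding genes,
    # and any feature whose type ends in "RNA" for the others.
    if protein_coding:
        return feature_type == "CDS"
    return feature_type.endswith("RNA")


def history_to_display(changes: List[Change], protein_coding: bool) -> List[Change]:
    """Keep only changes to the main feature; an empty list means no history section."""
    return [c for c in changes if is_main_feature(c.feature_type, protein_coding)]


# A gene whose only change is to the 5'UTR ends up with nothing to display.
utr_only = [Change("2023", "5'UTR", "5'UTR removed")]
print(history_to_display(utr_only, protein_coding=True))
```

With this rule, a gene like SPBC1685.17, where only the 5'UTR changed, gets an empty list and so no history section.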
I am thinking it would make more sense that in those that are empty we simply add all the PMIDs in the db_xref qualifiers
That might be helpful. I had a look at a sample of new genes that don't currently have references on the new protein coding genes page. There were some that had db_xrefs and they looked like they were relevant. So if it's easy to implement then I think it's a good plan.
I also noticed that there are quite a few new genes that have an annotation like this:
FT /controlled_curation="term=warning, new gene;
FT db_xref=PMID:21511999; date=20110318"
They are displayed on the gene pages in the "Warnings" section. Example: https://www.pombase.org/gene/SPBC2F12.17
We have a page of them but it's quite hidden: https://www.pombase.org/term/PBO:0000082
I know it's extra work, but do you think we could include the details from these "new gene" warnings on the new gene lists?
Yes, a good start to fill in the missing references would be to use the db_xref in new gene or gene structure updated
Ok, I have managed to do that, using `controlled_curation` of type `warning` or `name description`. Otherwise use `db_xref`, looking for the reference in the main feature first (`CDS` or RNA feature), and on the UTRs otherwise (there were some cases). That fills up most references for most new genes, so that's fixed.
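A rough sketch of that lookup order, for discussion only. It assumes features are plain dicts with optional `controlled_curation` and `db_xref` lists, and that the curation type is the start of the `term=` value. The function names are hypothetical, and the exact order (curation on the main feature, then `db_xref` on the main feature, then `db_xref` on the UTRs) is one reading of the description above, not the actual implementation.

```python
from typing import Dict, List, Optional

# Curation types whose db_xref is used as the reference (from the comment above).
CURATION_TYPES = ("warning", "name description")


def parse_controlled_curation(value: str) -> Dict[str, str]:
    """Split 'term=...; db_xref=...; date=...' into a dict of its fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


def curation_reference(feature: dict) -> Optional[str]:
    """First PMID found in a controlled_curation of one of the accepted types."""
    for value in feature.get("controlled_curation", []):
        fields = parse_controlled_curation(value)
        ref = fields.get("db_xref", "")
        if fields.get("term", "").startswith(CURATION_TYPES) and ref.startswith("PMID:"):
            return ref
    return None


def db_xref_reference(feature: dict) -> Optional[str]:
    """First PMID found in the feature's db_xref qualifiers."""
    for ref in feature.get("db_xref", []):
        if ref.startswith("PMID:"):
            return ref
    return None


def pick_reference(main_feature: dict, utr_features: List[dict]) -> Optional[str]:
    """Assumed order: curation on main feature, db_xref on main feature, db_xref on UTRs."""
    ref = curation_reference(main_feature) or db_xref_reference(main_feature)
    if ref:
        return ref
    for utr in utr_features:
        ref = db_xref_reference(utr)
        if ref:
            return ref
    return None


cds = {"type": "CDS", "controlled_curation": [
    "term=warning, new gene; db_xref=PMID:21511999; date=20110318"]}
print(pick_reference(cds, []))
```

If nothing is found the function returns `None`, which would correspond to the cases that still need a reference tracked down by hand.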
> I know it's extra work, but do you think we could include the details from these "new gene" warnings on the new gene lists?

That's what I have done. I realise other warnings could be used to fill missing references in the history of gene structures, such as the `gene structure updated` controlled curation. Any other ones I should think about?
I have also added "PomBase curators" to all merges that lack references.
Hello both, I had a look and the only assemblies with identifiers that I could find were:
- The latest assembly (`GCA_000002945.2` in ENA and `ASM294v2`)
  - CU329670.1
  - CU329671.1
  - CU329672.1
  - X54421.1
- The previous assembly (`GCA_000002945.1` in ENA and `ASM294v1`)
  - CU329670.1
  - CU329671.1
  - CU329672.1
The chromosomes are identical between them and the same as the ones in svn, as you can see by their identifiers (I double-checked their sequences directly in the assembly). The only difference between these two assemblies is the addition of the mitochondria.
When compared to our current genome:
https://www.ncbi.nlm.nih.gov/nuccore/MK618072.1
I cannot find any identifier prior to that in NCBI or ENA for a full chromosome assembly, I can only find entries for individual cosmids: https://www.ebi.ac.uk/ena/browser/view/AL009197.1
So I think it's not going to be possible to add many identifiers to the genome version table, unless I am missing something. Did you deposit the full chromosome sequences somewhere else?
Actually that makes sense.
https://www.pombase.org/status/sequencing-updates says: "Jan 2007: GeneDB moves from Contigs to Chromosomes. Contigs merged into chromosomes: the 4 sequence gaps represented by 100 Ns (Note: The chromosomes previously made available from the ftp site had 1000 Ns in the gaps)"
It sounds as though we had been assembling chromosomes, but still submitting the individual cosmids (at this point we were probably waiting for final sequences to do the chromosome assemblies, but this never happened).
Every sequence change listed after this point is described as pending.
Although I was wrong about the mitochondrial sequence. I didn't realize that was part of the assembly. This means we can try to submit it and see what happens.
To discuss next week:
Hi @ValWood:
Changes that are not in the broad list:
PENDING 2008-08-29
Map4 in the reference genome contains an array of 5 repeats. In PMID:16857197, 9 repeats are reported. This number is the correct number of repeats and will be updated via an insertion into the contig sequence shortly. Pers. comm. Henar Valdivieso. Reported 2005-06-15
PENDING 2008-08-29
An apparent repeat region on chromosome 1 coordinates 4526059..4529095 (cosmid c27D7) is caused by a misassembly and will be removed from the genomic sequence shortly. The CDS feature SPAC27D7.10c within this region is an exact duplication of SPAC27D7.09c and will be merged with this CDS. Pers. comm. Klavs Hansen. Reported 2004-09-01
c27D7 has a large insert @ 4526059..4529095 caused by misassembly. Sequence here: c27D7-insert.txt
Map4 PMID:16857197 reports 4 missing repeats @ AA 797 SWVTETVTSGSVEFTTTIATPVGSTAGTVLVDIPTP SWVTETVTSGSVEFTTTIATPVGTTAGTVVVDIPTP SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP SWVTETVTSGSVGFTTTIATPVGTTAGTVLIDVPTP
I'm not so sure about this one. Our finishers at Sanger were really good and sequenced repetitive regions multiple times with multiple different technologies from both strands. It is more likely that a small sequencing unit would get this wrong than Sanger. For now, unless this same error crops up in other sequenced genomes, I will just add a comment to describe this proposed discrepancy. We can leave it out of the pending alteration list.
I added this to map4 instead (no need to report) /controlled_curation="term=warning, 9 [SWVTETVTSGSVGFTTTIATPIGTTAGTVLVDIPTP-consensus] repeats reported (5 copies in reference); db_xref=PMID:16857197; date=20230704"
For @manulera
What still needs to happen to announce the change log (we didn't announce this did we?)
We didn't announce it yet. We wanted to have #2058 to use the occasion to tell people how to refer to a particular genome/dataset version.
Hi @kimrutherford @ValWood,
I was drafting the email to announce the changes, but I realised a few things that perhaps we should address:
`svn:revision-number` or `ftp:ftp-folder`, e.g. `svn:2342`, `ftp:20110204`.
We could discuss this next week if we have a call.
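In case it helps the discussion, a small sketch of how identifiers in the `svn:revision-number` / `ftp:ftp-folder` form could be validated. The `VersionId` type and `parse_version_id` function are hypothetical, and the assumption that an ftp folder name is always eight digits is based only on the `ftp:20110204` example.

```python
from typing import NamedTuple


class VersionId(NamedTuple):
    source: str  # "svn" or "ftp"
    value: str   # revision number or ftp folder name


def parse_version_id(text: str) -> VersionId:
    """Parse identifiers such as 'svn:2342' or 'ftp:20110204'."""
    source, sep, value = text.strip().partition(":")
    if not sep or source not in ("svn", "ftp") or not value.isdigit():
        raise ValueError(f"not a valid version identifier: {text!r}")
    # Assumption: ftp folders are named by date as eight digits (YYYYMMDD).
    if source == "ftp" and len(value) != 8:
        raise ValueError(f"ftp folder should be eight digits: {text!r}")
    return VersionId(source, value)


print(parse_version_id("svn:2342"))
print(parse_version_id("ftp:20110204"))
```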