Closed ValWood closed 2 years ago
To compile a table of annotated CS (or TTS) with a single number you should use the data in table S7, which gives the length of the 3'-UTRs as annotated (at the time of writing), as defined by a cluster of CS, or as defined by the most distant CS. The second data set (cluster) is probably the best, as it gives the length of the 'typical' 3'-UTR. Simply add the length of the 3'-UTR to your coordinates (end of CDS). These data have been validated independently by the Crammer lab (https://www.ncbi.nlm.nih.gov/pubmed/26883383) - the median difference between their data and our dataset is 3 nt.
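As a rough illustration of "add the length of the 3'-UTR to your coordinates", here is a minimal Python sketch. The function name `utr_far_end` and the `"+"`/`"-"` strand convention are assumptions made for illustration, not part of the dataset or of any PomBase script. The one thing it makes explicit is that on the reverse strand the length has to be subtracted from the CDS end rather than added, which is the arithmetic used for SPCC320.06 further down this thread.

```python
def utr_far_end(cds_end: int, utr_length: int, strand: str) -> int:
    """Return the coordinate reached by moving utr_length nt downstream
    of the CDS end, taking the strand into account.

    cds_end    -- genomic coordinate of the 3' end of the CDS
    utr_length -- 3'-UTR length taken from the table (in nt)
    strand     -- "+" for forward, "-" for reverse (assumed convention)
    """
    if utr_length < 0:
        raise ValueError("UTR length must be non-negative")
    if strand == "+":
        # Forward strand: downstream means increasing coordinates.
        return cds_end + utr_length
    if strand == "-":
        # Reverse strand: downstream means decreasing coordinates.
        return cds_end - utr_length
    raise ValueError(f"unknown strand: {strand!r}")


# SPCC320.06 is on the reverse strand; numbers are the ones quoted in this thread.
print(utr_far_end(158577, 5161, "-"))  # 153416
```

Whether the result should be treated as the last base of the UTR or as the cleavage position itself (an off-by-one question) is not settled by this sketch and would need checking against the tracks.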
For the 3' UTRs, @kimrutherford will create features using
the existing 3' end of the CDS and Dataset7 /columnD / Length of Longest UTR defined by cluster peaks from 2013RNABIOL0135R-Supdata.xlsx
This means I will only need to make decisions for the TSS ;)
Need feature type 3'UTR /SO=SO:0000205 /systematic_ID=xxxxxxx /db_xref=PMID:23900342 /note="feature created using major cleavage site peak, vegetative growth, EMM"
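For reference, a feature with those qualifiers could be written out roughly as below. This is only a sketch in Python, not the script that was actually used: the function name `format_utr_feature`, the example coordinates and the placeholder ID are hypothetical, and quoting every qualifier value is an assumption. The layout (feature key at column 6, location and qualifiers at column 22) follows the usual EMBL feature-table convention that Artemis reads.

```python
def format_utr_feature(start, end, strand, systematic_id):
    """Return EMBL-style feature-table lines for one 3'UTR feature.

    Qualifier names and values are taken from the request above;
    quoting all of them is an assumption.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    location = f"{start}..{end}"
    if strand == "-":
        # Reverse-strand features are written as complement(start..end).
        location = f"complement({location})"

    key = "3'UTR"
    lines = ["FT   " + key.ljust(16) + location]

    qualifiers = [
        ("SO", "SO:0000205"),
        ("systematic_ID", systematic_id),
        ("db_xref", "PMID:23900342"),
        ("note", "feature created using major cleavage site peak, "
                 "vegetative growth, EMM"),
    ]
    indent = "FT" + " " * 19  # qualifiers start at column 22
    for name, value in qualifiers:
        lines.append(f'{indent}/{name}="{value}"')
    return lines


# Hypothetical coordinates and ID, for illustration only.
for line in format_utr_feature(1000, 1500, "-", "xxxxxxx"):
    print(line)
```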
~We could even link to the tracks using the coordinates from the transcript section~
> We could even link to the tracks using the coordinates from the transcript section
I don't understand that bit.
Out of curiosity, I pulled out the longest UTR from the dataset, which is:
SPCC320.06 | | 5636 | 5161 | 5267
It doesn't seem to match what's in JBrowse?: https://www.pombase.org/jbrowse/?loc=III%3A147551..163550&tracks=PomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features%2CPoly(A)%20sites%20during%20vegetative%20growth%20(forward%20strand)%20-%20Mata%20(2013)%2CPoly(A)%20sites%20during%20vegetative%20growth%20(reverse%20strand)%20-%20Mata%20(2013)&highlight=
I thought there should be a peak at 158577-5161 = 153416
There were a few entries in the dataset that used synonyms. Those were easy to fix. There are four other IDs that I need help with:
I can't find these two in PomBase:
SPAC4H3.12c
SPAC1F12.03c
SPBC1685.12c is not a current ID but there is a similar ID: SPBC1685.12c-antisense-1
SPBC8E4.02c is the ID of an RNA (SPNCRNA.9001) so I can't make a UTR. But that RNA overlaps a coding gene (SPBC8E4.01c) so maybe there was a mix up when the data set was created?
Some gene IDs (e.g. SPBC13E7.09) have multiple lines in the data file, but the value in the "Length of Longest UTR defined by cluster peaks" column is the same for each line, so that is not a problem.
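A check like the one described above (repeated gene IDs always carrying the same column D value) could be sketched in Python as follows. This is not the Perl script attached to the ticket; the function name `longest_utr_by_gene` and the `(gene_id, length)` input shape are assumptions, and the length used for SPBC13E7.09 in the usage example is made up.

```python
from collections import defaultdict


def longest_utr_by_gene(rows):
    """Collapse repeated gene IDs to a single UTR length.

    rows -- iterable of (gene_id, length) pairs, where length is the
            "Length of Longest UTR defined by cluster peaks" value.
    Raises ValueError if any gene ID appears with differing lengths.
    """
    values = defaultdict(set)
    for gene_id, length in rows:
        values[gene_id].add(length)

    # Any gene with more than one distinct value needs a manual decision.
    conflicts = {g: sorted(v) for g, v in values.items() if len(v) > 1}
    if conflicts:
        raise ValueError(f"conflicting UTR lengths: {conflicts}")

    return {g: next(iter(v)) for g, v in values.items()}


rows = [
    ("SPBC13E7.09", 120),   # hypothetical value, repeated on two lines
    ("SPBC13E7.09", 120),
    ("SPBC8E4.01c", 230),   # cluster-defined length quoted later in this thread
]
print(longest_utr_by_gene(rows))
```

Raising on a conflict rather than silently picking one value keeps any genuinely ambiguous gene visible.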
Here's a first attempt: https://curation.pombase.org/kmr44/mata-polyA-sites-utrs-for-artemis.tar.gz
Now I have a script it's easy to re-create these files if things need tweaking.
I added a /color= qualifier to make them stand out.
For the record, here's the script: utr_process_ticket_819.pl.txt
> > We could even link to the tracks using the coordinates from the transcript section
>
> I don't understand that bit.
yeah ignore that.
> I thought there should be a peak at 158577-5161 = 153416
There is a peak at 153416. Are you looking at the reverse strand? It isn't the strongest peak of all, but it is the highest peak in the cluster that would give the longest UTR.
> There is a peak at 153416. Are you looking at the reverse strand?
Yep, I was looking at the wrong strand. :-)
https://www.pombase.org/status/new-and-removed-genes
SPAC4H3.12c | not protein-coding (PMID:24929437, PMID:26615217) | 2015-12-11
SPAC1F12.03c | removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature | 2012-07-16
SPBC1685.12c | not protein-coding (PMID:24929437, PMID:26615217) | 2015-12-11
> Yep, I was looking at the wrong strand. :-)
I knew that - because I was too for the first few minutes ;-)
This is good. Absolutely spot on!
> SPBC8E4.02c is the ID of an RNA ...
That should be: SPBC8E4.02c is a synonym of an RNA ...
> SPBC8E4.02c is the ID of an RNA (SPNCRNA.9001) so I can't make a UTR. But that RNA overlaps a coding gene (SPBC8E4.01c) so maybe there was a mix up when the data set was created?
Now I look more closely, I'm not sure that's true as there is a line for SPBC8E4.01c as well:
SPBC8E4.01c 230 230 425
It's different to SPBC8E4.02c
SPBC8E4.02c 647 284 617
closing, new tasks will have new tickets.
Conditions: 972 h- cells were grown in Edinburgh Minimal Medium (EMM) at 32 °C.
I'm trying to establish the best way to get the cluster peak site into Artemis.
headings are
I can't find the information about precisely what the column headings mean? Specifically:
e.g.
SPAC12B10.01c | | 1 | - | 4570105 | 1 | 1 | 28 | 85 | 108 | 110 | 0.96491228
has a single peak position in the spreadsheet, but in the browser there are many clusters called:
https://www.pombase.org/jbrowse/?loc=I%3A4558733..4578733&tracks=PomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features%2CPoly(A)%20sites%20during%20vegetative%20growth%20(forward%20strand)%20-%20Mata%20(2013)%2CPoly(A)%20sites%20during%20vegetative%20growth%20(reverse%20strand)%20-%20Mata%20(2013)&highlight=
(unless I am misunderstanding what I am looking at)?