pombase / pombase-chado

PomBase code for accessing Chado

UTRs task1 Mata poly A sites #819

Closed ValWood closed 2 years ago

ValWood commented 3 years ago

Conditions: 972 h- cells were grown in Edinburgh Minimal Medium (EMM) at 32 °C

I'm trying to establish the best way to get the cluster peak site into Artemis.

headings are

| Systematic name | Common name | Chromosome | Strand | UTR start | UTR end | Cluster number | Cluster start | Cluster length | Peak position | Total reads in cluster | Fraction of reads in cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPAC1002.02 | mug31 | 1 | + | 1799818 | 1 | 1 | 244 | 45 | 248 | 135 | 0.97826087 |
| SPAC1002.03c | gls2 | 1 | - | 1799715 | 1 | 1 | 84 | 121 | 164 | 266 | 0.84984026 |
| SPAC1002.04c | taf11 | 1 | - | 1803424 | 1 | 1 | 252 | 54 | 288 | 111 | 0.77622378 |
| SPAC1002.04c | taf11 | 1 | - | 1803424 | 2 | 2 | 145 | 12 | 149 | 31 | 0.21678322 |
| SPAC1002.05c | jmj2 | 1 | - | 1804348 | 1 | 1 | 41 | 46 | 63 | 61 | 1 |

I can't find any information about precisely what the column headings mean. Specifically:

  1. How is UTR end defined? It always seems to be a small integer and is exactly the same as the value in the next column, "Cluster number". @juanmatacambridge is this correct?
  2. Do we just need to take UTR start and add or subtract "Peak position" (depending on strand) to obtain the genome position of the cluster peak?
  3. Does anyone recall where the data files that provided the genome browser data came from? They do not seem to be the files provided with the manuscript: this dataset has multiple peaks, but the SUPP data spreadsheet has a much smaller number of peaks.

e.g.

SPAC12B10.01c |   | 1 | - | 4570105 | 1 | 1 | 28 | 85 | 108 | 110 | 0.96491228

has a single peak position in the spreadsheet, but in the browser there are many clusters called:

https://www.pombase.org/jbrowse/?loc=I%3A4558733..4578733&tracks=PomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features%2CPoly(A)%20sites%20during%20vegetative%20growth%20(forward%20strand)%20-%20Mata%20(2013)%2CPoly(A)%20sites%20during%20vegetative%20growth%20(reverse%20strand)%20-%20Mata%20(2013)&highlight=

(unless I am misunderstanding what I am looking at)?

juanmatacambridge commented 3 years ago

To compile a table of annotated CS (or TTS) with a single number you should use the data in table S7, which gives the length of the 3'-UTRs as annotated (at the time of writing), as defined by a cluster of CS, or as defined by the most distant CS. The second data set (cluster) is probably the best, as it gives the length of the 'typical' 3'-UTR. Simply add the length of the 3'-UTR to your coordinates (end of CDS). These data have been validated independently by the Cramer lab (https://www.ncbi.nlm.nih.gov/pubmed/26883383); the median difference between their data and our dataset is 3 nt.

ValWood commented 3 years ago

For the 3' UTRs, @kimrutherford will create features using the existing 3' end of the CDS and Dataset 7, column D ("Length of Longest UTR defined by cluster peaks"), from 2013RNABIOL0135R-Supdata.xlsx

This means I will only need to make decisions for the TSS ;)

Need feature type 3'UTR /SO=SO:0000205 /systematic_ID=xxxxxxx /db_xref=PMID:23900342 /note="feature created using major cleavage site peak, vegetative growth, EMM"
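A rough sketch of how one such feature could be written out, for illustration only. It assumes the files for Artemis use the EMBL feature-table layout (feature key in column 6, location in column 22, qualifiers on following lines). The function name `utr_feature` is hypothetical, the qualifiers are copied as written above (quoting conventions not confirmed), and the coordinates in the example are arbitrary placeholders.

```python
def utr_feature(systematic_id: str, start: int, end: int, strand: str) -> str:
    """Format one 3'UTR feature as EMBL-style feature-table lines.

    Assumption: start <= end are genome coordinates, and reverse-strand
    features are written as complement(start..end).
    """
    location = f"{start}..{end}"
    if strand == "-":
        location = f"complement({location})"

    # Qualifiers as listed in the comment above.
    qualifiers = [
        "/SO=SO:0000205",
        f"/systematic_ID={systematic_id}",
        "/db_xref=PMID:23900342",
        '/note="feature created using major cleavage site peak, '
        'vegetative growth, EMM"',
    ]

    # Feature key padded so the location starts in column 22.
    lines = ["FT   " + "3'UTR".ljust(16) + location]
    # Qualifier lines also start in column 22.
    lines += ["FT" + " " * 19 + q for q in qualifiers]
    return "\n".join(lines)


# Placeholder ID and arbitrary coordinates, not taken from the dataset.
print(utr_feature("xxxxxxx", 1000, 1200, "-"))
```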

~We could even link to the tracks using the coordinates from the transcript section~

kimrutherford commented 3 years ago

> We could even link to the tracks using the coordinates from the transcript section

I don't understand that bit.

kimrutherford commented 3 years ago

Out of curiosity, I pulled out the longest UTR from the dataset, which is:

SPCC320.06 |   | 5636 | 5161 | 5267

It doesn't seem to match what's in JBrowse?: https://www.pombase.org/jbrowse/?loc=III%3A147551..163550&tracks=PomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features%2CPoly(A)%20sites%20during%20vegetative%20growth%20(forward%20strand)%20-%20Mata%20(2013)%2CPoly(A)%20sites%20during%20vegetative%20growth%20(reverse%20strand)%20-%20Mata%20(2013)&highlight=

I thought there should be a peak at 158577-5161 = 153416
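The arithmetic above can be sketched as a small helper. This is only an illustration: the name `utr_distal_coord` is hypothetical, and it assumes that `cds_end` is the genome coordinate of the 3' end of the CDS and that the UTR length is added on the forward strand and subtracted on the reverse strand.

```python
def utr_distal_coord(cds_end: int, utr_length: int, strand: str) -> int:
    """Return the genome coordinate of the distal (3') end of a 3' UTR.

    Assumption: cds_end is the coordinate of the 3' end of the CDS, i.e. the
    highest CDS coordinate on '+' and the lowest CDS coordinate on '-'.
    """
    if strand == "+":
        return cds_end + utr_length
    if strand == "-":
        return cds_end - utr_length
    raise ValueError(f"unknown strand: {strand!r}")


# Numbers from this comment: SPCC320.06, cluster-defined UTR length 5161.
print(utr_distal_coord(158577, 5161, "-"))  # 153416
```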

kimrutherford commented 3 years ago

There were a few entries in the dataset that used synonyms. Those were easy to fix. There are four other IDs that I need help with:

I can't find these two in PomBase: SPAC4H3.12c and SPAC1F12.03c

SPBC1685.12c is not a current ID but there is a similar ID: SPBC1685.12c-antisense-1

SPBC8E4.02c is the ID of an RNA (SPNCRNA.9001) so I can't make a UTR. But that RNA overlaps a coding gene (SPBC8E4.01c) so maybe there was a mix up when the data set was created?

Some gene IDs (e.g. SPBC13E7.09) have multiple lines in the data file, but the value in the "Length of Longest UTR defined by cluster peaks" column is the same for each line, so that's not a problem.
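A check along these lines could confirm that repeated IDs always agree on the UTR length. It is a minimal sketch, not the script used for this ticket: `conflicting_ids` is a hypothetical name, and the rows are made-up placeholders rather than values from the dataset.

```python
from collections import defaultdict


def conflicting_ids(rows):
    """Return the IDs whose repeated lines disagree on the UTR length.

    rows is an iterable of (systematic_id, utr_length) pairs.
    """
    lengths_by_id = defaultdict(set)
    for systematic_id, utr_length in rows:
        lengths_by_id[systematic_id].add(utr_length)
    return sorted(sid for sid, lengths in lengths_by_id.items()
                  if len(lengths) > 1)


# Placeholder rows, not real values from the supplementary data.
rows = [("GENE_A", 120), ("GENE_A", 120), ("GENE_B", 80), ("GENE_B", 95)]
print(conflicting_ids(rows))  # ['GENE_B']
```

An empty list would mean any one of the duplicate lines can be used for each gene.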

kimrutherford commented 3 years ago

Here's a first attempt: https://curation.pombase.org/kmr44/mata-polyA-sites-utrs-for-artemis.tar.gz

Now that I have a script, it's easy to re-create these files if things need tweaking. I added a /color= qualifier to make them stand out.

For the record, here's the script: utr_process_ticket_819.pl.txt

ValWood commented 3 years ago

> > We could even link to the tracks using the coordinates from the transcript section
>
> I don't understand that bit.

yeah ignore that.

ValWood commented 3 years ago

> I thought there should be a peak at 158577-5161 = 153416

There is a peak at 153416. Are you looking at the reverse strand? It isn't the strongest peak of all, but it is the highest peak in the cluster that would give the longest UTR.

kimrutherford commented 3 years ago

> There is a peak at 153416. Are you looking at the reverse strand?

Yep, I was looking at the wrong strand. :-)

ValWood commented 3 years ago

https://www.pombase.org/status/new-and-removed-genes

SPAC4H3.12c | not protein-coding (PMID:24929437, PMID:26615217) | 2015-12-11
SPAC1F12.03c | removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature | 2012-07-16
SPBC1685.12c | not protein-coding (PMID:24929437, PMID:26615217) | 2015-12-11

ValWood commented 3 years ago

> Yep, I was looking at the wrong strand. :-)

I knew that - because I was too for the first few minutes ;-)

This is good. Absolutely spot on!

kimrutherford commented 3 years ago

> SPBC8E4.02c is the ID of an RNA ...

That should be: SPBC8E4.02c is a synonym of an RNA ...

> SPBC8E4.02c is the ID of an RNA (SPNCRNA.9001) so I can't make a UTR. But that RNA overlaps a coding gene (SPBC8E4.01c) so maybe there was a mix up when the data set was created?

Now I look more closely, I'm not sure that's true as there is a line for SPBC8E4.01c as well:

SPBC8E4.01c             230     230     425

It's different to SPBC8E4.02c

SPBC8E4.02c             647     284     617

ValWood commented 2 years ago

closing, new tasks will have new tickets.