schatzlab / scikit-ribo

Accurate estimation and robust modelling of translation dynamics at codon resolution
GNU General Public License v2.0
18 stars 8 forks source link

Incorrect codon files generated by scikit-ribo-build for mm10 #7

Open nabeel-bioinfo opened 6 years ago

nabeel-bioinfo commented 6 years ago

I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/. Specifically, there is inaccurate assignment of codons of transcripts.

For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2 $ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12

chr17   35916390    35916393    uc008cjg.1  -8  -   CGG
chr17   35916387    35916390    uc008cjg.1  -7  -   CAG
chr17   35916384    35916387    uc008cjg.1  -6  -   GCC
chr17   35916381    35916384    uc008cjg.1  -5  -   GTA
chr17   35916378    35916381    uc008cjg.1  -4  -   CGT
chr17   35916375    35916378    uc008cjg.1  -3  -   CCG
chr17   35916372    35916375    uc008cjg.1  -2  -   GTG
chr17   35916369    35916372    uc008cjg.1  -1  -   TGC
chr17   35916366    35916369    uc008cjg.1  0   -   GTC
chr17   35916363    35916366    uc008cjg.1  1   -   AAG
chr17   35916360    35916363    uc008cjg.1  2   -   ATG
chr17   35916357    35916360    uc008cjg.1  3   -   GCT

Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.

$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20

chr15   37007423    37007426    uc007vnb.1  -8  -   CCG
chr15   37007420    37007423    uc007vnb.1  -7  -   CAG
chr15   37007417    37007420    uc007vnb.1  -6  -   GAA
chr15   37007414    37007417    uc007vnb.1  -5  -   GGT
chr15   37007411    37007414    uc007vnb.1  -4  -   GAG
chr15   37007408    37007411    uc007vnb.1  -3  -   GGG
chr15   37007405    37007408    uc007vnb.1  -2  -   CGG
chr15   37007402    37007405    uc007vnb.1  -1  -   GAC
chr15   37007399    37007402    uc007vnb.1  0   -   CGC
chr15   37007396    37007399    uc007vnb.1  1   -   GCG
chr15   37007393    37007396    uc007vnb.1  2   -   GGC
chr15   37007390    37007393    uc007vnb.1  3   -   AAG

Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon

cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c

727 0 AAA
    436 0 AAC
    748 0 AAG
    376 0 AAT
    686 0 ACA
    473 0 ACC
    295 0 ACG
    854 0 ACT
   1768 0 AGA
   1212 0 AGC
   1207 0 AGG
   1724 0 AGT
    377 0 ATA
    522 0 ATC
   **3380 0 ATG**
    704 0 ATT
    228 0 CAA
    377 0 CAC
    493 0 CAG
    167 0 CAT
    400 0 CCA
    569 0 CCC
    399 0 CCG
    477 0 CCT
    188 0 CGA
    602 0 CGC
    637 0 CGG
    136 0 CGT
    202 0 CTA
    831 0 CTC
    625 0 CTG
    594 0 CTT
    990 0 GAA
    779 0 GAC
   1888 0 GAG
    585 0 GAT
    921 0 GCA
   1177 0 GCC
   1318 0 GCG
    928 0 GCT
   1856 0 GGA
   1691 0 GGC
   2329 0 GGG
   1037 0 GGT
    523 0 GTA
   1015 0 GTC
   1304 0 GTG
    916 0 GTT
     90 0 TAA
     88 0 TAC
    184 0 TAG
    115 0 TAT
    284 0 TCA
    401 0 TCC
    162 0 TCG
    386 0 TCT
    285 0 TGA
    392 0 TGC
    460 0 TGG
    270 0 TGT
    155 0 TTA
    506 0 TTC
    274 0 TTG
    632 0 TTT

I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.

mschatz commented 6 years ago

Yes, we are aware of some issues when processing higher eukaryotic genomes that seems to be related to the way that UTRs and spliced transcripts are processed by the libraries we are using. We are hoping to address this in the near future, but unfortunately, we do not have a specific timeline available at this time. Please keep an eye on the github pages for a future announcement.

Thank you,

Mike

On Thu, Jul 26, 2018 at 1:41 PM nabeel1990 notifications@github.com wrote:

I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/. Specifically, there is inaccurate assignment of codons of transcripts.

For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while the in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2 $ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12

chr17 35916390 35916393 uc008cjg.1 -8 - CGG chr17 35916387 35916390 uc008cjg.1 -7 - CAG chr17 35916384 35916387 uc008cjg.1 -6 - GCC chr17 35916381 35916384 uc008cjg.1 -5 - GTA chr17 35916378 35916381 uc008cjg.1 -4 - CGT chr17 35916375 35916378 uc008cjg.1 -3 - CCG chr17 35916372 35916375 uc008cjg.1 -2 - GTG chr17 35916369 35916372 uc008cjg.1 -1 - TGC chr17 35916366 35916369 uc008cjg.1 0 - GTC chr17 35916363 35916366 uc008cjg.1 1 - AAG chr17 35916360 35916363 uc008cjg.1 2 - ATG chr17 35916357 35916360 uc008cjg.1 3 - GCT

Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.

$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20

chr15 37007423 37007426 uc007vnb.1 -8 - CCG chr15 37007420 37007423 uc007vnb.1 -7 - CAG chr15 37007417 37007420 uc007vnb.1 -6 - GAA chr15 37007414 37007417 uc007vnb.1 -5 - GGT chr15 37007411 37007414 uc007vnb.1 -4 - GAG chr15 37007408 37007411 uc007vnb.1 -3 - GGG chr15 37007405 37007408 uc007vnb.1 -2 - CGG chr15 37007402 37007405 uc007vnb.1 -1 - GAC chr15 37007399 37007402 uc007vnb.1 0 - CGC chr15 37007396 37007399 uc007vnb.1 1 - GCG chr15 37007393 37007396 uc007vnb.1 2 - GGC chr15 37007390 37007393 uc007vnb.1 3 - AAG

Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon

cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c

727 0 AAA 436 0 AAC 748 0 AAG 376 0 AAT 686 0 ACA 473 0 ACC 295 0 ACG 854 0 ACT 1768 0 AGA 1212 0 AGC 1207 0 AGG 1724 0 AGT 377 0 ATA 522 0 ATC 3380 0 ATG 704 0 ATT 228 0 CAA 377 0 CAC 493 0 CAG 167 0 CAT 400 0 CCA 569 0 CCC 399 0 CCG 477 0 CCT 188 0 CGA 602 0 CGC 637 0 CGG 136 0 CGT 202 0 CTA 831 0 CTC 625 0 CTG 594 0 CTT 990 0 GAA 779 0 GAC 1888 0 GAG 585 0 GAT 921 0 GCA 1177 0 GCC 1318 0 GCG 928 0 GCT 1856 0 GGA 1691 0 GGC 2329 0 GGG 1037 0 GGT 523 0 GTA 1015 0 GTC 1304 0 GTG 916 0 GTT 90 0 TAA 88 0 TAC 184 0 TAG 115 0 TAT 284 0 TCA 401 0 TCC 162 0 TCG 386 0 TCT 285 0 TGA 392 0 TGC 460 0 TGG 270 0 TGT 155 0 TTA 506 0 TTC 274 0 TTG 632 0 TTT

I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/scikit-ribo/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL98wivj7oXg4ucu14SeJREotb58CHvks5uKf9ZgaJpZM4ViQ40 .