Open nabeel-bioinfo opened 6 years ago
Yes, we are aware of some issues when processing higher eukaryotic genomes that seems to be related to the way that UTRs and spliced transcripts are processed by the libraries we are using. We are hoping to address this in the near future, but unfortunately, we do not have a specific timeline available at this time. Please keep an eye on the github pages for a future announcement.
Thank you,
Mike
On Thu, Jul 26, 2018 at 1:41 PM nabeel1990 notifications@github.com wrote:
I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/. Specifically, there is inaccurate assignment of codons of transcripts.
For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while the in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2 $ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12
chr17 35916390 35916393 uc008cjg.1 -8 - CGG chr17 35916387 35916390 uc008cjg.1 -7 - CAG chr17 35916384 35916387 uc008cjg.1 -6 - GCC chr17 35916381 35916384 uc008cjg.1 -5 - GTA chr17 35916378 35916381 uc008cjg.1 -4 - CGT chr17 35916375 35916378 uc008cjg.1 -3 - CCG chr17 35916372 35916375 uc008cjg.1 -2 - GTG chr17 35916369 35916372 uc008cjg.1 -1 - TGC chr17 35916366 35916369 uc008cjg.1 0 - GTC chr17 35916363 35916366 uc008cjg.1 1 - AAG chr17 35916360 35916363 uc008cjg.1 2 - ATG chr17 35916357 35916360 uc008cjg.1 3 - GCT
Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.
$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20
chr15 37007423 37007426 uc007vnb.1 -8 - CCG chr15 37007420 37007423 uc007vnb.1 -7 - CAG chr15 37007417 37007420 uc007vnb.1 -6 - GAA chr15 37007414 37007417 uc007vnb.1 -5 - GGT chr15 37007411 37007414 uc007vnb.1 -4 - GAG chr15 37007408 37007411 uc007vnb.1 -3 - GGG chr15 37007405 37007408 uc007vnb.1 -2 - CGG chr15 37007402 37007405 uc007vnb.1 -1 - GAC chr15 37007399 37007402 uc007vnb.1 0 - CGC chr15 37007396 37007399 uc007vnb.1 1 - GCG chr15 37007393 37007396 uc007vnb.1 2 - GGC chr15 37007390 37007393 uc007vnb.1 3 - AAG
Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon
cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c
727 0 AAA 436 0 AAC 748 0 AAG 376 0 AAT 686 0 ACA 473 0 ACC 295 0 ACG 854 0 ACT 1768 0 AGA 1212 0 AGC 1207 0 AGG 1724 0 AGT 377 0 ATA 522 0 ATC 3380 0 ATG 704 0 ATT 228 0 CAA 377 0 CAC 493 0 CAG 167 0 CAT 400 0 CCA 569 0 CCC 399 0 CCG 477 0 CCT 188 0 CGA 602 0 CGC 637 0 CGG 136 0 CGT 202 0 CTA 831 0 CTC 625 0 CTG 594 0 CTT 990 0 GAA 779 0 GAC 1888 0 GAG 585 0 GAT 921 0 GCA 1177 0 GCC 1318 0 GCG 928 0 GCT 1856 0 GGA 1691 0 GGC 2329 0 GGG 1037 0 GGT 523 0 GTA 1015 0 GTC 1304 0 GTG 916 0 GTT 90 0 TAA 88 0 TAC 184 0 TAG 115 0 TAT 284 0 TCA 401 0 TCC 162 0 TCG 386 0 TCT 285 0 TGA 392 0 TGC 460 0 TGG 270 0 TGT 155 0 TTA 506 0 TTC 274 0 TTG 632 0 TTT
I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/scikit-ribo/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL98wivj7oXg4ucu14SeJREotb58CHvks5uKf9ZgaJpZM4ViQ40 .
I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/. Specifically, there is inaccurate assignment of codons of transcripts.
For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2
$ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12
Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.
$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20
Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon
cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c
I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.