RNAfold / pre-built files for human

matteobenelli commented 6 years ago

Hi, I'm testing scikit-ribo on a set of ribo-seq experiments from human cancer cell lines. I am experiencing the same issue reported by @aemoor some months ago while running gtf_preprocess.py (see https://github.com/schatzlab/scikit-ribo/issues/2) (error reported below).

[status] Reading the input file: /elaborazioni/sharedCO/SU2C_Data/scores_SU2C_v2/stuff/Homo_sapiens.GRCh37.75.gtf [execute] Starting the pre-processing module [execute] Loading the the gtf file in to sql db sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. Traceback (most recent call last): File "scikit_ribo/gtf_preprocess.py", line 268, in worker.convertGtf() File "scikit_ribo/gtf_preprocess.py", line 55, in convertGtf geneBed12 = self.db.bed12(geneName, name_field='gene_id') File "/home/matteo.benelli/.local/lib/python3.5/site-packages/gffutils/interface.py", line 1218, in bed12 % (last, feature.stop)) ValueError: End of last exon (39127564) does not match end of feature (39127595)

Did you solve that issue? It would be great if you could share pre-built data for human model (Grch37). Thanks!

mschatz commented 6 years ago

Hi Matteo,

Sorry for the delay on this. There is a known defect in the gffutils code that we use for processing gtf files where it doesnt know how to deal with untranslated regions at the end of transcripts correctly (see https://github.com/daler/gffutils/issues/96). We reached out to the author and even provided code for him, but he has yet to apply the fix. In parallel we are investigating potential workaround but might take a while to get there.

Thank you,

Mike

On Wed, Feb 7, 2018 at 4:51 PM, Matteo Benelli notifications@github.com wrote:

Hi, I'm testing scikit-ribo on a set of ribo-seq experiments from human cancer cell lines. I am experiencing the same issue reported by @aemoor https://github.com/aemoor some months ago while running gtf_preprocess.py (see #2 https://github.com/schatzlab/scikit-ribo/issues/2) (error reported below).

[status] Reading the input file: /elaborazioni/sharedCO/SU2C_ Data/scores_SU2C_v2/stuff/Homo_sapiens.GRCh37.75.gtf [execute] Starting the pre-processing module [execute] Loading the the gtf file in to sql db sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. Traceback (most recent call last): File "scikit_ribo/gtf_preprocess.py", line 268, in worker.convertGtf() File "scikit_ribo/gtf_preprocess.py", line 55, in convertGtf geneBed12 = self.db.bed12(geneName, name_field='gene_id') File "/home/matteo.benelli/.local/lib/python3.5/site-packages/gffutils/interface.py", line 1218, in bed12 % (last, feature.stop)) ValueError: End of last exon (39127564) does not match end of feature (39127595)

Did you solve that issue? It would be great if you could share pre-built data for human model (Grch37). Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/scikit-ribo/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL98ySwJFYtHJQ4rmZMdvqmudo3Qu7Cks5tShrHgaJpZM4R9ct4 .

elabaronne commented 5 years ago

Hi all, I had the same issue using a gtr from the mouse (mm10) and I find a way to correct that. You can build the model if the gtf had 1 transcript per gene. For that I select first transcript that are tagged as "appris_principal_1". For several genes (~ 3000 in mm10) I still have 2 transcript per gene and so I select the longest transcript using cgat toolset. There is the code that I use :

awk -F "\t" ' { if ($3 ~ /gene/) { 
        geneline=$0 ;
    }
        else if ($3 ~ /transcript/ && $9 ~ /appris_principal_1/) {
            print geneline"\n"$0 ;
        }
        else if ($9 ~ /appris_principal_1/) {
            print $0 ;
        }
 }' gencode.vM18.annotation.gtf > gencode.vM18.annotation_appris_prinp_1.gtf 

cat gencode.vM18.annotation_appris_prinp_1.gtf  | cgat gtf2gtf --method=filter --filter-method=longest-transcript > gencode.vM18.annotation_1transcriptPerGene.gtf

I hope this helps,

mschatz commented 5 years ago

Great tip! Does that pass the scikit ripo validation? It does some sanity checking on stop codons etc in the sequences

Mike

On Wed, Oct 17, 2018 at 6:36 AM elabaronne notifications@github.com wrote:

Hi all, I had the same issue using a gtr from the mouse (mm10) and I find a way to correct that. You can build the model if the gtf had 1 transcript per gene. For that I select first transcript that are tagged as "appris_principal_1". For several genes (~ 3000 in mm10) I still have 2 transcript per gene and so I select the longest transcript using cgat toolset https://www.cgat.org/downloads/public/cgat/documentation/. There is the code that I use :

awk -F "\t" ' { if ($3 ~ /gene/) { geneline=$0 ; } else if ($3 ~ /transcript/ && $9 ~ /appris_principal_1/) { print geneline"\n"$0 ; } else if ($9 ~ /appris_principal_1/) { print $0 ; } }' gencode.vM18.annotation.gtf > gencode.vM18.annotation_appris_prinp_1.gtf

cat gencode.vM18.annotation_appris_prinp_1.gtf | cgat gtf2gtf --method=filter --filter-method=longest-transcript > gencode.vM18.annotation_1transcriptPerGene.gtf

I hope this helps,

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/scikit-ribo/issues/3#issuecomment-430578204, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL984Gtpl92iuSsknKogKY65qgK1Vqtks5ulwgTgaJpZM4R9ct4 .

elabaronne commented 5 years ago

Doing that, I am able to run scikit ribo build without any errors. But I have the same issue as mentioned in the issue #7 (Incorrect codon files generated by scikit-ribo-build for mm10) where the codon 0 in the file mm10.codons.bed correspond to the first triplet of the transcript (so the first nucleotides of the 5'UTR) instead of the start codon.

By the way, I don't know if it is normal, but doing the build step, I need a particular computer with 786 Go of RAM and the job ran 23h34min. As we build a model per sample, it is very time consuming... Any way to parallelize the job ? Or may be it can be judicious to build an model with a gtf that contain only expressed genes (that may reduce the number of gene to ~ 6000 - 7000 instead of 17978 !!)

mschatz commented 5 years ago

Is there a way to massage the annotation to only include the CDS and skip the UTR? That might be the most practical approach to get this run now. And yes, it would make sense to skip any genes that have zero or very low amounts of expression.

Thanks!

Mike

On Thu, Oct 18, 2018 at 4:29 AM elabaronne notifications@github.com wrote:

Doing that, I am able to run scikit ribo build without any errors. But I have the same issue as mentioned in the issue #7 https://github.com/schatzlab/scikit-ribo/issues/7 (Incorrect codon files generated by scikit-ribo-build for mm10) where the codon 0 in the file mm10.codons.bed correspond to the first triplet of the transcript (so the first nucleotides of the 5'UTR) instead of the start codon.

By the way, I don't know if it is normal, but doing the build step, I need a particular computer with 786 Go of RAM and the job ran 23h34min. As we build a model per sample, it is very time consuming... Any way to parallelize the job ? Or may be it can be judicious to build an model with a gtf that contain only expressed genes (that may reduce the number of gene to ~ 6000 - 7000 instead of 17978 !!)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/scikit-ribo/issues/3#issuecomment-430921849, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL980iCrIpS75zpXK15289325Y6DHWxks5umDvagaJpZM4R9ct4 .

drewjbeh commented 5 years ago

Hi there,

I am also battling with the same gtf problems as others. I am trying to work with human ribosome profiling data.

I managed to get scikit-ribo-build to run by removing UTRs from the annotation, as you suggested @mschatz, and by using one transcript per gene as suggested by @elabaronne. However, the annotation still contains exon entries that overlap with UTRs which leads to incorrect codon assignment as mentioned in #7. Removing exons and leaving CDSs only means that gffutils fails with the following error gffutils.exceptions.FeatureNotFoundError: ENSG00000187634.11 So exons are needed in the gtf. An idea I had was to go through the gtf and adjust the first exon of every transcript to start only at the coordinates of the start codon and not before. This combined with removing UTR annotations should work - I need to write the script and test though.

The other more concerning issue I'm having is that scikit-ribo does not handle large genomes well. In the "Building the index for each position at the codon level" step of scikit-ribo-build, the server I am using eventually runs out of memory (256Gb) and kills the job. I got around this by testing on a subset of genes from one chromosome, and only using that chromosome as input fasta and it worked (albeit with the incorrect codon assignment issue). This is something I do no know how to fix, even if I did know the code of scikit-ribo well enough to know where to start looking. Any suggestions?

Thanks Drew

drewjbeh commented 5 years ago

I have just tested building the index using the human genome on a server with 512Gb of memory and the job was killed as well after about 12 hours. Even if I fix the problems with the gtf and codon assignment, this problem is going to stop me from being able to use scikit-ribo. Is there anything you can suggest to get around this, or is there active development on scikit-ribo to fix these problems?

elabaronne commented 5 years ago

Hi drewjbeh, JusI had the same issue (out of memory) using a computer with 256 Go and 384 Go of RAM. But it works with 768 Go of RAM (It takes 23 hours...). As I suggested before I could be interesting to keep only expressed genes in the gtf to limit the time and the memory used by the model building step (I do not have enough time to try this method for the moment...)

elabaronne commented 5 years ago

Hi drewjbeh, JusI had the same issue (out of memory) using a computer with 256 Go and 384 Go of RAM. But it works with 768 Go of RAM (It takes 23 hours...). As I suggested before I could be interesting to keep only expressed genes in the gtf to limit the time and the memory used by the model building step (I do not have enough time to try this method for the moment...)

mt1022 commented 5 years ago

I have made some modifications of the scripts in scikit_ribo folder to use with fly genome, some of which are based on the comments of @mschatz, @drewjbeh and other users here. Hope this helps.

drewjbeh commented 5 years ago

@mt1022 this sounds great. Could you give a quick outline of the changes and whether you think these will help generally for other bigger genomes?

Is running the analysis the same as in the original version?

mt1022 commented 5 years ago

@drewjbeh, Sure. Here are some tips to run scikit-ribo for other genomes:

The scikit-ribo is design for analysis of yeast data, which does not have UTR annotation. In order to run with other genomes, the boundaries of exons in gtf should be confined to CDS regions (including stop codons). This can be generated by merging CDS and stop codons intervals of same transcript (GRangeList in R) with GenomicRanges::reduce.
For each gene, I only used the CDS annotation of the major isoform (most abundant one in kallisto results). The tpm of all isoforms of the same gene were aggregated and used for build.
To run call_rnafold.py, the expanded CDS fasta can be first created with scikit-ribo-build. Break the execution when the prefix.expand.CDS.fa was created. And then run call_rnafold.py. It is very slow for long CDSs so that I excluded CDSs longer than 10k nt (will remove dozens of genes).
the fasta header line generated with bedtools getfasta is different depend on the exact version. The faiter in gtf_preprocess.py should be modified correspondingly.
Although sorted bam ared used as input for scikit-ribo-run.py, chromosomes in bam are usually sorted based on their appearance in the referene fasta rather alphabetically. This will trigger errors when executing bam_process.py. The error is fixed by adding an additional step of sorting.
The build step takes huge memory. I have several samples and have to run the build step one by one. The build step will crash if I run several build steps simoutaneously.

I firsted ran with the original scripts on yeast data (Weinberg 2016) and got almost the same results as shown in scikit-ribo paper. Changes mentioned above are used to make sure the input files are parsed properly and do not affect any computational step. The modified scripts generated the same results as the original version.

I am not sure whether all of those changes are approriate. Any comments and suggestions are appreciated.

Currently, I have difficulty in executing the scikit-ribo-run.py with -c parameter. log2_TE are all zeros when -c is used.

schatzlab / scikit-ribo

RNAfold / pre-built files for human #3