lavenderca / TSScall

TSScall identifies transcription start sites (TSSs) from Start-seq data (Nechaev et al. Science, 2010). Operating both with and without a reference annotation, TSScall allows for rapid annotation of TSSs across an entire genome.
MIT License

what's the format of annotation reference file? #6

Closed cxue closed 6 years ago

cxue commented 6 years ago

Hi, Lavenderca, I like this program, but I have a question about using it. I am using GENCODE M15 in GTF format as the annotation reference file. The first two records look like this:

chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

But when I run TSScall with this command:

python TSScall.py -a gencode.vM15.annotation.gtf data.forward.mm10.bed data.reverse.mm10.bed mm10_chrom.sizes.txt TSS.annotated.bed

I get the following error message:

Reading in bedGraph files...
Calculating read threshold...
Read threshold set to 3
Reading in annotation file...
Traceback (most recent call last):
  File "TSScall.py", line 977, in <module>
    TSSCalling(**vars(args))
  File "TSScall.py", line 182, in __init__
    self.execute()
  File "TSScall.py", line 904, in execute
    readInReferenceAnnotation(self.annotation_file)
  File "TSScall.py", line 72, in readInReferenceAnnotation
    attributes = line.strip().split('\t')
ValueError: need more than 1 value to unpack

Could you help me? Thanks

Cheng Xue

lavenderca commented 6 years ago

Hi, Cheng!

It looks like your GTF file is space-delimited, not tab-delimited. Is this true? If so, that would result in that error. Was the GTF file downloaded directly from GENCODE or from another source?
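One quick way to settle the delimiter question is to count the tab-separated fields on each line. The sketch below is illustrative only and is not part of TSScall; the function name `find_malformed_gtf_lines` is hypothetical. It assumes a standard GTF with nine tab-separated columns and treats lines starting with `#` as comments or headers.

```python
import io


def find_malformed_gtf_lines(handle, expected_fields=9):
    """Return (line_number, field_count) for data lines that do not
    split into the expected number of tab-separated fields."""
    problems = []
    for line_number, line in enumerate(handle, start=1):
        stripped = line.rstrip("\n")
        # Blank lines and '#' comment/header lines are not data records.
        if not stripped or stripped.startswith("#"):
            continue
        field_count = len(stripped.split("\t"))
        if field_count != expected_fields:
            problems.append((line_number, field_count))
    return problems


# Small in-memory example: a header, a tab-delimited record,
# and a space-delimited record.
sample = io.StringIO(
    "##description: example header\n"
    'chr1\tHAVANA\tgene\t3073253\t3074322\t.\t+\t.\tgene_id "X";\n'
    'chr1 HAVANA gene 3073253 3074322 . + . gene_id "X";\n'
)
print(find_malformed_gtf_lines(sample))  # [(3, 1)]
```

To check a real file, pass an open file handle instead of the in-memory sample. An empty result means every data line has nine tab-separated fields, in which case the delimiter is not the problem.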

Thanks!

Andy

cxue commented 6 years ago

Hi, Andy, The GTF is tab-delimited. Please see the following command:

[cxue@smvxu TSScall-master]$ awk -F "\t" '{print $1"\t"$4;}' gencode.vM15.annotation.gtf|more
chr1 3073253
chr1 3073253
chr1 3073253
chr1 3102016
chr1 3102016

The gencode.vM15.annotation.gtf file was downloaded from https://www.gencodegenes.org/mouse_releases/. I also tried an Ensembl GTF (http://aug2017.archive.ensembl.org/info/data/ftp/index.html) and got the same error message.

Thanks,
Cheng

cxue commented 6 years ago

Hi, Andy, After I removed the last several lines of the GTF file, I got the following error message:

Reading in bedGraph files...
Calculating read threshold...
Read threshold set to 3
Reading in annotation file...
Traceback (most recent call last):
  File "TSScall.py", line 977, in <module>
    TSSCalling(**vars(args))
  File "TSScall.py", line 182, in __init__
    self.execute()
  File "TSScall.py", line 904, in execute
    readInReferenceAnnotation(self.annotation_file)
  File "TSScall.py", line 79, in readInReferenceAnnotation
    values.append(entry.split('\"')[1].strip())
IndexError: list index out of range

Thanks,
Cheng

cxue commented 6 years ago

Hi Andy, I see what's wrong. The parser does not check for attribute entries whose value contains no double quote, such as "level 2;". Splitting such an entry on " yields a single element, so index [1] is out of range. It would also help to set default values for important keys such as transcript_id; otherwise a record that lacks them will stop the run.
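A minimal sketch of the kind of check described above. This is not the fix that was committed to TSScall; the function name `parse_gtf_attributes` and the default value "NA" are hypothetical. It accepts both quoted and unquoted values and fills in defaults for missing keys. It assumes no attribute value contains a semicolon, and when a key such as tag repeats, the last value wins.

```python
def parse_gtf_attributes(attribute_field, defaults=None):
    """Parse the ninth GTF column into a dict.

    Handles quoted values (gene_id "X";) and unquoted ones (level 2;).
    Keys missing from the record keep the value given in `defaults`.
    """
    attributes = dict(defaults or {})
    for entry in attribute_field.strip().split(";"):
        entry = entry.strip()
        if not entry:
            continue  # trailing ';' leaves an empty piece
        # Split the key from the value at the first space only.
        key, _, value = entry.partition(" ")
        # Drop surrounding quotes if present; unquoted values pass through.
        attributes[key] = value.strip().strip('"')
    return attributes


field = (
    'gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; '
    'gene_name "4933401J01Rik"; level 2; '
    'havana_gene "OTTMUSG00000049935.1";'
)
# "NA" as the fallback for transcript_id is an illustrative choice.
parsed = parse_gtf_attributes(field, defaults={"transcript_id": "NA"})
print(parsed["level"])          # 2
print(parsed["transcript_id"])  # NA
```

Using `partition` instead of indexing the result of a split on the quote character means an unquoted entry such as level 2 cannot raise an IndexError.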

Best,
Cheng

lavenderca commented 6 years ago

Hi, Cheng!

I fixed the code so it should now work with your GTF file. Please run using the version from the latest commit.

Thanks!

Andy