mills-lab / spectre

Spectral coherence classification of actively translated regions in ribosome profiling sequence data.
BSD 3-Clause "New" or "Revised" License

Use SPECtre on predicted sequences #7

Closed. PSSUN closed this issue 5 years ago

PSSUN commented 6 years ago

Dear author, I want to look for evidence of translation in predicted sequences such as lincRNAs or circRNAs. I have a FASTA file of these sequences and their positions in the genome. Since SPECtre requires a GTF file, I made a GTF-format file for these sequences, but the program didn't work. Is there something wrong with my GTF file?

stonyc commented 6 years ago

Yes, if you add your circRNA predictions into a standard GTF file it should work. Each prediction will require all nine fields as described here: https://www.gencodegenes.org/gencodeformat.html, paying particular attention to the descriptive fields in the last column:

  1. Each circRNA prediction will need a unique 'transcript_id' field in the last set of miscellaneous annotation data. This could be as simple as circRNA1, circRNA2, etc.
  2. Each prediction will also require a unique 'gene_id' field in the last set of annotation data, similar to above.
  3. The 'gene_type' (if Ensembl, use 'gene_biotype') field should be set to anything other than 'protein_coding', else the circRNAs will be incorporated into the set of protein-coding genes used to calculate the cutoffs. The annotated biotypes are listed here: https://www.gencodegenes.org/gencode_biotypes.html and for your purposes, you could try setting this field to 'sRNA' to start. Since the circRNA predictions will be uniquely named in points 1 and 2 above, you should be able to pull these predictions out if they are scored by SPECtre.
  4. The 'transcript_type' (if Ensembl, use 'transcript_biotype') field should be set to the same value you define in point 3 above.
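As a rough illustration of the four points above, here is a minimal Python sketch that writes one nine-column GTF line per prediction. The function name `make_gtf_line` and its parameters are hypothetical and not part of SPECtre. The feature name in column 3 is left to the caller, since this thread does not say which feature types SPECtre reads.

```python
def make_gtf_line(chrom, source, feature, start, end, strand,
                  gene_id, transcript_id, biotype="sRNA", ensembl_style=True):
    """Return one tab-separated, nine-column GTF line (hypothetical helper)."""
    if biotype == "protein_coding":
        # Point 3: predictions must not be labelled as protein-coding.
        raise ValueError("biotype must not be 'protein_coding' for predictions")

    # Ensembl-style files use *_biotype keys, GENCODE-style files use *_type keys.
    gene_key = "gene_biotype" if ensembl_style else "gene_type"
    transcript_key = "transcript_biotype" if ensembl_style else "transcript_type"

    attributes = [
        ("gene_id", gene_id),              # point 2: unique per prediction
        ("transcript_id", transcript_id),  # point 1: unique per prediction
        (gene_key, biotype),               # point 3
        (transcript_key, biotype),         # point 4: same value as point 3
    ]
    column9 = " ".join(f'{key} "{value}";' for key, value in attributes)

    # Columns 6 (score) and 8 (frame) are set to "." as placeholders.
    fields = [chrom, source, feature, str(start), str(end),
              ".", strand, ".", column9]
    return "\t".join(fields)


if __name__ == "__main__":
    # Example values only; "exon" as the feature name is an assumption.
    print(make_gtf_line("2", "araport11", "exon", 17138306, 17138573, "+",
                        gene_id="ciR1", transcript_id="circRNA1"))
```

The lines produced this way would then be appended to the full annotation rather than used alone, as described in the next paragraph.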

Your circRNA predictions should be added to a full GTF in order to properly score the circRNAs relative to the scoring cutoffs calculated based on the distribution of SPECtre scores for protein-coding genes. If you add the circRNA predictions as a standalone GTF with no protein-coding genes included, then the code as written will not run properly.

In testing, this has worked for me for scoring upstream open reading frames, so I don't see any reason why it shouldn't also work for circRNA predictions. Please let me know if you run into any trouble with this modification.

PSSUN commented 6 years ago

Dear author, thank you for taking the time to give me such a prompt and detailed answer! Sorry for the late reply; running the whole pipeline took me a long time.

I have made a GTF file of the predicted sequences in the format you described, and appended it to the full GTF file with the command: cat predicted.gtf >> At_tair10.gtf

When I checked the Cufflinks output before running SPECtre, the file 'isoforms.fpkm_tracking' contained the predicted sequences, and SPECtre then ran without any error. However, when I checked the SPECtre result file, none of the predicted sequences were in it. It seems that SPECtre did not score the circRNA regions.

Each line in my predicted.gtf has this format: 2 araport11 circRNA 17138306 17138573 . + . gene_id "ciR10"; transcript_id "At_ciR10"; exon_number "1"; gene_name "At_ciR10"; gene_source "araport11"; gene_biotype "sRNA"; transcript_source "araport11"; transcript_biotype "sRNA";
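A line like this can be sanity-checked before running the pipeline. The sketch below is not part of SPECtre; `check_gtf_line` and `parse_attributes` are hypothetical helpers that only test the points raised earlier in this thread (nine tab-separated fields, 'gene_id' and 'transcript_id' present, matching biotypes that are not 'protein_coding'). The line as pasted shows spaces, but the issue tracker may simply have collapsed the tabs, so a failure on pasted text does not prove the file itself is wrong.

```python
import re


def parse_attributes(column9):
    """Parse 'key "value";' pairs from the ninth GTF column into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)";', column9))


def check_gtf_line(line):
    """Return a list of problems found in one GTF line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]

    problems = []
    attrs = parse_attributes(fields[8])
    for key in ("gene_id", "transcript_id"):
        if key not in attrs:
            problems.append(f"missing attribute: {key}")

    # Accept either Ensembl-style or GENCODE-style biotype keys.
    gene_type = attrs.get("gene_biotype", attrs.get("gene_type"))
    transcript_type = attrs.get("transcript_biotype", attrs.get("transcript_type"))
    if gene_type is None:
        problems.append("missing gene_biotype / gene_type")
    elif gene_type == "protein_coding":
        problems.append("prediction is labelled protein_coding")
    if transcript_type != gene_type:
        problems.append("transcript biotype differs from gene biotype")
    return problems


if __name__ == "__main__":
    with open("predicted.gtf") as handle:  # file name taken from this thread
        for number, line in enumerate(handle, start=1):
            for problem in check_gtf_line(line):
                print(f"line {number}: {problem}")
```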

Have a nice day!

PSSUN commented 6 years ago

When I used predicted.gtf on its own for Cufflinks and SPECtre, I got this error:

Traceback (most recent call last):
  File "./spectre-master/SPECtre.py", line 1327, in <module>
    transcript_metrics, reference_read_distribution = calculate_transcript_scores(transcript_gtf, transcript_fpkms, float(args.min), asite_buffers, psite_buffers, orfscore_buffers, args.input, int(args.len), int(args.step), args.type, analyses, offsets, target_chroms, int(args.nt))
  File "./spectre-master/SPECtre.py", line 840, in calculate_transcript_scores
    transcripts, intervals = zip(*flatten(gtf).iteritems())
ValueError: need more than 0 values to unpack

This looks just like issue #6.
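For reference, the ValueError in the traceback is what Python raises when the `zip(*...)` idiom on line 840 is applied to an empty dictionary, which suggests that no transcripts were left to score at that point. The sketch below reproduces this in Python 3, where `.items()` replaces `.iteritems()` and the message reads "not enough values to unpack". The function `split_flattened` is a hypothetical stand-in, not SPECtre code, and why the dictionary ends up empty is not established in this thread.

```python
def split_flattened(flattened):
    """Split a {transcript: intervals} dict into two parallel tuples.

    Mirrors the `zip(*d.items())` idiom from the traceback, but raises a
    more descriptive error when the dict is empty.
    """
    if not flattened:
        # With no items, zip() yields nothing, so unpacking into two
        # names fails with the ValueError seen in the traceback.
        raise ValueError("flattened annotation is empty: no transcripts to score")
    transcripts, intervals = zip(*flattened.items())
    return transcripts, intervals


if __name__ == "__main__":
    # Illustrative data only.
    example = {"At_ciR10": [(17138306, 17138573)]}
    print(split_flattened(example))

    try:
        transcripts, intervals = zip(*{}.items())
    except ValueError as error:
        print("empty dict:", error)
```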