Feature request: create / simulate UTRs in `metagene generate` when annotations do not include them. (formerly: Unable to generate proper cds_start ROIs using `metagene generate`)

lparsons commented 8 years ago

I am running into issues attempting to generate ROIs for a p-site offset analysis on e. coli. The generated ROIs do not include any upstream region, and the "alignmnet_offset" seems to always be set to the value of the --upstream parameter. I've used various forms of annotation, but here is an example using the RefSeq GTF files from UCSC:

region_id   window_size region  masked  alignment_offset    zero_point
b0001   100 chr:189-239(+)  na  50  50
b0002   100 chr:336-386(+)  na  50  50

joshuagryphon commented 8 years ago

Hi Lance,

Thank you for bringing this up. This is due to the fact that the GTFs for E. coli don't include UTR annotations for each open reading frame, which makes metagene think each mRNA ends at the edge of its corresponding ORF. Because the generate subprogram works by extending a window upstream and downstream of the start codon, in transcript coordinates (e.g. to account for splicing in the case of eukaryotes), it stops growing the window when it believes the transcript ends (e.g. at ORF boundaries).

I haven't thought of a good, automated way to work around this in the metagene script, but I do welcome proposals.

In the mean time, one workaround would be to first create a GTF that includes a pseudotranscript of each coding region, containing a simulated 5' UTR of a given length upstream of each coding region, and then run the metagene generate program on that GTF.

For example:

from plastid import *

new_gtf = open("new_file.gtf","w")
upstream_utr_length = 50 # change to whatever you like

ORFS = GTF2_TranscriptAssembler(open("some_file.gtf"))
for my_orf in ORFS:
    span = my_orf.spanning_segment
    if my_orf.strand == "+":
        new_region = GenomicSegment(span.chrom,span.start - upstream_utr_length,span.end,span.strand)
    else:
        new_region = GenomicSegment(span.chrom,span.start,span.end+upstream_utr_length,span.strand)  
    new_transcript = Transcript(new_region,**my_orf.attr) # copy metadata attributes from old ORF
    new_gtf.write(new_transcript.as_gtf())

new_gtf.close()

With a few modifications you could make the code more precise, e.g. by trimming the 5' UTR of the pseudotranscript to make sure it doesn't overlap an upstream CDS.

Hope this helps. Please let me know if you encounter further troubles, and/or have suggestions for how to handle this better.

Best, Josh

EDIT: fixed typo in code block above

lparsons commented 8 years ago

Thanks so much for the quick reply and workaround. I think the code you posted is a decent workaround for the time being. (One small bug, the 4th argument to GenomicSegment should be span.strand and not span.chrom).

Since this is really an issue of missing information, I think it's important to make the user aware of it. Perhaps an attempt to detect this situation (missing 5' UTR annotations) along with an appropriate warning would be helpful. You could even go so far as to allow the user to set a flag to incorporate the above workaround "automatically" or possibly just have a helper script to simulate utr regions that you suggest using in this case.

joshuagryphon commented 8 years ago

Great suggestions- and this is a pretty common situation. It also occurs with the yeast annotation- SGD maintains separate BED files of UTRs from a number of papers, but hasn't incorporated them into the gene models.

I just put warnings into the window generation function, so that if all 5' flanks are 0-length, the user is warned. I'd like something more general for arbitrary landmarks, but this will come later. I'll also think about a command-line switch but will need a little time to think exactly about what the right behaviors would be. In the mean time, I will also put a note into the module documentation and online how-tos.

Thank you again for bringing this up, for writing the helper script, and for catching my typo.

Cheers, Josh

jlncrnt commented 7 years ago

Hi ! I face the exact same problem (I work with yeast from SGD). I simulated UTR with the script you provided but I also can find online GFF annotations of 5UTR and 3UTR. If I want to use them in the plastid pipeline, do I have to merge SGD base GFF file and my UTR annotation files manually ? Or is it possible to add them directly with a plastid option ?

Thank you very much, cheers,

Julien

joshuagryphon commented 7 years ago

Hi @jlncrnt ,

To use the 5' and 3' UTR data, you would need to create a new GFF or GTF file containing those, and the coding regions from SGD, in a single file. I did this multiple times when I was in graduate school- typically, I pulled UTRs from two papers, whose data are curated at SGD:

Doing so required a number of steps. I can dig through my old notes, but I recall coordinates for one of these datasets being off-by-one, at least compared to the coding region annotations I had obtained from SGD at the time (~2013). Also, neither of these datasets is aware of 5' UTR introns, which are curated in SGD. So, a loose protocol for assembling these files looks something like:

Check coordinates of UTRs, correct off-by-one errors, et c. Also, may need to lift over thesee coordinates from 2010 genome build to current version (README for datasets say they were last updated in 2010)
Using 5' UTR intron data from SGD dataset, restore introns to the UTRs
For genes whose UTR is not defined in either of the above datasets, consider simulating them by setting to median length

If it would be useful, I'm happy to share the GTF2 I made, but it is worth noting that it was made a number of years ago, and so won't be up to date with more recent coding region annotations.

Cheers, Josh

jlncrnt commented 7 years ago

A bit late from me; thank you for replying !

As a matter of fact, I'm also using the Nagalakshmi 2008 annotations and it would probably be a good idea from SGD to provide a download interface in which we can fetch pre-merged GFF files. For my specific case, I've finally chosen to make very short virtual UTRs because some Nagalakshmi regions are spanning very wide portions of genome that I try to avoid (namely mitochondrial regions).

I still use plastid along with wavClustR and manual counting to get maximum of perspective on my data. So thank you for your work !

Regards,

Julien

joshuagryphon commented 7 years ago

You're welcome! I, too, would love it if SGD had a pre-merged file for download. I'm not sure why they haven't merged any UTR annotations into their gene models.

All the best, Josh

Wang497 commented 5 years ago

Hi @jlncrnt ,

To use the 5' and 3' UTR data, you would need to create a new GFF or GTF file containing those, and the coding regions from SGD, in a single file. I did this multiple times when I was in graduate school- typically, I pulled UTRs from two papers, whose data are curated at SGD:

http://downloads.yeastgenome.org/published_datasets/Yassour_2009_PMID_19208812/

http://downloads.yeastgenome.org/published_datasets/Nagalakshmi_2008_PMID_18451266/

Doing so required a number of steps. I can dig through my old notes, but I recall coordinates for one of these datasets being off-by-one, at least compared to the coding region annotations I had obtained from SGD at the time (~2013). Also, neither of these datasets is aware of 5' UTR introns, which are curated in SGD. So, a loose protocol for assembling these files looks something like:

Check coordinates of UTRs, correct off-by-one errors, et c. Also, may need to lift over thesee coordinates from 2010 genome build to current version (README for datasets say they were last updated in 2010)

Using 5' UTR intron data from SGD dataset, restore introns to the UTRs

For genes whose UTR is not defined in either of the above datasets, consider simulating them by setting to median length

If it would be useful, I'm happy to share the GTF2 I made, but it is worth noting that it was made a number of years ago, and so won't be up to date with more recent coding region annotations.

Cheers, Josh

Hi Josh, I am also facing such confusing problem. Could you share me the GTF2 you made for yeast. I am really appreciate for your help.
Best! Cheng

joshuagryphon / plastid

Feature request: create / simulate UTRs in `metagene generate` when annotations do not include them. (formerly: Unable to generate proper cds_start ROIs using `metagene generate`) #3