Open lparsons opened 8 years ago
Hi Lance,
Thank you for bringing this up. This happens because the GTFs for E. coli don't include UTR annotations for each open reading frame, which makes metagene think each mRNA ends at the edge of its corresponding ORF. Because the generate subprogram works by extending a window upstream and downstream of the start codon in transcript coordinates (e.g. to account for splicing in eukaryotes), it stops growing the window where it believes the transcript ends (here, at the ORF boundaries).
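To make the effect concrete, here is a minimal sketch of the clipping behavior described above. It does not use plastid, and the function name `clip_window` and its return values are illustrative assumptions rather than plastid's actual implementation. It only mimics the idea that a window cannot extend past the ends of a transcript.

```python
def clip_window(landmark, upstream, downstream, transcript_length):
    """Clip a window around `landmark` to the bounds of a transcript.

    All positions are 0-based transcript coordinates. Returns a tuple
    (start, end, offset), where `offset` is the number of requested
    upstream positions that could not be included in the window.
    """
    start = max(landmark - upstream, 0)
    end = min(landmark + downstream, transcript_length)
    offset = upstream - (landmark - start)
    return start, end, offset


# Transcript with a 30-nt annotated 5' UTR: the start codon is at position 30,
# so part of the requested upstream flank is available.
print(clip_window(landmark=30, upstream=50, downstream=100, transcript_length=1000))
# (0, 130, 20)

# Transcript annotated as a bare ORF: the start codon is at position 0,
# so no upstream flank exists at all.
print(clip_window(landmark=0, upstream=50, downstream=100, transcript_length=900))
# (0, 100, 50)
```

In the second case the offset equals the full requested upstream length, which would be consistent with an offset that always matches the upstream value when every transcript is a bare ORF.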
I haven't thought of a good, automated way to work around this in the metagene script, but I do welcome proposals.
In the meantime, one workaround would be to first create a GTF that includes a pseudotranscript of each coding region, containing a simulated 5' UTR of a given length upstream of each coding region, and then run the metagene generate program on that GTF.
For example:
    from plastid import *

    upstream_utr_length = 50  # change to whatever you like

    # read each ORF in the annotation as a transcript
    ORFS = GTF2_TranscriptAssembler(open("some_file.gtf"))

    with open("new_file.gtf", "w") as new_gtf:
        for my_orf in ORFS:
            span = my_orf.spanning_segment
            if my_orf.strand == "+":
                # 5' UTR lies at lower coordinates; don't run past the chromosome start
                new_start = max(span.start - upstream_utr_length, 0)
                new_region = GenomicSegment(span.chrom, new_start, span.end, span.strand)
            else:
                # on the minus strand, the 5' UTR lies at higher coordinates
                new_region = GenomicSegment(span.chrom, span.start, span.end + upstream_utr_length, span.strand)

            # copy metadata attributes from old ORF
            new_transcript = Transcript(new_region, **my_orf.attr)
            new_gtf.write(new_transcript.as_gtf())
With a few modifications you could make the code more precise, e.g. by trimming the 5' UTR of the pseudotranscript to make sure it doesn't overlap an upstream CDS.
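As a rough illustration of that trimming step, the sketch below computes how long a simulated 5' UTR can be before it runs into the nearest upstream ORF. It uses plain tuples instead of plastid objects, and the function name `trimmed_utr_length`, the tuple layout, and the choice to consider only ORFs on the same strand are all assumptions made for this example.

```python
def trimmed_utr_length(orf, neighbors, max_utr_length):
    """Return the longest simulated 5' UTR (at most `max_utr_length`) that
    does not overlap an upstream ORF on the same chromosome and strand.

    ORFs are (chrom, start, end, strand) tuples in 0-based, half-open
    coordinates. ORFs that overlap `orf` itself are ignored.
    """
    chrom, start, end, strand = orf
    length = max_utr_length
    for n_chrom, n_start, n_end, n_strand in neighbors:
        if (n_chrom, n_strand) != (chrom, strand):
            continue
        if strand == "+" and n_end <= start:
            # upstream neighbor lies at lower coordinates
            length = min(length, start - n_end)
        elif strand == "-" and n_start >= end:
            # on the minus strand, upstream neighbors lie at higher coordinates
            length = min(length, n_start - end)
    if strand == "+":
        # never extend past the start of the chromosome
        length = min(length, start)
    return max(length, 0)


orfs = [
    ("chrom1", 100, 400, "+"),
    ("chrom1", 420, 900, "+"),    # only 20 nt downstream of the previous ORF
    ("chrom1", 1000, 1300, "-"),
    ("chrom1", 1310, 1600, "-"),  # 10 nt upstream of the previous minus-strand ORF
]
print([trimmed_utr_length(orf, orfs, 50) for orf in orfs])
# [50, 20, 10, 50]
```

The returned length could then replace the fixed upstream_utr_length for each ORF in the workaround above.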
Hope this helps. Please let me know if you encounter further troubles, and/or have suggestions for how to handle this better.
Best, Josh
EDIT: fixed typo in code block above
Thanks so much for the quick reply and workaround. I think the code you posted is a decent workaround for the time being. (One small bug: the 4th argument to GenomicSegment should be span.strand and not span.chrom.)
Since this is really an issue of missing information, I think it's important to make the user aware of it. Perhaps an attempt to detect this situation (missing 5' UTR annotations) along with an appropriate warning would be helpful. You could even go so far as to allow the user to set a flag to incorporate the above workaround "automatically", or possibly just provide a helper script to simulate UTR regions that you suggest using in this case.
Great suggestions, and this is a pretty common situation. It also occurs with the yeast annotation: SGD maintains separate BED files of UTRs from a number of papers, but hasn't incorporated them into the gene models.
I just put warnings into the window generation function, so that if all 5' flanks are 0-length, the user is warned. I'd like something more general for arbitrary landmarks, but this will come later. I'll also think about a command-line switch, but will need a little time to work out exactly what the right behaviors would be. In the meantime, I will also put a note into the module documentation and online how-tos.
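For readers who want a similar check in their own scripts, a minimal sketch of the idea follows. It is not the code that went into plastid; the function name `warn_if_no_upstream_flanks` and its arguments are hypothetical, and it assumes the upstream flank length of each window has already been computed.

```python
import warnings


def warn_if_no_upstream_flanks(flank_lengths, requested_upstream):
    """Warn when an upstream flank was requested but every window lacks one.

    `flank_lengths` holds the number of upstream positions actually included
    in each window. Returns True if a warning was issued, otherwise False.
    """
    flank_lengths = list(flank_lengths)
    if requested_upstream > 0 and flank_lengths and all(n == 0 for n in flank_lengths):
        warnings.warn(
            "All upstream flanks are 0-length. The annotation may lack 5' UTRs; "
            "consider adding or simulating them.",
            UserWarning,
        )
        return True
    return False


# No warning: at least one window has an upstream flank
warn_if_no_upstream_flanks([0, 12, 50], requested_upstream=50)
```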
Thank you again for bringing this up, for writing the helper script, and for catching my typo.
Cheers, Josh
Hi! I face the exact same problem (I work with yeast from SGD). I simulated UTRs with the script you provided, but I can also find GFF annotations of 5' and 3' UTRs online. If I want to use them in the plastid pipeline, do I have to merge the SGD base GFF file and my UTR annotation files manually? Or is it possible to add them directly with a plastid option?
Thank you very much, cheers,
Julien
Hi @jlncrnt,
To use the 5' and 3' UTR data, you would need to create a new GFF or GTF file containing those, and the coding regions from SGD, in a single file. I did this multiple times when I was in graduate school. Typically, I pulled UTRs from two papers, whose data are curated at SGD:
- http://downloads.yeastgenome.org/published_datasets/Yassour_2009_PMID_19208812/
- http://downloads.yeastgenome.org/published_datasets/Nagalakshmi_2008_PMID_18451266/
Doing so required a number of steps. I can dig through my old notes, but I recall coordinates for one of these datasets being off-by-one, at least compared to the coding region annotations I had obtained from SGD at the time (~2013). Also, neither of these datasets is aware of 5' UTR introns, which are curated in SGD. So, a loose protocol for assembling these files looks something like:
- Check coordinates of UTRs, correct off-by-one errors, etc. You may also need to lift over these coordinates from the 2010 genome build to the current version (the READMEs for the datasets say they were last updated in 2010).
- Using the 5' UTR intron data from the SGD dataset, restore introns to the UTRs.
- For genes whose UTR is not defined in either of the above datasets, consider simulating them by setting them to the median length.
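Two of these steps, the off-by-one correction and the median-length fallback, can be sketched in a few lines of plain Python. The helper names, the dictionary layout, and the gene names below are all hypothetical, and the direction of the shift is an assumption that would need to be checked against the actual files. Liftover and intron restoration are not covered here.

```python
from statistics import median


def shift_utrs(utrs, shift):
    """Shift every (start, end) pair by `shift` to correct a coordinate-convention mismatch."""
    return {gene: (start + shift, end + shift) for gene, (start, end) in utrs.items()}


def fill_missing_utr_lengths(utr_lengths, all_genes):
    """Give genes without a measured UTR the median of the measured lengths."""
    fallback = int(median(utr_lengths.values()))
    return {gene: utr_lengths.get(gene, fallback) for gene in all_genes}


# Hypothetical measured 5' UTR coordinates, assumed to be off by +1
measured = {"geneA": (101, 141), "geneB": (501, 561), "geneC": (901, 1011)}
corrected = shift_utrs(measured, -1)

# geneD has no measured UTR, so it receives the median length
lengths = {gene: end - start for gene, (start, end) in corrected.items()}
print(fill_missing_utr_lengths(lengths, ["geneA", "geneB", "geneC", "geneD"]))
# {'geneA': 40, 'geneB': 60, 'geneC': 110, 'geneD': 60}
```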
If it would be useful, I'm happy to share the GTF2 I made, but it is worth noting that it was made a number of years ago, and so won't be up to date with more recent coding region annotations.
Cheers, Josh
A bit late from me; thank you for replying!
As a matter of fact, I'm also using the Nagalakshmi 2008 annotations, and it would probably be a good idea for SGD to provide a download interface from which we can fetch pre-merged GFF files. For my specific case, I've finally chosen to make very short virtual UTRs, because some Nagalakshmi regions span very wide portions of the genome that I try to avoid (namely mitochondrial regions).
I still use plastid along with wavClustR and manual counting to get as much perspective as possible on my data. So thank you for your work!
Regards,
Julien
You're welcome! I, too, would love it if SGD had a pre-merged file for download. I'm not sure why they haven't merged any UTR annotations into their gene models.
All the best, Josh
Hi Josh,
I am also facing this confusing problem. Could you share the GTF2 you made for yeast with me? I would really appreciate your help.
Best!
Cheng
I am running into issues attempting to generate ROIs for a p-site offset analysis on E. coli. The generated ROIs do not include any upstream region, and the "alignment_offset" seems to always be set to the value of the --upstream parameter. I've used various forms of annotation, but here is an example using the RefSeq GTF files from UCSC: