mortazavilab / TALON

Technology agnostic long read analysis pipeline for transcriptomes
MIT License

Exon-based comparison tool #48

Closed gm-nyc closed 4 years ago

gm-nyc commented 4 years ago

Hi, I noticed that there was mention of an exon-based comparison tool but cannot find that option in the talon commands. I am trying to quantify differences in 3' and 5' exon boundaries compared with known isoforms and this would be extremely helpful. Thank you!

dewyman commented 4 years ago

Hi! So to clarify, are you attempting to look at cases where you have novel splice junctions in a known gene, and then see how far away they are from known junctions? I don't think we have an existing formal utility that outputs the splice junctions, but it wouldn't be too difficult to make one!

gm-nyc commented 4 years ago

Yes! I am trying to quantify the distance from the reference splice junctions to the novel junctions I'm seeing in my samples, which have splicing aberrations. The mis-splicing can be transcriptome-wide so I am trying to generate an exon-based matrix for my cells. Does that make sense?
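As a rough illustration of the distance calculation described above (not a TALON feature), the sketch below finds the signed offset from each novel splice site to the nearest reference site. The position lists are made-up values, and the function name `nearest_distance` is hypothetical; it assumes the sites have already been grouped by chromosome, strand, and site type (donor or acceptor).

```python
from bisect import bisect_left


def nearest_distance(position, sorted_reference):
    """Return the signed distance from position to the closest reference position."""
    if not sorted_reference:
        raise ValueError("reference list is empty")
    i = bisect_left(sorted_reference, position)
    # Only the neighbours on either side of the insertion point can be closest
    candidates = sorted_reference[max(i - 1, 0):i + 1]
    closest = min(candidates, key=lambda ref: abs(position - ref))
    return position - closest


# Hypothetical donor-site positions on one chromosome and strand
reference_donors = sorted([1000, 2500, 4100])
novel_donors = [1004, 2488, 4100]
distances = [nearest_distance(p, reference_donors) for p in novel_donors]
print(distances)
```

A distance of 0 means the site matches the reference exactly; small non-zero offsets are the candidates for alternate 5' or 3' splice site usage.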

dewyman commented 4 years ago

Yes, absolutely! My colleague and I have put together a utility to extract splice junctions/exon positions as well as the transcripts that contain them. I'm going to run some tests on it and finalize the details, and then once I'm comfortable all is going as intended, I'll let you know so you can try it out!

iam2b commented 4 years ago

Hi, I am very happy to find this tool for my analysis. I tried a first run today and hit a problem. The error message was "SAM transcript xxx lacks an MD tag". My samples were direct RNA Nanopore reads mapped with Minimap2. By the way, will you develop a tool like MISO or rMATS to help detect changes in alternative splicing?

dewyman commented 4 years ago

Hi iam2b, You should be able to fix this issue by running Minimap2 with the --MD flag (see issue #45). Currently we are not in the business of developing our own downstream alt splicing tool, but you might consider trying this one https://bioconductor.org/packages/release/bioc/html/IsoformSwitchAnalyzeR.html. The developer has added support for TALON abundance files.
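Before rerunning the aligner, it can help to confirm which records actually lack the tag. The sketch below is a minimal, standalone check on SAM text, assuming the standard SAM layout (11 mandatory columns, optional tags from column 12, MD stored as `MD:Z:`). The function name `reads_missing_md` and the example records are made up for illustration; this is not how TALON itself performs the check.

```python
def reads_missing_md(sam_lines):
    """Yield the query name of each alignment record without an MD tag."""
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 of a SAM record
        if not any(tag.startswith("MD:Z:") for tag in fields[11:]):
            yield fields[0]


# Made-up records: read1 carries an MD tag, read2 does not
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0\tMD:Z:10",
    "read2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0",
]
missing = list(reads_missing_md(example))
print(missing)
```

For a real file, the same function can be fed an open file handle instead of the example list.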

iam2b commented 4 years ago

Thank you very much. I have solved this problem. Merry Christmas!

gm-nyc commented 4 years ago

Hi dewyman,

I wanted to clarify my question a little. The reason I was asking for the distance from the canonical splice junction is that I am trying to identify (and quantify) alternate 3' and 5' splice site usage and thought that the positional information for each junction would be useful since it could be compared with the reference. Thanks for your help and any thoughts/suggestions would be welcome! Hope you're having a good new year.

dewyman commented 4 years ago

Hi! Don't worry, your question makes total sense. We've been working on a utility to help address your question. It's technically complete and passed our tests, but is running slowly so we were hoping to do a bit more work on it to make it run faster. In the meantime though, you're welcome to try it out:

usage: talon_get_sjs [-h] [--gtf GTF] [--db DB] [--ref REF_GTF] [--mode MODE]
                     [--outprefix OUTPREFIX]

Extracts the locations, novelty, and transcript assignments of exons/introns
in a TALON database or GTF file. All positions are 1-based.

optional arguments:
  -h, --help            show this help message and exit
  --gtf GTF             TALON GTF file from which to extract exons/introns
  --db DB               TALON database from which to extract exons/introns
  --ref REF_GTF         GTF reference file (ie GENCODE). Will be used to label
                        novelty.
  --mode MODE           Choices are 'intron' or 'exon' (default is 'intron').
                        Determines whether to include introns or exons in the
                        output
  --outprefix OUTPREFIX
                        Prefix for output file

As a side note, when you run this script in 'intron' mode, the start/end positions currently include the exon base that flanks the intron on each side.
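If you need intron-only coordinates, the side note above implies trimming one base from each end of the reported interval. A minimal sketch, assuming 1-based inclusive positions as stated in the help text; the function name `intron_only_coords` is hypothetical, and the sort is there only because I am not assuming which of start/end is larger on the minus strand:

```python
def intron_only_coords(start, end):
    """Drop one flanking exon base from each side of a reported interval."""
    left, right = sorted((start, end))
    if right - left < 2:
        raise ValueError("interval too short to contain an intron")
    return left + 1, right - 1


# Made-up interval: the flanking exon bases are at 1000 and 2001
coords = intron_only_coords(1000, 2001)
print(coords)
```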

Another approach you might try for extracting splice junctions from a TALON GTF file would be to use the TranscriptClean utility described here. Outputs from this script follow the STAR splice junction output format, which is described in the STAR manual (section 4.4) here.

I hope this helps, but don't hesitate to reach out if you have more questions! Best, Dana

gm-nyc commented 4 years ago

Hi Dana,

Thanks for your help. I am trying to run this script and it's either extremely slow or getting stuck. I subsetted my gtf by chromosome and took the smallest one (chrM in my case, with 48 total lines in the gtf) and the script is still running. Is that the expected speed or do you think there is another issue?

here is the code I'm running:

~/talon/talon-4.4.2/python/bin/talon_get_sjs --gtf ${file} --ref ~/gencode.v31.annotation.gtf --mode intron --outprefix intron

dewyman commented 4 years ago

Thanks for letting us know. We'll look into it some more.

dewyman commented 4 years ago

It's taking so long with your current command because your --ref file is the entire annotation. If you want to run just chrM, consider subsetting the reference GTF as well.
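One way to subset the reference is to keep only the comment lines and the records whose first column matches the chromosome. This is a minimal sketch, not part of TALON; the function name `subset_gtf` and the example records are made up, and the output file name in the comment is an assumption.

```python
def subset_gtf(lines, chromosome):
    """Yield GTF comment lines plus the records on one chromosome."""
    for line in lines:
        # The first tab-separated column of a GTF record is the sequence name
        if line.startswith("#") or line.split("\t", 1)[0] == chromosome:
            yield line


# Made-up records standing in for a real annotation file
example = [
    "##description: example annotation\n",
    "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id \"G1\";\n",
    "chrM\tENSEMBL\tgene\t3307\t4262\t.\t+\t.\tgene_id \"G2\";\n",
]
kept = list(subset_gtf(example, "chrM"))
print("".join(kept), end="")

# With real files (paths are assumptions):
# with open("gencode.v31.annotation.gtf") as src, open("gencode.v31.chrM.gtf", "w") as dst:
#     dst.writelines(subset_gtf(src, "chrM"))
```

The resulting file can then be passed to --ref in place of the full annotation.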

fairliereese commented 4 years ago

Hey, we just fixed how long things were taking. It should run MUCH faster now, and you should be able to do so with full gtfs. Let us know if it's working for you!