griffithlab / regtools

Integrate DNA-seq and RNA-seq data to identify mutations that are associated with regulatory effects on gene expression.
https://regtools.readthedocs.org
MIT License
120 stars 26 forks source link

Can't seem to find matching junctions in normal #62

Closed yang-yangfeng closed 7 years ago

yang-yangfeng commented 7 years ago

So here's one example of the thing I was talking about where you'll look at the sashimi plot for the tumor and see a reported junction and then look at the normal and see a junction that looks essentially the same. We've been sort of looking at the normal as a control and so we'd want to know exactly what we're comparing.

screen shot 2017-03-30 at 5 47 35 pm

So I ran junctions extract on the normal rna bam, which gets all the junctions it can infer from the alignment. I looked for the junction in the normal that looked the same on the sashimi plot.

Tumor: 1 93696466 93697982 JUNC00000015 (score: 6) Normal: 1 93696438 93698106 JUNC00016521 (score: 3)

So clearly close enough to be visually mistaken as the same junction on the sashimi plot, but I don't know if it's close enough that we would be comfortable using this as a comparator.

I wrote an r script to iterate through all of the 80 or so aberrant junctions reported for the hcc1395 tumor sample and check to see if they existed the list reported by junctions extract for the normal. With 0 laxness (looking for exact matches), I found no matches. If I allowed the start and stop positions to each be off by at most 50, I was able to find a couple matches.

Obviously, we shouldn't expect to find all or even most of our aberrant junctions in the normal sample, but we should expect to see at least some - and I think we need to make sure that the coordinates that are being reported by junctions extract/cis-splice-effects identify are what we want.

Indeed, I was actually able to find the reads in the tumor and the normal that corresponded to the visually identical junctions. It looks like the junctions actually ARE identical, according to the reads - the reference span is the same distance and they both line up with the junction track at the bottom. However, the coordinates are some how different.

screen shot 2017-03-30 at 7 10 26 pm
malachig commented 7 years ago

In general we want the comparator junction to be exactly the same.

However, if the junction coordinates for the junction reported by regtools do not make sense for the actual read alignments we are seeing, that sounds like a bug. Needs further investigation...

On Thu, Mar 30, 2017 at 7:23 PM, Yang-Yang Feng notifications@github.com wrote:

So here's one example of the thing I was talking about where you'll look at the sashimi plot for the tumor and see a reported junction and then look at the normal and see a junction that looks essentially the same. We've been sort of looking at the normal as a control and so we'd want to know exactly what we're comparing.

[image: screen shot 2017-03-30 at 5 47 35 pm] https://cloud.githubusercontent.com/assets/16638655/24531183/9f9714ee-157c-11e7-996f-d147f2c9bd70.png

So I ran junctions extract on the normal rna bam, which gets all the junctions it can infer from the alignment. I looked for the junction in the normal that looked the same on the sashimi plot.

Tumor: 1 93696466 93697982 JUNC00000015 (score: 6) Normal: 1 93696438 93698106 JUNC00016521 (score: 3)

So clearly close enough to be visually mistaken as the same junction on the sashimi plot, but I don't know if it's close enough that we would be comfortable using this as a comparator.

I wrote an r script to iterate through all of the 80 or so aberrant junctions reported for the hcc1395 tumor sample and check to see if they existed the list reported by junctions extract for the normal. With 0 laxness (looking for exact matches), I found no matches. If I allowed the start and stop positions to each be off by at most 50, I was able to find a couple matches.

Obviously, we shouldn't expect to find all or even most of our aberrant junctions in the normal sample, but we should expect to see at least some - and I think we need to make sure that the coordinates that are being reported by junctions extract/cis-splice-effects identify are what we want.

Indeed, I was actually able to find the reads in the tumor and the normal that corresponded to the visually identical junctions. It looks like the junctions actually ARE identical, according to the reads - the reference span is the same distance and they both line up with the junction track at the bottom. However, the coordinates are some how different.

[image: screen shot 2017-03-30 at 7 10 26 pm] https://cloud.githubusercontent.com/assets/16638655/24531380/5a75753e-157e-11e7-9206-c9b7d9f0c4d9.png

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/griffithlab/regtools/issues/62, or mute the thread https://github.com/notifications/unsubscribe-auth/AA0SmN3Ydtah33vxF8KWxPdx4ZNq9Qk-ks5rrEeegaJpZM4MvFKI .

yang-yangfeng commented 7 years ago

This is my mistake. The coordinates that junctions extract outputs are of the whole read (including the exonic overhangs), whereas cis-splice-effects identify/junctions annotate outputs the actual coordinates of the junction.

I bet if I run junctions annotate on the normal junctions extract bed I'll be able to find those junctions.

yang-yangfeng commented 7 years ago

So to be clear, cis-splice-effects identify/junctions annotate outputs the positions of the last base of the first exon and the first base of the second exon. This is why when we first loaded these junctions into IGV, we were getting the 1 base overhang into the second exon.

In the read the docs for this output, it says: start Junction start co-ordinate. [zero based format] end Junction end co-ordinate. [zero based format]

If I understand correctly, we ought to be subtracting 1 from the end coordinate that's currently being output?