griffithlab / regtools

Integrate DNA-seq and RNA-seq data to identify mutations that are associated with regulatory effects on gene expression.
https://regtools.readthedocs.org
MIT License
121 stars 26 forks source link

Junction extract: different junction start/end sites detected using short and long reads #165

Open ErminZ opened 2 years ago

ErminZ commented 2 years ago

Hello,

I have a question about junction extract, different junctions are identified between Regtools and NanoSplicer. Would you help me to understand why? Thank you!

The input file is the same BAM file generated by minimap2 for Nanopore data.

Regtools output file ref 1180 3212 JUNC00000028 137 ? 1180 3212 255,0,0 2 99,51 0,1981 ref 1180 3245 JUNC00000017 658 ? 1180 3245 255,0,0 2 99,87 0,1978 ref 1180 6541 JUNC00000030 377 ? 1180 6541 255,0,0 2 99,71 0,5290 ref 1180 6608 JUNC00000020 2705 ? 1180 6608 255,0,0 2 99,141 0,5287 ref 1198 4144 JUNC00000033 66 ? 1198 4144 255,0,0 2 81,46 0,2900 ref 1207 5330 JUNC00000029 89 ? 1207 5330 255,0,0 2 72,134 0,3989 ref 1241 5191 JUNC00000117 1 ? 1241 5191 255,0,0 2 38,15 0,3935 ref 1252 6485 JUNC00000094 1 ? 1252 6485 255,0,0 2 18,18 0,5215 ref 1260 1433 JUNC00000042 1 ? 1260 1433 255,0,0 2 19,24 0,149 ref 1267 1710 JUNC00000095 2 ? 1267 1710 255,0,0 2 12,29 0,414 ref 1977 2486 JUNC00000163 3 ? 1977 2486 255,0,0 2 26,15 0,494 ref 2009 2518 JUNC00000168 3

From NanoSplicer (1279, 6467) | 2681 (number of reads supporting the junction) (1279, 3158) | 650 (1279, 6470) | 372 (1283, 6467) | 78

I found a read in this BAM file as an example: 5b273de4-fcdd-4c4a-91a3-0e6edd27e2cf 0 Ad-ED90A1.4.1.6_CMV-Wuhan_Stabilized_S-BGH-CMV-Luc 1178 60 32=1D37=1D31=5188N71=1D11=1D52=2D15=1I17=1I1=1X61=1X14=2D3=1I30=1I6=1I5=1D23=2I42=4I5=2D45=1D17=1X15=1D12=1D15=6S * 0 0

Based on the CIGAR, it should be a (1279, 6467). However, no junction (1279, 6467) in Regtools output junc file, is shown above. Would you help to point out what is causing this problem?

The Illumina data from the same sample also gives different junctions compared to Nanopore sequencing. That's why I want to understand why. ref 1154 6614 JUNC00000057 4515 + 1154 6614 255,0,0 2 125,147 0,5313 ref 1166 3305 JUNC00000228 932 + 1166 3305 255,0,0 2 113,147 0,1992 ref 1177 6617 JUNC00000232 4481 + 1177 6617 255,0,0 2 102,147 0,5293 ref 1178 3308 JUNC00000083 256 + 1178 3308 255,0,0 2 101,147 0,1983 ref 1178 5343 JUNC00000234 226 + 1178 5343 255,0,0 2 101,147 0,4018 ref 1180 1803 JUNC00000237 13 + 1180 1803 255,0,0 2 99,122 0,501 ref 1180 4245 JUNC00000236 175 + 1180 4245 255,0,0 2 99,147 0,2918 ref 1180 5210 JUNC00000240 469 + 1180 5210 255,0,0 2 99,147 0,3883 ref 1182 6608 JUNC00000243 126 + 1182 6608 255,0,0 2 81,141 0,5285

Thank you in advance!

Best, Ermin

kcotto commented 1 year ago

I think this issue is related to how different tools represent junctions differently. For the Regtools output file you're referencing, which command did you run?

I'm pretty sure it's coming from junctions extract, which encodes some visualization information into the bed file but if you want to match coordinates to NanoSplicer, you could use the information to get there.

Alternatively, if you run junctions annotate, I think you'll get values similar to NanoSplicer.