griffithlab / regtools

Integrate DNA-seq and RNA-seq data to identify mutations that are associated with regulatory effects on gene expression.
https://regtools.readthedocs.org
MIT License
121 stars 26 forks source link

cis-splice-effects identify junction boundaries differ it TSV and BED output #159

Closed jje42 closed 2 years ago

jje42 commented 2 years ago

I'm using regtools v0.5.2, and I've run the command

regtools cis-splice-effects identify -o out.tsv -v out.vcf -j out.bed -s 1 in.vcf.gz in.bam Homo_sapiens.GRCh37.dna.primary_assembly.fa Homo_sapiens.GRCh37.87.gtf

but when I look at the junctions in the TSV and BED file, I notice that the start/ends differ. For example,

TSV

chrom  start   end     name          score  strand  splice_site  acceptors_skipped  exons_skipped  donors_skipped  anchor  known_donor  known_acceptor  known_junction  gene_names  gene_ids         transcripts                      variant_info
12     402334  404742  JUNC00000004  1      ?       CC-TC        0                  0              0               N       0            0               0               NA          NA               NA                               12:404740-404741
12     402335  404739  JUNC00000005  50     -       GT-AG        0                  0              0               DA      1            1               1               KDM5A       ENSG00000073614  ENST00000382815,ENST00000399788  12:404740-404741
12     404959  406207  JUNC00000006  117    -       GT-AG        0                  0              0               DA      1            1               1               KDM5A       ENSG00000073614  ENST00000382815,ENST00000399788  12:404740-404741
12     405505  446017  JUNC00000007  1      -       CT-AC        20                 15             21              N       0            0               0               NA          NA               NA                               12:404740-404741
12     406366  416112  JUNC00000008  24     -       GT-AG        0                  0              1               DA      1            1               1               KDM5A       ENSG00000073614  ENST00000382815,ENST00000399788  12:404740-404741

BED

12      402254  404761  JUNC00000004    1       ?       402254  404761  255,0,0 2       80,20   0,2487
12      402238  404833  JUNC00000005    50      -       402238  404833  255,0,0 2       97,95   0,2500
12      404866  406305  JUNC00000006    117     -       404866  406305  255,0,0 2       93,99   0,1340
12      405481  446091  JUNC00000007    1       -       405481  446091  255,0,0 2       24,75   0,40535
12      406266  416211  JUNC00000008    24      -       406266  416211  255,0,0 2       100,100 0,9845

The junctions have the same names and number of supporting reads, so I would expect them to have the same start/end positions. Am I misinterpreting what these files represent or is there a reason the start/ends to not match?

Thanks.

malachig commented 2 years ago

Yeah, I think this basically comes down to the relatively simple representation in the TSV (just the two edges of the junction - the last base of the exon on the left, the first base of the exon on the right) vs the BED12 record.

To compare BED12 format to that TSV you need to consider the coordinates in columns 1 (chrom), 2 (chromStart), 3 (chromEnd) AND the information in column 10 (blockCount), 11 (blockSizes), 12 (blockStarts).

Try loading your BED file in IGV. Then compare to the coordinates shown in the TSV manually. Does it make sense then?

On Tue, Mar 1, 2022 at 11:35 PM Jonathan Ellis @.***> wrote:

I'm using regtools v0.5.2, and I've run the command

regtools cis-splice-effects identify -o out.tsv -v out.vcf -j out.bed -s 1 in.vcf.gz in.bam Homo_sapiens.GRCh37.dna.primary_assembly.fa Homo_sapiens.GRCh37.87.gtf

but when I look at the junctions in the TSV and BED file, I notice that the start/ends differ. For example, TSV

chrom start end name score strand splice_site acceptors_skipped exons_skipped donors_skipped anchor known_donor known_acceptor known_junction gene_names gene_ids transcripts variant_info 12 402334 404742 JUNC00000004 1 ? CC-TC 0 0 0 N 0 0 0 NA NA NA 12:404740-404741 12 402335 404739 JUNC00000005 50 - GT-AG 0 0 0 DA 1 1 1 KDM5A ENSG00000073614 ENST00000382815,ENST00000399788 12:404740-404741 12 404959 406207 JUNC00000006 117 - GT-AG 0 0 0 DA 1 1 1 KDM5A ENSG00000073614 ENST00000382815,ENST00000399788 12:404740-404741 12 405505 446017 JUNC00000007 1 - CT-AC 20 15 21 N 0 0 0 NA NA NA 12:404740-404741 12 406366 416112 JUNC00000008 24 - GT-AG 0 0 1 DA 1 1 1 KDM5A ENSG00000073614 ENST00000382815,ENST00000399788 12:404740-404741

BED

12 402254 404761 JUNC00000004 1 ? 402254 404761 255,0,0 2 80,20 0,2487 12 402238 404833 JUNC00000005 50 - 402238 404833 255,0,0 2 97,95 0,2500 12 404866 406305 JUNC00000006 117 - 404866 406305 255,0,0 2 93,99 0,1340 12 405481 446091 JUNC00000007 1 - 405481 446091 255,0,0 2 24,75 0,40535 12 406266 416211 JUNC00000008 24 - 406266 416211 255,0,0 2 100,100 0,9845

The junctions have the same names and number of supporting reads, so I would expect them to have the same start/end positions. Am I misinterpreting what these files represent or is there a reason the start/ends to not match?

Thanks.

— Reply to this email directly, view it on GitHub https://github.com/griffithlab/regtools/issues/159, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGRFGGTO54KCEEADR6MY63U534YJANCNFSM5PWIV5JQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

jje42 commented 2 years ago

Yes, I see what you mean. Thanks for clarifying.