Closed dewyman closed 5 years ago
TALON results (which don't look right to me):
grep "c1417/f31p12/2759" test_files/sorted.GM12878_chr1_canonicalOnly.sam > test_files/c1417-f31p12-2759.sam
Run TALON
python talon.py --f test_files/c1417-f31p12-2759.sam -g test_files/gencode.v24.annotation.chr1.gtf
[FIXED] I noticed that the problem is another GTF reading bug. For transcripts on the minus strand, the exons are listed (and numbered) in their 5' to 3' orientation rather than being listed with respect to the forward strand. This is creating problems in my exon string process.
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
grep "result" test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
TALON_file=test_files/permissive_3_5_transcripts.txt
comm -23 $MatchAnnot_file $TALON_file | wc -l
The result was 13 lines (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e), which means that TALON did not quite cover all of the cases that MatchAnnot did. Here are the IDs: c11484/f2p30/3345 c13754/f2p1/1380 c14744/f1p2/3017 c14773/f1p0/3486 c18234/f1p0/3056 c19214/f1p1/3120 c22984/f1p5/3378 c23761/f1p0/2920 c26076/f2p16/2864 c26372/f1p0/3232 c29587/f4p2/3479 c31433/f1p3/3125 c33225/f1p0/3407
Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MatchAnnot_file=test_files/MatchAnnot_direct/score_5_transcripts.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MatchAnnot_file $TALON_file | wc -l
The result is 15 lines that are in MatchAnnot but not TALON, and 4 lines that are in TALON but not MatchAnnot. I should look into these cases further. c11484/f2p30/3345 c13754/f2p1/1380 c14744/f1p2/3017 c14773/f1p0/3486 c18234/f1p0/3056 c19214/f1p1/3120 c2229/f33p31/3156 c22984/f1p5/3378 c23761/f1p0/2920 c26076/f2p16/2864 c26372/f1p0/3232 c29587/f4p2/3479 c31433/f1p3/3125 c33225/f1p0/3407 c34432/f1p6/3188
Let's start by investigating the very first one, c11484/f2p30/3345. The MatchAnnot entry for it is:
result: c11484/f2p30/3345 ALDH4A1 ALDH4A1-001 ex: 15 sc: 5 5-3: -2 10
When I print debugging messages for this transcript in TALON, I find out that it was assigned to a different gene, RP13-279N23.2. This strikes me as odd and probably wrong. When I dig deeper, I find that c11484/f2p30/3345 overlapped the following genes, with the amount listed in parantheses. gene_matches:
chr1 HAVANA gene 18871430 18902781 . - .gene_id "ENSG00000159423.16"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ALDH4A1"; level 1; tag "ncRNA_host"; havana_gene "OTTHUMG00000002443.6";
chr1 HAVANA gene 18849273 18921121 . - .gene_id "ENSG00000255275.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP13-279N23.2"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000167131.3";
This suggests to me that in the rare event of a transcript having the same amount of overlap with 2 genes, it is not sufficient to simply return one of them as I am doing. I checked on the first three cases, on the list and they all come from the same tied gene situation. Clearly this is a problem. I changed the gene fetching function to return more than one gene in case of ties, and repeated the comparison. This time, I got the following 6 matches in MatchAnnot but not in TALON (a subset of the last list):
c14773/f1p0/3486: MIR4425
c18234/f1p0/3056: NAP1L4P1
c22984/f1p5/3378: MIR34A
c23761/f1p0/2920: RP11-61J19.5
c26372/f1p0/3232: ZNF670
c33225/f1p0/3407: PHTF1
To investigate, I start again with the first transcript, c14773/f1p0/3486.
MatchAnnot entry:
result: c14773/f1p0/3486 MIR4425 MIR4425-201 ex: 1 sc: 5 rev 5-3: 5 -3402
Notice that this is a microRNA gene. How long is the transcript? Did we perhaps filter it out with our 200 bp requirement? Checking the length, this does NOT appear to be the case.
grep "c14773/f1p0/3486" test_files/sorted.GM12878_chr1_canonicalOnly.sam | awk '{print length($10)}' 3486
Looking at this area in the UCSC genome browser, I noticed something really interesting that points to a problem with MatchAnnot. MIR4425 is the only gene in the neighborhood here, and it intersects with c14773/f1p0/3486. Since there is only one "exon", MatchAnnot classifies this match as a score 5 because "only the 3' and 5' ends are different", which is a bad move. The correct move would be to create a novel gene.
I found a second problem with MatchAnnot when I examined example c23761/f1p0/2920. MatchAnnot assign the transcript to RP11-61J19.5, but the gene and the transcript are on opposite strands. So I'm comfortable with the fact that TALON does not find a gene match for c23761/f1p0/2920.
Total transcripts processed: 1881 Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 964
There are only three discrepancies.
MatchAnnot assigns this single-exon minus strand transcript to MIR4425-001, and TALON finds no transcript match. The reason for this is that MIR4425 is on the plus strand. This discrepancy is acceptable.
MatchAnnot assigns this single-exon plus strand transcript to RP11-61J19.5-001, while TALON reports no match. Again, the reason is a strand discrepancy, so this is ok.
This is a strand discrepancy that was documented in issue #19. This is ok.
There were zero cases like this.
Note: I have seen cases so far where TALON returns more than one transcript match since I haven't implemented a tiebreaker yet. I won't worry too much about this for now.
Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort -n -k1,1 > $TALON_file
Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l
c10364/f4p17/3199 TOR1AIP1-015: TALON assigned TOR1AIP1-015,TOR1AIP1-004
c12687/f3p2/3317 RBM15-004: TALON assigned RBM15-001,RBM15-201,RBM15-004
c13055/f1p27/3190 LRRC41-201: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c14773/f1p0/3486 MIR4425-201 This was expected
c15401/f82p47/2854 LRRC41-006: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c15529/f16p41/2957 LRRC41-201: multiple assignment again
c17240/f4p6/2673 TCEB3-003: multiple assignment
c17998/f1p0/2984 HIST2H2AA4-001: multiple assignment
c17998/f1p0/2984 HIST2H3C-001: multiple assignment
c2355/f16p17/3301 TOR1AIP1-015: multiple assignment
c23761/f1p0/2920 RP11-61J19.5-001 This was expected
c24723/f2p5/4046 ATPAF1-001: multiple assignment
c26869/f1p8/3622 RSBN1-004: multiple assignment
c30828/f2p12/2747 LRRC41-006: multiple assignment
c33225/f1p0/3407 PHTF1-004 This was expected
c36587/f1p0/3228 JMJD4-006: multiple assignment
c8235/f3p8/2930 RSBN1-004: multiple assignment
All of these cases (except for the three enumerated previously) were situations where TALON matched the query to more than one transcript, including the MatchAnnot answer within this set. So that is fine for now. Once I write a tiebreaker for TALON, these discrepancies should disappear.
When I check this, I find only multi-match cases from the previous section. That's a pass
Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf --d test --o test
grep -v "dataset" test | grep "known" | awk '{print $2,$10}' | sort -n -k1,1 > $TALON_file
Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l
c14773/f1p0/3486 MIR4425-201
c17998/f1p0/2984 HIST2H2AA4-001
c17998/f1p0/2984 HIST2H3C-001
c23761/f1p0/2920 RP11-61J19.5-001
c33225/f1p0/3407 PHTF1-004
c17998/f1p0/2984 comes up twice here. TALON assigns it to HIST2H3A-001. HIST2H3A-001 and HIST2H2AA4-001 both seem like reasonable choices, and the former minimizes the total 3' and 5' difference, so I'm not going to worry too much about this.
Only one: c17998/f1p0/2984 HIST2H3A-001
Tofu + STAR + MatchAnnot commands:
TALON
Tofu + STAR + MatchAnnot Results:
TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):