tony-kuo / eagle

GNU General Public License v3.0
17 stars 5 forks source link

0 Successfully assigned alignments from featureCounts #8

Closed chongjing closed 3 months ago

chongjing commented 3 months ago

https://github.com/tony-kuo/eagle/blob/d6fb2e97be8515e2f7057a7d5f613bab65848d77/scripts/octoploid_analysis.sh#L162

Hi, @tony-kuo

After running your octoploid analysis pipeline, I got an issue at the very last step: I got 0% of reads successfully assigned to specific subgenome.

I realize this eagle/$F.chrA.ref.bam file was generated from previous eagle-rc --refonly --readlist -a chrA/$F.refsort.bam -o eagle/$F.chrA eagle/$F.ref.chrA.list step, and it seems the reference of eagle/$F.chrA.ref.bam is the transcriptome of chrA. However, in your featureCounts code featureCounts -T 8 -t exon -g transcript_id -a refseq.chrA.gtf -o eagle/$F.chrA.counts.txt eagle/$F.chrA.ref.bam, the reference of the annotation file refseq.chrA.gtf is genome of chrA. For example, header of eagle/$F.chrA.ref.bam looks like:

@SQ     SN:g1.t1        LN:615
@SQ     SN:g1.t2        LN:579
@SQ     SN:g2.t1        LN:1566
@SQ     SN:g3.t1        LN:879
@SQ     SN:g4.t1        LN:213
@SQ     SN:g5.t1        LN:435
@SQ     SN:g6.t1        LN:156
@SQ     SN:g7.t1        LN:153
@SQ     SN:g8.t1        LN:924

while header of refseq.chrA.gtf looks like:

ChrA_1       AUGUSTUS        exon    57983   58351   .       +       .       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        exon    58464   58490   .       +       .       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        exon    58591   58663   .       +       .       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        exon    58751   58896   .       +       .       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        CDS     57983   58351   .       +       0       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        CDS     58464   58490   .       +       0       transcript_id "g1.t1"; gene_id "g1";
ChrA_1       AUGUSTUS        CDS     58591   58663   .       +       0       transcript_id "g1.t1"; gene_id "g1";

My question is, does that mean the eagle/$F.chrA.ref.bam and the refseq.chrA.gtf use different reference? is this the reason why I got nothing assigned to subgenome? How to solve this? Please let me know if you need more information. Thanks

Chongjing

tony-kuo commented 3 months ago

The reference look to be the same.

Can you confirm that the BAM file has reads? What does the output in $F.ref.chrA.list look like?

Unfortunately the octoploid analysis was never extensively tested. I will do my best to help you through your analysis though. We might have to trace backwards to see where reads are lost.

chongjing commented 3 months ago

Thanks so much for quick reply.

The reference look to be the same.

I would say they are not the same, as the reads are aligned to chr[A|B|C|D].exon.fa which are basically genes. But the refseq.chrA.gtf is annotation for whole genome.

Can you confirm that the BAM file has reads? What does the output in $F.ref.chrA.list look like?

Unfortunately the octoploid analysis was never extensively tested. I will do my best to help you through your analysis though. We might have to trace backwards to see where reads are lost.

It seems the BAM file is fine:

V300110184L2C006R0410597611     129     g2.t1   162     0       128M1D22M       g42147.t1       140     0       AGGGGAGACTACTCCTCCGTCGTCGCGGTTGTCGGAGGGCTCGTCCGGTCTTAGCAGGTCGCTGTCGCGGTCCATTCTCTCGCCGAAGGTGTCGACTGCGTATATATCGTCTAGGGCTGCTACGATTGCATGGCTTTCTCCTCTTCTTCC   FFEFFDFEFFFFFFFFFFFFFFAFFFFF?EFFCFF8FEFEFFFFFFFCEFFEFFDFFFDFFFFFDFEEFFFFFFDFFFFFFFFFFFFFFFFEFF9FFFDFFFBFFFFFFFEDFFFFFFFFFF@F<FFFEFFFFFFFEFFDFDEFCEFFFF   AS:i:-63        XN:i:0  XM:i:11 XO:i:1  XG:i:1  NM:i:12 MD:Z:128^C2G2A1G0G1A0T0G3G0G0G1G1       YT:Z:UP
V300110184L2C003R0130387205     153     g3.t1   6       42      150M    =       6       0       TTTTGTTGGGGATATCGATGCGGTGGTGCCGGAGCTAATTCTTACAGTCCATGACGGGAAGGGTGTATATGTGCAAATTGATGATTACTCTGGATCACAAGAGCCGTCTTCAACTTGTTCTTGTTGGAAAAGATGGAACGTCTGGCGGTG   FGFGFGFGGFFFFFFGFGFFFGFGFGFEGFGGGFFFFGFFFGGFGGGFFFFFGFGGGFFFGGFDGGFGFFFGFFFFGFFG@FFFGEFGFGGGFFFEFFFFFGFFFGFFFFFGFGFGFFFFFFF>FGGFFFFFFFGFFFFFFFGGFGF>GG   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YT:Z:UP
V300110184L2C002R0530103382     153     g3.t1   8       42      150M    =       8       0       TTGTTGGGGATATCGATGCGGTGGTGCCGGAGCTAATTCTTACAGTCCATGACGGGAAGGGTGTATATGTGCAAATTGATGATTACTCTGGATCACAAGAGCCGTCTTCAACTTGTTCTTGTTGGAAAAGATGGAACGTCTGGCGGTGGG   EGGFFFFFGFGGEGGFGFFFGGFFFGFGGFGDGGFFGFFGGGGEFFGGFGGFGGFFGGFGFGFBFGFFGGFGFFGFGFGGFFGFFFFFFGFFFGFFFFFFFFGGFFGEFFFGFGGFFGFFFFGGFFFGGGG@GGFFGFFFEGFGFFFFFG   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YT:Z:UP

And $F.ref.chrA.list also looks fine:

V300110184L2C002R0040539179     UNK     g50383.t1       105     -8.998771       -18.616749      -92.152931      -
V300110184L2C006R0210223877     REF     g31588.t1       462     -0.057198       -8.647595777539312      -92.141655      -
V300110184L2C001R0020390451     UNK     g74259.t1       180     -0.100707       -9.718684       -92.170944      -
V300110184L2C001R0491032017     REF     g62117.t1       155     -18.648333      -22.88515760903499      -92.988495      -
V300110184L2C003R0230789834     UNK     g50383.t1       266     1.176992        -18.058962      -184.344871     -
V300110184L2C003R0141263402     REF     g20128.t1       996     -14.712079      -30.368663422663783     -184.433258     -
V300110184L2C002R0361309029     UNK     g31391.t1       200     -0.063814       -9.681791       -92.146133      -
V300110184L2C005R0560714001     REF     g31767.t1       807     -33.012351      -50.68906171133189      -185.26374      -

Some update: I made a 'fake' annotation gtf, with genes as 'chromosomes', like this:

g1.t1   .       transcript      1       615     .       +       .       transcript_id "g1.t1";
g1.t2   .       transcript      1       579     .       +       .       transcript_id "g1.t2";
g2.t1   .       transcript      1       1566    .       +       .       transcript_id "g2.t1";
g3.t1   .       transcript      1       879     .       +       .       transcript_id "g3.t1";
g4.t1   .       transcript      1       213     .       +       .       transcript_id "g4.t1";
g5.t1   .       transcript      1       435     .       +       .       transcript_id "g5.t1";

and this works, I got reads assigned to genes from featureCounts, $F.chrA.counts.txt looks like this:

# Program:featureCounts v2.0.2; Command:"featureCounts" "-T" "8" "-t" "transcript" "-g" "transcript_id" "-p" "--countReadPairs" "-a" "0.chrHap1.gtf" "-o" "A_G_1.chrHap1.counts.txt" "A_G_1.chrHap1.1.ref.bam"
Geneid  Chr     Start   End     Strand  Length  A_G_1.chrHap1.1.ref.bam
g1.t1   g1.t1   1       615     +       615     0
g1.t2   g1.t2   1       579     +       579     0
g2.t1   g2.t1   1       1566    +       1566    0
g3.t1   g3.t1   1       879     +       879     28
g4.t1   g4.t1   1       213     +       213     0
g5.t1   g5.t1   1       435     +       435     0
g6.t1   g6.t1   1       156     +       156     0
g7.t1   g7.t1   1       153     +       153     0
g8.t1   g8.t1   1       924     +       924     81
g9.t1   g9.t1   1       681     +       681     0
g10.t1  g10.t1  1       768     +       768     0
g11.t2  g11.t2  1       5847    +       5847    0
g11.t1  g11.t1  1       4317    +       4317    0
g12.t1  g12.t1  1       435     +       435     0
g13.t1  g13.t1  1       1950    +       1950    0
g14.t1  g14.t1  1       1134    +       1134    0
g15.t1  g15.t1  1       1056    +       1056    0
g15.t2  g15.t2  1       1053    +       1053    108
g15.t3  g15.t3  1       1092    +       1092    130
g16.t1  g16.t1  1       900     +       900     24
g17.t1  g17.t1  1       819     +       819     0
g18.t1  g18.t1  1       1041    +       1041    39
g18.t2  g18.t2  1       1038    +       1038    74

Please correct me if I am in wrong direction.

Best, Chongjing

tony-kuo commented 3 months ago

Ah I see. Looking at the workflow, it defaults to alignment to the transcriptome using bowtie. The other option, which is commented out, uses STAR to align with splicing to the genome. I'm not sure why I left bowtie as the default as I believe alignment to the genome is what I used in all other workflows.

I will change the workflow to default to using STAR genome alignment.

chongjing commented 3 months ago

Sounds great. Thanks.