Assess quality of GenSAS & GAWN annotations, choose one to use in analyses

laurahspencer commented 1 year ago

running RNASeq alignment with each annotation data to see how alignment rates compare
run BUSCO

kristamnichols commented 1 year ago

Run RNAseq through HiSat / STAR with the annotated genomes to assess the quality of the annotations. Which one do we want to use -- GenSAS or GAWN? Potentially run BUSCO as well on the annotated genes. @sr320 with @kubu4 will pull the fastas from the annotations @ggoetznoaa will run Hisat and/or STAR

kristamnichols commented 1 year ago

IGV is a possibility to further examine alignment of RNAseq to the annotated genes, if needed

sr320 commented 1 year ago

I vote for GenSAS - https://github.com/laurahspencer/DuMOAR/issues/25#issuecomment-1550545990

GAWN if very wonky. @ggoetznoaa what is your assessment from alignment?

ggoetznoaa commented 1 year ago

@sr320 I didn't realize the gtf/gff files were ready, I haven't done any alignments yet. Where can I grab them so I can do the alignments? @laurahspencer

laurahspencer commented 1 year ago

@ggoetznoaa check out #25 for links to all annotation files from both GAWN and GenSaS, including the gene .gff's. Let me know if you have questions!

ggoetznoaa commented 1 year ago

Ok, I see only one gff3 file for the GAWN output but I see two for the GenSAS. I'm assuming the correct one to use is the Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 file since the other one doesn't seem to include any notes about exons, CDS, or genes. Correct me if I'm wrong please.

laurahspencer commented 1 year ago

@ggoetznoaa yes that's the correct .gff from GenSaS (the other one is for repeat elements). Thanks!

ggoetznoaa commented 1 year ago

@sr320 @laurahspencer @kristamnichols So here are the results from using STAR with the scaffold only reference + the gff files. Note, I had to convert the gff files to gtf files using gffread for STAR.

GAWN:

                                 Started job on |   May 18 07:48:03
                             Started mapping on |   May 18 07:48:08
                                    Finished on |   May 18 20:26:45
       Mapping speed, Million of reads per hour |   49.72

                          Number of input reads |   628684248
                      Average input read length |   291
                                    UNIQUE READS:
                   Uniquely mapped reads number |   167964190
                        Uniquely mapped reads % |   26.72%
                          Average mapped length |   274.46
                       Number of splices: Total |   57667901
            Number of splices: Annotated (sjdb) |   51288294
                       Number of splices: GT/AG |   54740890
                       Number of splices: GC/AG |   353673
                       Number of splices: AT/AC |   38370
               Number of splices: Non-canonical |   2534968
                      Mismatch rate per base, % |   1.16%
                         Deletion rate per base |   0.17%
                        Deletion average length |   2.74
                        Insertion rate per base |   0.22%
                       Insertion average length |   3.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10658119
             % of reads mapped to multiple loci |   1.70%
        Number of reads mapped to too many loci |   478838
             % of reads mapped to too many loci |   0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   449217499
                 % of reads unmapped: too short |   71.45%
                Number of reads unmapped: other |   365602
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

GENSAS

                                 Started job on |   May 18 07:48:26
                             Started mapping on |   May 18 07:48:48
                                    Finished on |   May 18 20:22:40
       Mapping speed, Million of reads per hour |   50.04

                          Number of input reads |   628684248
                      Average input read length |   291
                                    UNIQUE READS:
                   Uniquely mapped reads number |   168192046
                        Uniquely mapped reads % |   26.75%
                          Average mapped length |   274.23
                       Number of splices: Total |   51522424
            Number of splices: Annotated (sjdb) |   9160977
                       Number of splices: GT/AG |   49391809
                       Number of splices: GC/AG |   268932
                       Number of splices: AT/AC |   17672
               Number of splices: Non-canonical |   1844011
                      Mismatch rate per base, % |   1.16%
                         Deletion rate per base |   0.17%
                        Deletion average length |   2.74
                        Insertion rate per base |   0.22%
                       Insertion average length |   3.23
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10201122
             % of reads mapped to multiple loci |   1.62%
        Number of reads mapped to too many loci |   515333
             % of reads mapped to too many loci |   0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   449406169
                 % of reads unmapped: too short |   71.48%
                Number of reads unmapped: other |   369578
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

I used default settings. The only thing I could think of changing at this point would be sjdbOverhang.

Excerpt about it from the STAR manual.

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

sr320 commented 1 year ago

Thanks

do you have information on the percent reads mapped to genes?

ggoetznoaa commented 1 year ago

What you see is all that I got out of STAR. There is a SJ.out.tab file for each run but that is talking about splice junctions. There is a quant mode I can run that will count the number of reads that map to 'genes'. Will that work for you?

With --quantMode GeneCounts option STAR will count number reads per gene while mapping.
A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-
end read are checked for overlaps. The counts coincide with those produced by htseq-count with
default parameters

sr320 commented 1 year ago

Yes please - looking for gene annotation accuracy here

On Fri, May 19, 2023 at 9:54 AM ggoetznoaa @.***> wrote:

What you see is all that I got out of STAR. There is a SJ.out.tab file for each run but that is talking about splice junctions. There is a quant mode I can run that will count the number of reads that map to 'genes'. Will that work for you?

With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired- end read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/laurahspencer/DuMOAR/issues/24*issuecomment-1554907548__;Iw!!K-Hz7m0Vt54!n52C9eDbIo4OI-Hi9J_7xp6QH0a4Eu3j9ellQYsFlv3BAjxE9tGJQJbrCn6ti3jJ3Pwya5xlNn0gguNQEw-5CAM$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PNZA52XKEIDP3ZXCOZDXG6QT3ANCNFSM6AAAAAAXLQRIJ4__;!!K-Hz7m0Vt54!n52C9eDbIo4OI-Hi9J_7xp6QH0a4Eu3j9ellQYsFlv3BAjxE9tGJQJbrCn6ti3jJ3Pwya5xlNn0gguNQyPFXG2I$ . You are receiving this because you were mentioned.Message ID: @.***>

--

Steven B. Roberts, Professor Associate Director - Graduate Program Coordinator School of Aquatic and Fishery Sciences University of Washington Fisheries Teaching and Research (FTR) Building - Office 232 1140 NE Boat Street - Seattle, WA 98105 robertslab.info - @.*** - @sr320 vm:206.866.5141 - cell:360.362.3626 schedule a zoom call: https://d.pr/gsgxVJ

ggoetznoaa commented 1 year ago

Ok, so I remapped the reads to the scaffolds only reference using the gff files from GAWN and GENSAS using STAR's quantmode GeneCounts. Here is what I got.

From the STAR manual, so you understand the different columns

STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

GAWN:

                                 Started job on |   May 19 10:11:14
                             Started mapping on |   May 19 10:11:21
                                    Finished on |   May 19 22:57:28
       Mapping speed, Million of reads per hour |   49.24

                          Number of input reads |   628684248
                      Average input read length |   291
                                    UNIQUE READS:
                   Uniquely mapped reads number |   167964190
                        Uniquely mapped reads % |   26.72%
                          Average mapped length |   274.46
                       Number of splices: Total |   57667901
            Number of splices: Annotated (sjdb) |   51288294
                       Number of splices: GT/AG |   54740890
                       Number of splices: GC/AG |   353673
                       Number of splices: AT/AC |   38370
               Number of splices: Non-canonical |   2534968
                      Mismatch rate per base, % |   1.16%
                         Deletion rate per base |   0.17%
                        Deletion average length |   2.74
                        Insertion rate per base |   0.22%
                       Insertion average length |   3.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10658119
             % of reads mapped to multiple loci |   1.70%
        Number of reads mapped to too many loci |   478838
             % of reads mapped to too many loci |   0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   449217499
                 % of reads unmapped: too short |   71.45%
                Number of reads unmapped: other |   365602
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

GeneCounts
N_unmapped     450061939 450061939 450061939
N_multimapping  10658119  10658119  10658119
N_noFeature     36265737  82925982  82750532
N_ambiguous     78636285  48312084  48287630
N_genecounts    53062168  36726124  36926028
total          628684248 628684248 628684248

8.4% unstranded mapping

GENSAS

                                 Started job on |   May 19 10:11:19
                             Started mapping on |   May 19 10:11:27
                                    Finished on |   May 19 22:49:15
       Mapping speed, Million of reads per hour |   49.78

                          Number of input reads |   628684248
                      Average input read length |   291
                                    UNIQUE READS:
                   Uniquely mapped reads number |   168192046
                        Uniquely mapped reads % |   26.75%
                          Average mapped length |   274.23
                       Number of splices: Total |   51522424
            Number of splices: Annotated (sjdb) |   9160977
                       Number of splices: GT/AG |   49391809
                       Number of splices: GC/AG |   268932
                       Number of splices: AT/AC |   17672
               Number of splices: Non-canonical |   1844011
                      Mismatch rate per base, % |   1.16%
                         Deletion rate per base |   0.17%
                        Deletion average length |   2.74
                        Insertion rate per base |   0.22%
                       Insertion average length |   3.23
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10201122
             % of reads mapped to multiple loci |   1.62%
        Number of reads mapped to too many loci |   515333
             % of reads mapped to too many loci |   0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   449406169
                 % of reads unmapped: too short |   71.48%
                Number of reads unmapped: other |   369578
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

GeneCounts
N_unmapped     450291080 450291080 450291080
N_multimapping  10201122  10201122  10201122
N_noFeature    141759082 154858765 155055338
N_ambiguous       963641    462265    465057
N_genecounts    25469323  12871016  12671651
total          628684248 628684248 628684248

4.1% unstranded mapping

ggoetznoaa commented 1 year ago

Ok, so update time. I've run both STAR using different settings as well as HISAT2 using default settings using both the GAWN and GENSAS gtf files. The STAR settings I used are as follows

    --outFilterScoreMinOverLread 0.33 \
    --outFilterMatchNminOverLread 0.33 \
    --outFilterMismatchNmax 2 \

I grabbed these from some searches on the internet, in some cases people went as far as settings the first two to 0.

STAR GAWN 37% overall mapping (33.88% unique, 3.10% multi)

STAR GENSAS 37% overall mapping (33.95% unique, 3.03% multi)

I used the default settings for HISAT2 and I used the hisat2_extract_exons.py and the hisat2_extract_splice_sites.py script to extract the exon and splice site info from the gtf files.

HISAT2 GAWN Average mapping rate 23.64%

HISAT2 GENSAS Average mapping rate 23.69%

laurahspencer commented 1 year ago

Decision is to use the GenSaS annotation file.

laurahspencer / DuMOAR

Assess quality of GenSAS & GAWN annotations, choose one to use in analyses #24