ucdavis-bioinformatics / assemblathon2-analysis

collection of scripts and commands used by Ian Korf, Keith Bradnam, and Joe Fass in the analysis of Assemblathon 2 entries (assemblies)

Contig related Nx metrics from Scaffolds are biased upwards #2

Open mdavy86 opened 6 years ago

mdavy86 commented 6 years ago

Any Contig metrics that assemblathon_stats.pl calculates from Scaffolds are biased upwards, on the order of 100%. This does not affect Scaffold metrics, only Contig-related metrics when a Scaffold file is used as input.

What is happening is that scaffolds are split into contigs only at runs of Ns that reach a default threshold of N=25, which is hard coded on L143. Shorter N break points are not split, so the contigs on either side stay joined as longer pseudo-contigs, which distorts the contig Nx metrics upwards.
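To illustrate the effect, here is a minimal Python sketch, not the code of assemblathon_stats.pl itself. The helper names `split_contigs` and `n50` are hypothetical, and the toy scaffold is made up. It assumes contigs are recovered by splitting a scaffold at every run of at least `min_n_run` Ns.

```python
import re


def split_contigs(scaffold: str, min_n_run: int) -> list[str]:
    """Split a scaffold into contigs at every run of at least min_n_run Ns."""
    pieces = re.split(r"[Nn]{%d,}" % min_n_run, scaffold)
    # Drop empty pieces produced by leading or trailing N runs.
    return [p for p in pieces if p]


def n50(lengths: list[int]) -> int:
    """Largest length L such that sequences of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy scaffold (assumption): three contigs joined by gaps of 10 Ns and 30 Ns.
scaffold = "A" * 100 + "N" * 10 + "C" * 80 + "N" * 30 + "G" * 60

for threshold in (1, 25):
    lengths = [len(c) for c in split_contigs(scaffold, threshold)]
    print(threshold, lengths, n50(lengths))
# 1 [100, 80, 60] 80
# 25 [190, 60] 190
```

With a threshold of 25 the 10 N gap is not treated as a break, so two contigs are reported as one 190 nt pseudo-contig and the N50 more than doubles.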

Data for reproducing this example are available from:

HongYang Test

We have a fasta file of scaffolds and a fasta file of contigs, so we can run assemblathon_stats.pl against both and compare the calculated metrics. Doing so gives a contig N50 of 58864 for Kiwifruit_contig.fa.gz, but a contig N50 of 117093 for Kiwifruit_scaffold.fa.gz, which is roughly twice the value.

We know that the correct contig N50 is 58864, the value submitted to and verified by NCBI:

https://www.ncbi.nlm.nih.gov/assembly/GCA_000467755.1

## Contig stats
assemblathon_stats.pl Kiwifruit_contig.fa.gz > test1

## Scaffold stats
assemblathon_stats.pl Kiwifruit_scaffold.fa.gz > test2

Diffing the two output files to compare results:

$ diff -u3 test1 test2
--- test1       2018-01-23 10:03:16.770015274 +1300
+++ test2       2018-01-23 10:43:51.605648499 +1300
@@ -1,48 +1,48 @@

----------------- Information for assembly 'Kiwifruit_contig.fa.gz' ----------------
+---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------

-                                         Number of scaffolds      26721                 ## The first part we expect to be different, contigs versus scaffolds 
-                                     Total size of scaffolds  604217145
-                                            Longest scaffold     423496
-                                           Shortest scaffold        200
-                                 Number of scaffolds > 1K nt      26373  98.7%
-                                Number of scaffolds > 10K nt      12188  45.6%
-                               Number of scaffolds > 100K nt       1106   4.1%
-                                 Number of scaffolds > 1M nt          0   0.0%
+                                         Number of scaffolds       7698
+                                     Total size of scaffolds  616114069
+                                            Longest scaffold    3410229
+                                           Shortest scaffold        896
+                                 Number of scaffolds > 1K nt       7620  99.0%
+                                Number of scaffolds > 10K nt       2131  27.7%
+                               Number of scaffolds > 100K nt       1152  15.0%
+                                 Number of scaffolds > 1M nt        129   1.7%
                                 Number of scaffolds > 10M nt          0   0.0%
-                                          Mean scaffold size      22612
-                                        Median scaffold size       7933
-                                         N50 scaffold length      58864
-                                          L50 scaffold count       2977
-                                                 scaffold %A      32.54
-                                                 scaffold %C      17.59
-                                                 scaffold %G      17.60
-                                                 scaffold %T      32.27
-                                                 scaffold %N       0.00
+                                          Mean scaffold size      80036
+                                        Median scaffold size       3358
+                                         N50 scaffold length     646786
+                                          L50 scaffold count        280
+                                                 scaffold %A      31.92
+                                                 scaffold %C      17.25
+                                                 scaffold %G      17.26
+                                                 scaffold %T      31.65
+                                                 scaffold %N       1.92
                                          scaffold %non-ACGTN       0.00
                              Number of scaffold non-ACGTN nt          0

-                Percentage of assembly in scaffolded contigs       0.0%
-              Percentage of assembly in unscaffolded contigs     100.0%
-                      Average number of contigs per scaffold        1.0
-Average length of break (>25 Ns) between contigs in scaffold          0
+                Percentage of assembly in scaffolded contigs      93.7%
+              Percentage of assembly in unscaffolded contigs       6.3%
+                      Average number of contigs per scaffold        2.0
+Average length of break (>25 Ns) between contigs in scaffold       1507

-                                           Number of contigs      26721                ## The second part should be the same: 26721 contigs, versus 15529 contigs in the scaffolds file (9758 of them in scaffolds)
-                              Number of contigs in scaffolds          0
-                          Number of contigs not in scaffolds      26721
-                                       Total size of contigs  604217145
-                                              Longest contig     423496
-                                             Shortest contig        200
-                                   Number of contigs > 1K nt      26373  98.7%
-                                  Number of contigs > 10K nt      12188  45.6%
-                                 Number of contigs > 100K nt       1106   4.1%
+                                           Number of contigs      15529
+                              Number of contigs in scaffolds       9758
+                          Number of contigs not in scaffolds       5771
+                                       Total size of contigs  604305128
+                                              Longest contig     830300
+                                             Shortest contig         65
+                                   Number of contigs > 1K nt      15348  98.8%
+                                  Number of contigs > 10K nt       7647  49.2%
+                                 Number of contigs > 100K nt       1895  12.2%
                                    Number of contigs > 1M nt          0   0.0%
                                   Number of contigs > 10M nt          0   0.0%
-                                            Mean contig size      22612
-                                          Median contig size       7933
-                                           N50 contig length      58864               ## N50 is considerably different 58864 versus 117093
-                                            L50 contig count       2977
+                                            Mean contig size      38915
+                                          Median contig size       9483
+                                           N50 contig length     117093
+                                            L50 contig count       1517
                                                    contig %A      32.54
                                                    contig %C      17.59
                                                    contig %G      17.60

The break size in the HongYang scaffolds is n=2000. If we specify this explicitly in the call to assemblathon_stats.pl, the results are even more distorted:

$ assemblathon_stats.pl -n 2000 Kiwifruit_scaffold.fa.gz
---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------

                                         Number of scaffolds       7698
                                     Total size of scaffolds  616114069
                                            Longest scaffold    3410229
                                           Shortest scaffold        896
                                 Number of scaffolds > 1K nt       7620  99.0%
                                Number of scaffolds > 10K nt       2131  27.7%
                               Number of scaffolds > 100K nt       1152  15.0%
                                 Number of scaffolds > 1M nt        129   1.7%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size      80036
                                        Median scaffold size       3358
                                         N50 scaffold length     646786
                                          L50 scaffold count        280
                                                 scaffold %A      31.92
                                                 scaffold %C      17.25
                                                 scaffold %G      17.26
                                                 scaffold %T      31.65
                                                 scaffold %N       1.92
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

                Percentage of assembly in scaffolded contigs      73.5%
              Percentage of assembly in unscaffolded contigs      26.5%
                      Average number of contigs per scaffold        1.8
Average length of break (>25 Ns) between contigs in scaffold       1507

                                           Number of contigs      13777
                              Number of contigs in scaffolds       7076
                          Number of contigs not in scaffolds       6701
                                       Total size of contigs  605485125
                                              Longest contig    1554749
                                             Shortest contig         65
                                   Number of contigs > 1K nt      13656  99.1%
                                  Number of contigs > 10K nt       6737  48.9%
                                 Number of contigs > 100K nt       1840  13.4%
                                   Number of contigs > 1M nt          8   0.1%
                                  Number of contigs > 10M nt          0   0.0%
                                            Mean contig size      43949
                                          Median contig size       9240
                                           N50 contig length     140261
                                            L50 contig count       1175
                                                   contig %A      32.48
                                                   contig %C      17.56
                                                   contig %G      17.56
                                                   contig %T      32.21
                                                   contig %N       0.20
                                           contig %non-ACGTN       0.00
                               Number of contig non-ACGTN nt          0

Now the contig N50 has increased to 140,261; it should be 58,864.
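Before choosing a value for -n, it can help to check which N-run lengths actually occur in the scaffold sequences. The following is a rough Python sketch of that check. The function name `n_run_lengths` is hypothetical, and the toy string stands in for a sequence read from the scaffold fasta file.

```python
import re
from collections import Counter


def n_run_lengths(sequence: str) -> Counter:
    """Count how often each length of consecutive-N run occurs in a sequence."""
    return Counter(len(m.group()) for m in re.finditer(r"[Nn]+", sequence))


# Toy scaffold (assumption): two short 3 N gaps and one 2000 N gap.
scaffold = "ACGT" + "N" * 3 + "ACGT" + "N" * 2000 + "AC" + "N" * 3 + "GT"

for length, count in sorted(n_run_lengths(scaffold).items()):
    print(length, count)
# 3 2
# 2000 1
```

If short runs like these are present, any threshold above their length leaves the flanking contigs joined, which would be consistent with the inflated contig counts and N50 shown above.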

mdavy86 commented 6 years ago

PR #3 fixes the hard coding with a getopts argument, -n, which defaults to 25.

## -n 1 for contigs file for consistency only, there are no N gaps in it.
$ perl assemblathon_stats.pl -n 1 Kiwifruit_contig.fa.gz > test1
$ perl assemblathon_stats.pl -n 1 Kiwifruit_scaffold.fa.gz > test2

$ diff -u3 test1 test2
--- test1       2018-01-26 20:20:25.038780998 +1300
+++ test2       2018-01-26 20:07:19.846861413 +1300
@@ -1,48 +1,48 @@

----------------- Information for assembly 'Kiwifruit_contig.fa.gz' ----------------
+---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------

[ relevant contig section ]

-                                           Number of contigs      26721        ## published
-                              Number of contigs in scaffolds          0
-                          Number of contigs not in scaffolds      26721
-                                       Total size of contigs  604217145
+                                           Number of contigs      26805        ## estimate
+                              Number of contigs in scaffolds      21676
+                          Number of contigs not in scaffolds       5129
+                                       Total size of contigs  604289189
                                               Longest contig     423496
-                                             Shortest contig        200
-                                   Number of contigs > 1K nt      26373  98.7%
-                                  Number of contigs > 10K nt      12188  45.6%
+                                             Shortest contig          3        ## A bit small...
+                                   Number of contigs > 1K nt      26373  98.4%
+                                  Number of contigs > 10K nt      12188  45.5%
                                  Number of contigs > 100K nt       1106   4.1%
                                    Number of contigs > 1M nt          0   0.0%
                                   Number of contigs > 10M nt          0   0.0%
-                                            Mean contig size      22612
-                                          Median contig size       7933
-                                           N50 contig length      58864        ## published
-                                            L50 contig count       2977
+                                            Mean contig size      22544
+                                          Median contig size       7857
+                                           N50 contig length      58840        ## estimate
+                                            L50 contig count       2978
                                                    contig %A      32.54
                                                    contig %C      17.59
                                                    contig %G      17.60

The published N50 is 58864 with 26721 contigs, while the contig estimate from the scaffold fasta is an N50 of 58840 with 26805 contigs. The difference of 26805 - 26721 = 84 contigs is due to the assembly scaffolding process.
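As a quick arithmetic check, the contig N50 values reported in this thread can be expressed as a relative bias against the published value. This is a small Python sketch; the helper `relative_bias` is hypothetical and the numbers are copied from the outputs above.

```python
def relative_bias(estimate: float, reference: float) -> float:
    """Relative difference of estimate from reference, in percent."""
    return (estimate - reference) / reference * 100


published_n50 = 58864

# Contig N50 values calculated from Kiwifruit_scaffold.fa.gz in this thread.
estimates = {
    "default threshold (25)": 117093,
    "-n 2000": 140261,
    "-n 1": 58840,
}

for label, value in estimates.items():
    print(f"{label}: {relative_bias(value, published_n50):+.2f}%")
```

The first two settings overestimate the contig N50 by roughly 99% and 138%, while -n 1 comes within about 0.04% of the published value.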