Open mdavy86 opened 6 years ago
PR #3 fixes hard coding with -n 25 getopts arg.
## -n 1 for contigs file for consistency only, there are no N gaps in it.
$ perl assemblathon_stats.pl -n 1 Kiwifruit_contig.fa.gz > test1
$ perl assemblathon_stats.pl -n 1 Kiwifruit_scaffold.fa.gz > test2
$ diff -u3 test1 test2
--- test1 2018-01-26 20:20:25.038780998 +1300
+++ test2 2018-01-26 20:07:19.846861413 +1300
@@ -1,48 +1,48 @@
----------------- Information for assembly 'Kiwifruit_contig.fa.gz' ----------------
+---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------
[ relevant contig section ]
- Number of contigs 26721 ## published
- Number of contigs in scaffolds 0
- Number of contigs not in scaffolds 26721
- Total size of contigs 604217145
+ Number of contigs 26805 ## estimate
+ Number of contigs in scaffolds 21676
+ Number of contigs not in scaffolds 5129
+ Total size of contigs 604289189
Longest contig 423496
- Shortest contig 200
- Number of contigs > 1K nt 26373 98.7%
- Number of contigs > 10K nt 12188 45.6%
+ Shortest contig 3 ## A bit small...
+ Number of contigs > 1K nt 26373 98.4%
+ Number of contigs > 10K nt 12188 45.5%
Number of contigs > 100K nt 1106 4.1%
Number of contigs > 1M nt 0 0.0%
Number of contigs > 10M nt 0 0.0%
- Mean contig size 22612
- Median contig size 7933
- N50 contig length 58864 ## published
- L50 contig count 2977
+ Mean contig size 22544
+ Median contig size 7857
+ N50 contig length 58840 ## estimate
+ L50 contig count 2978
contig %A 32.54
contig %C 17.59
contig %G 17.60
The published N50 is 58864 with 26721 contigs, and the scaffold fasta contig estimate of N50 is 58840 with 26805 contigs, the difference of 26805 - 26721 = 84 due to assembly scaffolding process.
Any metrics from assemblathon_stats.pl for Contigs calculated from Scaffolds are highly biased upwards in the order of 100%. This does not effect Scaffold metrics, only Contig related metrics when a Scaffold file is used as input.
What is happening is scaffolds are being split by default every N=25 bases, which is hard coded on L143 but other N break points are not being split for Contigs creating longer pseudo-contigs distorting the Nx metrics upwards.
Data for this example for reproducibility is available from;
HongYang Test
We have a fasta file of scaffolds, and a fasta file of contigs, we can run assemblathon_stats.pl against both to compare calculated metrics. If we do this what we find is an N50 contig length of 58864 in Kiwifruit_contig.fa, but when we calculate the N50 contig length in Kiwifruit_scaffold.fa.gz we get 117093, which is twice the amount.
We know that the contig N50 calculation is 58864, which is submitted and verified in NCBI
https://www.ncbi.nlm.nih.gov/assembly/GCA_000467755.1
Differencing the files to compare results;
The break size between HongYang scaffolds is n=2000, if we explicitly specify this in the call to
assemblathon_stats.pl
we get even more spurious results;Now the N50 has increased to 140,261, it should be 58,864.