marbl / SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
MIT License
178 stars 47 forks source link

FINAL.fasta generated from penultimate iteration. #54

Closed kfletcher88 closed 5 years ago

kfletcher88 commented 5 years ago

Hi,

Thanks for this tool.

I have been using the latest version to scaffold an assembly, however I have noticed that the statistics of the output assembly do not equal the statistics of the final iteration reported. Is there are reason for this, or could SALSA have continued into additional iterations?

I notice in previous issues that it has been mentioned that the -i parameter is now overwritten and iterations continue until the data is exhausted (https://github.com/machinegun/SALSA/issues/44#issuecomment-435959312 & https://github.com/machinegun/SALSA/issues/24#issuecomment-377586472), but the example below would suggest that further scaffolding could be performed and that there is evidence for the largest scaffold still increasing by 35%. My command did not specify number of iterations.

Would appreciate any help you can give.

Thanks

Example data:

#stats.sh from bbmap shows longest scaffold ~4.5 Mb and assembly contains 5064 scaffolds.
$ stats.sh scaffolds/scaffolds_FINAL.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2690  0.2310  0.2312  0.2688  0.0065  0.0000  0.0000  0.4623  0.0351

Main genome scaffold total:             5064
Main genome contig total:               7811
Main genome scaffold sequence total:    126.785 MB
Main genome contig sequence total:      125.957 MB      0.653% gap
Main genome scaffold N/L50:             53/597.009 KB
Main genome contig N/L50:               469/74.661 KB
Main genome scaffold N/L90:             854/12.856 KB
Main genome contig N/L90:               2220/8.207 KB
Max scaffold length:                    4.526 MB
Max contig length:                      697.806 KB
Number of scaffolds > 50 KB:            282
% main genome in scaffolds > 50 KB:     78.55%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  5,064           7,811     126,785,198     125,957,098    99.35%
    500                  5,064           7,811     126,785,198     125,957,098    99.35%
   1 KB                  5,058           7,805     126,779,354     125,951,254    99.35%
 2.5 KB                  2,510           4,990     122,884,307     122,080,714    99.35%
   5 KB                  1,541           3,800     119,528,414     118,790,761    99.38%
  10 KB                    986           3,001     115,618,480     114,968,507    99.44%
  25 KB                    532           2,294     108,290,175     107,756,368    99.51%
  50 KB                    282           1,824      99,591,345      99,140,714    99.55%
 100 KB                    177           1,515      92,416,321      92,043,085    99.60%
 250 KB                     91           1,157      79,149,858      78,866,455    99.64%
 500 KB                     61             972      67,866,447      67,631,495    99.65%
   1 MB                     25             577      42,364,736      42,229,476    99.68%
 2.5 MB                      3             121      10,664,405      10,638,478    99.76%

#Longest scaffold from iteration 5 is larger and contains fewer scaffolds.
$ wc -l scaffolds/scaffold_length_iteration_5 && sort -nrk2,2 scaffolds/scaffold_length_iteration_5 | head -1
4970 scaffolds/scaffold_length_iteration_5
scaffold_1643   6128428

#Longest scaffold from iteration 4 more similar to the output assembly. Scaffold count the same.
$ wc -l scaffolds/scaffold_length_iteration_4 && sort -nrk2,2 scaffolds/scaffold_length_iteration_4 | head -1
5064 scaffolds/scaffold_length_iteration_4
scaffold_3642   4518477

The original SALSA command was:

python run_pipeline.py -a jelly.out.fasta -l jelly.out.fasta.fai -b alignment.bed -e GATC -o scaffolds &> salsa.log

The log file shows that iteration 5 began, but maybe didn't complete?:

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 6444 edges = 4125
Hi-C implied edges = 0
Starting Iteration 2
bedfile started
bedfile loaded
Starting Iteration 2
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4398 edges = 2470
Hi-C implied edges = 0
Starting Iteration 3
bedfile started
bedfile loaded
Starting Iteration 3
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4364 edges = 2346
Hi-C implied edges = 0
Starting Iteration 4
bedfile started
bedfile loaded
Starting Iteration 4
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4270 edges = 2248
Hi-C implied edges = 0
python RE_sites.py -a scaffolds/assembly.cleaned.fasta -e GATC > scaffolds/re_counts_iteration_1
python make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc
python fast_scaled_scores.py -d scaffolds -i 1
sort -k 5 -gr scaffolds/contig_links_scaled_iteration_1 > scaffolds/contig_links_scaled_sorted_iteration_1
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_1 -c 1000 -i 1 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_2.bed -b scaffolds/breakpoints_iteration_2.txt -l scaffolds/scaffold_length_iteration_2 -i 2 -s 100   > scaffolds/misasm_iteration_2.report
python refactor_breaks.py -d scaffolds -i 2
python make_links.py -b scaffolds/alignment_iteration_2.bed -d scaffolds -i 2
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_2 -c 1000 -i 2 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_3.bed -b scaffolds/breakpoints_iteration_3.txt -l scaffolds/scaffold_length_iteration_3 -i 3 -s 100  > scaffolds/misasm_iteration_3.report
python refactor_breaks.py -d scaffolds -i 3 > scaffolds/misasm_3.log
python make_links.py -b scaffolds/alignment_iteration_3.bed -d scaffolds -i 3
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_3 -c 1000 -i 3 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_4.bed -b scaffolds/breakpoints_iteration_4.txt -l scaffolds/scaffold_length_iteration_4 -i 4 -s 100  > scaffolds/misasm_iteration_4.report
python refactor_breaks.py -d scaffolds -i 4 > scaffolds/misasm_4.log
python make_links.py -b scaffolds/alignment_iteration_4.bed -d scaffolds -i 4
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_4 -c 1000 -i 4 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_5.bed -b scaffolds/breakpoints_iteration_5.txt -l scaffolds/scaffold_length_iteration_5 -i 5 -s 100  > scaffolds/misasm_iteration_5.report
python refactor_breaks.py -d scaffolds -i 5 > scaffolds/misasm_5.log
kfletcher88 commented 5 years ago

Subsequent to this I have tried setting the iteration parameter. Using the exact same inputs I set -i 10 and ran SALSA. This time the output presents 12 iterations and the assembly output is reflective of the 11th. This means that there is evidence for further joins (in this case only 14) which are not being presented in the final assembly. I will set -i higher to see if it keeps going beyond this.

Example: Command

$ python run_pipeline.py -a jelly.out.fasta -l jelly.out.fasta.fai -b alignment.bed -e GATC -o scaffolds-i10 -i 10 &> salsa_i10.log

Abbreviated output assembly stats:

$ stats.sh scaffolds-i10/scaffolds_FINAL.fasta
... Main genome scaffold total:             4757
... Max scaffold length:                    17.669 MB

Scaffold count and max scaffold length for iterations 12 and 11

$ wc -l scaffolds-i10/scaffold_length_iteration_12 && sort -nrk2,2 scaffolds-i10/scaffold_length_iteration_12 | head -1
4743 scaffolds-i10/scaffold_length_iteration_12
scaffold_4259   17622210

$ wc -l scaffolds-i10/scaffold_length_iteration_11 && sort -nrk2,2 scaffolds-i10/scaffold_length_iteration_11 | head -1
4757 scaffolds-i10/scaffold_length_iteration_11
scaffold_489    17622210

Abbreviated log file again shows iteration 12 as beginning, not sure why it terminated though:

...
python layout_unitigs.py -x abc -l scaffolds-i10/contig_links_scaled_sorted_iteration_9 -c 1000 -i 9 -d scaffolds-i10
break_contigs -a scaffolds-i10/alignment_iteration_10.bed -b scaffolds-i10/breakpoints_iteration_10.txt -l scaffolds-i10/scaffold_length_iteration_10 -i 10 -s 100  > scaffolds-i10/misasm_iteration_10.report
python refactor_breaks.py -d scaffolds-i10 -i 10 > scaffolds-i10/misasm_10.log
python make_links.py -b scaffolds-i10/alignment_iteration_10.bed -d scaffolds-i10 -i 10
python layout_unitigs.py -x abc -l scaffolds-i10/contig_links_scaled_sorted_iteration_10 -c 1000 -i 10 -d scaffolds-i10
break_contigs -a scaffolds-i10/alignment_iteration_11.bed -b scaffolds-i10/breakpoints_iteration_11.txt -l scaffolds-i10/scaffold_length_iteration_11 -i 11 -s 100  > scaffolds-i10/misasm_iteration_11.report
python refactor_breaks.py -d scaffolds-i10 -i 11 > scaffolds-i10/misasm_11.log
python make_links.py -b scaffolds-i10/alignment_iteration_11.bed -d scaffolds-i10 -i 11
python layout_unitigs.py -x abc -l scaffolds-i10/contig_links_scaled_sorted_iteration_11 -c 1000 -i 11 -d scaffolds-i10
break_contigs -a scaffolds-i10/alignment_iteration_12.bed -b scaffolds-i10/breakpoints_iteration_12.txt -l scaffolds-i10/scaffold_length_iteration_12 -i 12 -s 100  > scaffolds-i10/misasm_iteration_12.report
python refactor_breaks.py -d scaffolds-i10 -i 12 > scaffolds-i10/misasm_12.log
ghuryejay commented 5 years ago

Hi,

Although joins can be made during scaffolding may not mean that they are correct. SALSA ties to keep as many high confident joins as possible in the scaffolds. Also, if it runs for 12 iterations, it would output scaffolds at 11th iteration as final scaffolds, because in 12 iterations, most new joins are flagged as wrong.