mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

No synteny blocks found in spite of having block coverage of 89.86% #62

Closed drashwinelkar closed 1 year ago

drashwinelkar commented 4 years ago

I am running Ragout with a genome of the same organism as the reference for assembly, but I still get the following error:

[11:47:30] root: INFO: Starting Ragout v2.3
[11:47:30] root: INFO: Running withs synteny block sizes '[5000, 500, 100]'
[11:47:30] root: WARNING: Using existing Sibelia results from previous run
[11:47:30] root: WARNING: Use --overwrite to force alignment
[11:47:30] root: INFO: Inferring phylogeny from synteny blocks data
[11:47:30] root: INFO: Reading /home/ashwin/Projects/tetanus/trimmed_raw_data/ragout_container/sibelia-workdir/100/blocks_coords.txt
[11:47:30] root: INFO: "ctet_harv" synteny blocks coverage: 92.29%
[11:47:30] root: INFO: "ctet_SIIPL" synteny blocks coverage: 99.75%
[11:47:30] root: DEBUG: Read 4 reference sequences
[11:47:30] root: DEBUG: Read 49 target sequences
[11:47:30] root: DEBUG: 43 target sequences left after indel filtering
[11:47:30] root: DEBUG: 1 target sequences left after repeat filtering
[11:47:30] root: DEBUG: Branch lengths: [1e-06, 1e-06], mu = 1000000.000000
[11:47:30] root: INFO: Inferred tree: ('ctet_SIIPL' : 1e-06, 'ctet_harv' : 1e-06)
[11:47:30] root: INFO: 'ctet_harv' is chosen as a naming reference
[11:47:30] root: INFO: Processing permutation files
[11:47:30] root: INFO: Reading /home/ashwin/Projects/tetanus/trimmed_raw_data/ragout_container/sibelia-workdir/5000/blocks_coords.txt
[11:47:30] root: INFO: "ctet_harv" synteny blocks coverage: 85.77%
[11:47:30] root: INFO: "ctet_SIIPL" synteny blocks coverage: 89.86%
[11:47:30] root: DEBUG: Read 4 reference sequences
[11:47:30] root: DEBUG: Read 27 target sequences
[11:47:30] root: DEBUG: 27 target sequences left after indel filtering
[11:47:30] root: DEBUG: 0 target sequences left after repeat filtering
[11:47:30] root: ERROR: An error occured while running Ragout:
[11:47:30] root: ERROR: No synteny blocks found in the target genome after repeat/indel filtering.

Why do I get no synteny blocks even with a coverage of 89%? Any idea how to solve this? I am using Ragout 2.3 compiled from source from GitHub.

drashwinelkar commented 4 years ago

Sorry, I forgot to mention: the assembly was done using ABySS 2.1.

mikolmogorov commented 4 years ago

Hi,

Could you provide more info on the genomes and assembly statistics? Is assembly size similar to the reference genome size?

Basically, the run failed because the entire assembled genome was marked as a repeat. That means each synteny block in the target genome was repeated at least twice in the reference. Definitely unusual to see.

Best, Mikhail
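
To make the repeat-filtering explanation above concrete, here is a toy sketch in Python. It is not Ragout's code. The rule that a block counts as a repeat when it occurs more than once within any single genome is an assumption made for illustration, and the function names and block IDs are hypothetical.

```python
from collections import Counter


def repeat_blocks(genomes):
    """Return IDs of blocks that occur more than once within any one genome.

    `genomes` maps a genome name to a list of sequences; each sequence is a
    list of signed block IDs (the sign encodes orientation).
    Assumption: this mirrors the idea of repeat filtering, not Ragout's code.
    """
    repeats = set()
    for sequences in genomes.values():
        counts = Counter(abs(b) for seq in sequences for b in seq)
        repeats.update(b for b, n in counts.items() if n > 1)
    return repeats


def filter_target(genomes, target):
    """Remove repeat blocks from the target and drop contigs left empty."""
    repeats = repeat_blocks(genomes)
    kept = []
    for seq in genomes[target]:
        unique = [b for b in seq if abs(b) not in repeats]
        if unique:
            kept.append(unique)
    return kept


# Hypothetical data: every block is duplicated in the reference,
# so no target contig keeps a usable block.
genomes = {
    "ref": [[1, 2, 3, 1, 2, 3]],
    "target": [[1, 2], [-3]],
}
remaining = filter_target(genomes, "target")
print(len(remaining), "target sequences left after repeat filtering")
```

With a reference in which each block occurs once, the same target would keep both contigs, which is why a duplicated reference or assembly is the first thing to rule out.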

dzolier commented 1 year ago

I am experiencing this same problem. My recipe file (which has species names and accession numbers, if it helps) looks like this:

#reference and target genome names (required)
.references = B_amiloliquefaciens,B_safensisAHB11,B_safensisPgKB20,B_velezensis
.target = T9_Bacillus

#paths to genome fasta files (required for Sibelia)
B_amiloliquefaciens.fasta = /mnt/d/Ragout/ref_genomes/filtered_bin_05-Bacillus/references/good_references/B_amiloliquefaciens_GCF_000242855.2_ASM24285v2_genomic.fasta
B_safensisAHB11.fasta = /mnt/d/Ragout/ref_genomes/filtered_bin_05-Bacillus/references/good_references/B_safensisAHB11_GCF_023716825.1_ASM2371682v1_genomic.fasta
B_safensisPgKB20.fasta = /mnt/d/Ragout/ref_genomes/filtered_bin_05-Bacillus/references/good_references/B_safensisPgKB20_GCF_008244765.1_ASM824476v1_genomic.fasta
B_velezensis.fasta = /mnt/d/Ragout/ref_genomes/filtered_bin_05-Bacillus/references/good_references/B_velezensis_GCF_000769555.1_ASM76955v1_genomic.fasta
T9_Bacillus.fasta = /mnt/d/Ragout/ref_genomes/bins_for_assembly/trimmed_metawrap_50_10_bins/trimmed_bin.9.fasta

.blocks = small

And my ragout.log for this run looks like this:

[18:29:06] root: INFO: Starting Ragout v2.3
[18:29:06] root: INFO: Running withs synteny block sizes '[5000, 500, 100]'
[18:29:07] root: INFO: Running Sibelia with block size 5000
[18:31:54] root: INFO: Running Sibelia with block size 500
[18:34:38] root: INFO: Running Sibelia with block size 100
[18:37:24] root: INFO: Inferring phylogeny from synteny blocks data
[18:37:24] root: INFO: Reading /mnt/d/Ragout/9T_out/sibelia-workdir/100/blocks_coords.txt
[18:37:24] root: INFO: "B_amiloliquefaciens" synteny blocks coverage: 94.18%
[18:37:24] root: INFO: "B_safensisAHB11" synteny blocks coverage: 96.19%
[18:37:24] root: INFO: "B_safensisPgKB20" synteny blocks coverage: 92.69%
[18:37:24] root: INFO: "B_velezensis" synteny blocks coverage: 92.29%
[18:37:24] root: INFO: "T9_Bacillus" synteny blocks coverage: 95.38%
[18:37:24] root: DEBUG: Read 6 reference sequences
[18:37:24] root: DEBUG: Read 17 target sequences
[18:37:24] root: DEBUG: 17 target sequences left after indel filtering
[18:37:24] root: DEBUG: 17 target sequences left after repeat filtering
[18:37:24] root: DEBUG: Branch lengths: [11.8125, 19.833333333333332, 15.166666666666668, 11.8125, 41.375, 7.625, 1.125, 5.875], mu = 0.084656
[18:37:24] root: INFO: Inferred tree: (('B_safensisAHB11' : 19.833333333333332, 'T9_Bacillus' : 15.166666666666668) : 11.8125, ('B_safensisPgKB20' : 41.375, ('B_amiloliquefaciens' : 1.125, 'B_velezensis' : 5.875) : 7.625) : 11.8125)
[18:37:24] root: INFO: 'B_safensisAHB11' is chosen as a naming reference
[18:37:24] root: INFO: Processing permutation files
[18:37:24] root: INFO: Reading /mnt/d/Ragout/9T_out/sibelia-workdir/5000/blocks_coords.txt
[18:37:24] root: INFO: "B_amiloliquefaciens" synteny blocks coverage: 91.66%
[18:37:24] root: INFO: "B_safensisAHB11" synteny blocks coverage: 90.58%
[18:37:24] root: INFO: "B_safensisPgKB20" synteny blocks coverage: 88.03%
[18:37:24] root: INFO: "B_velezensis" synteny blocks coverage: 90.33%
[18:37:24] root: INFO: "T9_Bacillus" synteny blocks coverage: 90.33%
[18:37:24] root: DEBUG: Read 4 reference sequences
[18:37:24] root: DEBUG: Read 17 target sequences
[18:37:24] root: DEBUG: 0 target sequences left after indel filtering
[18:37:24] root: DEBUG: 0 target sequences left after repeat filtering
[18:37:24] root: ERROR: An error occured while running Ragout:
[18:37:24] root: ERROR: No synteny blocks found in the target genome after repeat/indel filtering.

The bin I'm using has 17 contigs, N50 = 692,066, total number of bases in bin = 3,662,481. CheckM says it's a Bacillus bin; kaiju appears to agree.
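
For comparing assembly size against each reference, the statistics quoted above (number of contigs, N50, total bases) can be computed for any FASTA file with a few lines of Python. This is a minimal sketch using only the standard library; `read_lengths`, `n50` and `summarize` are hypothetical helper names, not part of Ragout.

```python
def read_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(fasta_lines):
    """Collect sequence count, total bases and N50 for one genome."""
    lengths = read_lengths(fasta_lines)
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
    }
```

A file can be passed directly as an open handle, for example `with open("trimmed_bin.9.fasta") as fh: print(summarize(fh))`. Running it on the target and on every reference makes a large size mismatch easy to spot.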

mikolmogorov commented 1 year ago

Based on the log, it seems that all informative synteny blocks were filtered as repeats. Could it be that one of the references (or the assembly) contains duplications? What are the CheckM scores for the reference and assembly (completeness and contamination)?

Mikhail
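
When reading logs like the ones in this thread, it helps to pull out how many target sequences survive each filtering step at each block size. The sketch below does that with regular expressions written against the log lines quoted here. The patterns are an assumption based on these excerpts only, and `filtering_summary` is a hypothetical helper.

```python
import re

# Patterns derived from the log excerpts in this thread (an assumption,
# not a documented log format).
READING = re.compile(r"Reading .*?/(\d+)/blocks_coords\.txt")
READ_TARGET = re.compile(r"Read (\d+) target sequences")
LEFT = re.compile(r"(\d+) target sequences left after (\w+) filtering")


def filtering_summary(log_lines):
    """Return (block_size, stage, count) tuples in log order.

    block_size is None when the log does not name the blocks_coords.txt path.
    """
    summary = []
    block_size = None
    for line in log_lines:
        match = READING.search(line)
        if match:
            block_size = int(match.group(1))
            continue
        match = READ_TARGET.search(line)
        if match:
            summary.append((block_size, "read", int(match.group(1))))
            continue
        match = LEFT.search(line)
        if match:
            summary.append((block_size, match.group(2), int(match.group(1))))
    return summary


# Lines copied from the block-size-5000 stage of the log above.
sample = [
    "[18:37:24] root: INFO: Reading /mnt/d/Ragout/9T_out/sibelia-workdir/5000/blocks_coords.txt",
    "[18:37:24] root: DEBUG: Read 17 target sequences",
    "[18:37:24] root: DEBUG: 0 target sequences left after indel filtering",
    "[18:37:24] root: DEBUG: 0 target sequences left after repeat filtering",
]
for block_size, stage, count in filtering_summary(sample):
    print(block_size, stage, count)
```

For these sample lines the count goes from 17 sequences read to 0 at the indel step for block size 5000, so the summary shows which filter removed the sequences.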

dzolier commented 1 year ago

CheckM (a la metaWRAP bin_refine) says

completeness = 99.58
contamination = 0.207

mikolmogorov commented 1 year ago

@dzolier sorry for the late response. Which genome are these CheckM values from? I'd need to see them for each reference and for the target genome separately. If there is a duplication, we will see high "contamination" rates. Could you also share 9T_out/sibelia-workdir/?

mattbawn commented 1 year ago

Hi @fenderglass, I am also seeing something similar. I have previously used Ragout on Salmonella genomes but am now attempting to use it on S. aureus. My log file reads:


[13:39:36] root: INFO: Starting Ragout v2.0
[13:39:37] root: INFO: Running Sibelia with block size 5000
[13:44:26] root: INFO: Running Sibelia with block size 500
[13:49:17] root: INFO: Running Sibelia with block size 100
[13:54:09] root: INFO: Inferring phylogeny from synteny blocks data
[13:54:09] root: DEBUG: Reading permutation file
[13:54:09] root: DEBUG: "fixstart_74" synteny blocks coverage: 91.48%
[13:54:09] root: DEBUG: "fixstart_98" synteny blocks coverage: 95.93%
[13:54:09] root: DEBUG: "fixstart_99" synteny blocks coverage: 98.39%
[13:54:09] root: DEBUG: "fixstart_110" synteny blocks coverage: 96.83%
[13:54:09] root: DEBUG: "fixstart_61" synteny blocks coverage: 96.35%
[13:54:09] root: DEBUG: "fixstart_72" synteny blocks coverage: 91.5%
[13:54:09] root: DEBUG: "fixstart_73" synteny blocks coverage: 91.53%
[13:54:09] root: DEBUG: "fixstart_93" synteny blocks coverage: 97.07%
[13:54:09] root: DEBUG: "fixstart_84" synteny blocks coverage: 98.01%
[13:54:09] root: DEBUG: "assembly_ragout" synteny blocks coverage: 88.24%
[13:54:09] root: DEBUG: "fixstart_57" synteny blocks coverage: 95.55%
[13:54:09] root: DEBUG: "fixstart_4" synteny blocks coverage: 95.48%
[13:54:09] root: DEBUG: "fixstart_5" synteny blocks coverage: 98.59%
[13:54:09] root: DEBUG: "fixstart_7" synteny blocks coverage: 98.14%
[13:54:09] root: DEBUG: Read 17 reference sequences
[13:54:09] root: DEBUG: Read 152 target sequences
[13:54:09] root: DEBUG: 147 target sequences left after indel filtering
[13:54:09] root: DEBUG: 140 target sequences left after repeat filtering
[13:54:09] root: DEBUG: Branch lengths: [1e-06, 1e-06, 1e-06, 1e-06, 1e-06, 0.5625, 21.4375, 0.0859375, 0.3828125, 0.7109375, 1e-06, 0.2, 1.1458333333333333, 0.41666666666666674, 1e-06, 0.7589285714285714, 1e-06, 1.046875, 0.7777777777777778, 2.2222222222222223, 1e-06, 22.65, 0.6590909090909091, 0.34090909090909094, 0.625, 0.375], mu = 2.66666666667
[13:54:09] root: INFO: ('fixstart_84' : 1e-06, ('fixstart_93' : 1e-06, ('fixstart_99' : 1e-06, ('assembly_ragout' : 21.4375, ('fixstart_7' : 0.3828125, ('fixstart_5' : 1e-06, ('fixstart_4' : 1.14583333333, ('fixstart_98' : 1e-06, ('fixstart_61' : 1e-06, ('fixstart_110' : 0.777777777778, ('fixstart_57' : 1e-06, ('fixstart_73' : 0.659090909091, ('fixstart_74' : 0.625, 'fixstart_72' : 0.375) : 0.340909090909) : 22.65) : 2.22222222222) : 1.046875) : 0.758928571429) : 0.416666666667) : 0.2) : 0.7109375) : 0.0859375) : 0.5625) : 1e-06) : 1e-06)
[13:54:09] root: INFO: Processing permutation files
[13:54:09] root: DEBUG: Reading permutation file
[13:54:09] root: DEBUG: "fixstart_74" synteny blocks coverage: 78.85%
[13:54:09] root: DEBUG: "fixstart_98" synteny blocks coverage: 90.43%
[13:54:09] root: DEBUG: "fixstart_99" synteny blocks coverage: 89.71%
[13:54:09] root: DEBUG: "fixstart_110" synteny blocks coverage: 90.73%
[13:54:09] root: DEBUG: "fixstart_61" synteny blocks coverage: 88.06%
[13:54:09] root: DEBUG: "fixstart_72" synteny blocks coverage: 78.86%
[13:54:09] root: DEBUG: "fixstart_73" synteny blocks coverage: 78.85%
[13:54:09] root: DEBUG: "fixstart_93" synteny blocks coverage: 88.47%
[13:54:09] root: DEBUG: "fixstart_84" synteny blocks coverage: 90.61%
[13:54:09] root: DEBUG: "assembly_ragout" synteny blocks coverage: 83.0%
[13:54:09] root: DEBUG: "fixstart_57" synteny blocks coverage: 89.24%
[13:54:09] root: DEBUG: "fixstart_4" synteny blocks coverage: 88.34%
[13:54:09] root: DEBUG: "fixstart_5" synteny blocks coverage: 90.71%
[13:54:09] root: DEBUG: "fixstart_7" synteny blocks coverage: 89.84%
[13:54:09] root: DEBUG: Read 15 reference sequences
[13:54:09] root: DEBUG: Read 71 target sequences
[13:54:09] root: DEBUG: 0 target sequences left after indel filtering
[13:54:09] root: DEBUG: 0 target sequences left after repeat filtering
[13:54:09] root: ERROR: An error occured while running Ragout:
[13:54:09] root: ERROR: No synteny blocks found in the target genome after repeat/indel filtering.

My recipe was:


.references = fixstart_99,fixstart_98,fixstart_93,fixstart_84,fixstart_74,fixstart_73,fixstart_72,fixstart_61,fixstart_5,fixstart_57,fixstart_4,fixstart_110,fixstart_7
.target = assembly_ragout

fixstart_99.fasta = /nobackup/fbsmbaw/fixstart/fixstart_99/fixstart_99.fasta
fixstart_98.fasta = /nobackup/fbsmbaw/fixstart/fixstart_98/fixstart_98.fasta
fixstart_93.fasta = /nobackup/fbsmbaw/fixstart/fixstart_93/fixstart_93.fasta
fixstart_84.fasta = /nobackup/fbsmbaw/fixstart/fixstart_84/fixstart_84.fasta
fixstart_74.fasta = /nobackup/fbsmbaw/fixstart/fixstart_74/fixstart_74.fasta
fixstart_73.fasta = /nobackup/fbsmbaw/fixstart/fixstart_73/fixstart_73.fasta
fixstart_72.fasta = /nobackup/fbsmbaw/fixstart/fixstart_72/fixstart_72.fasta
fixstart_61.fasta = /nobackup/fbsmbaw/fixstart/fixstart_61/fixstart_61.fasta
fixstart_5.fasta = /nobackup/fbsmbaw/fixstart/fixstart_5/fixstart_5.fasta
fixstart_57.fasta = /nobackup/fbsmbaw/fixstart/fixstart_57/fixstart_57.fasta
fixstart_4.fasta = /nobackup/fbsmbaw/fixstart/fixstart_4/fixstart_4.fasta
fixstart_110.fasta = /nobackup/fbsmbaw/fixstart/fixstart_110/fixstart_110.fasta
fixstart_7.fasta = /nobackup/fbsmbaw/fixstart/fixstart_7/fixstart_7.fasta

assembly_ragout.fasta = assembly_ragout.fa

.blocks = small
.naming_ref = ragout

Do you have any thoughts on what I may be doing wrong?

mikolmogorov commented 1 year ago

@mattbawn does anything from the above apply? Could you share CheckM report as described above?

mattbawn commented 1 year ago

@fenderglass


Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
fixstart_1 c__Bacilli (UID285) 586 325 181 1 320 4 0 0 0 99.86 2.21 0.00


It is a Mammaliicoccus lentus assembly (169 contigs) that I downloaded from NCBI. The references are 13 related strains selected by NCBI BLAST similarity.


Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 + Completeness Contamination Strain heterogeneity
fixstart_74 c__Bacilli (UID285) 586 325 181 1 321 3 0 0 99.45 1.66 0.00
fixstart_73 c__Bacilli (UID285) 586 325 181 1 321 3 0 0 99.45 1.66 0.00
fixstart_72 c__Bacilli (UID285) 586 325 181 1 321 3 0 0 99.45 1.66 0.00
fixstart_99 c__Bacilli (UID285) 586 324 180 1 319 4 0 0 99.44 1.94 0.00
fixstart_93 c__Bacilli (UID285) 586 324 180 1 320 3 0 0 99.44 1.39 0.00
fixstart_84 c__Bacilli (UID285) 586 324 180 1 320 3 0 0 99.44 1.39 0.00
fixstart_61 c__Bacilli (UID285) 586 324 180 1 320 2 1 0 99.44 1.94 0.00
fixstart_57 c__Bacilli (UID285) 586 324 180 1 320 3 0 0 99.44 1.39 0.00
fixstart_5 c__Bacilli (UID285) 586 324 180 1 321 2 0 0 99.44 0.83 0.00
fixstart_4 c__Bacilli (UID285) 586 324 180 1 320 2 1 0 99.44 1.94 0.00
fixstart_110 c__Bacilli (UID285) 586 324 180 1 320 3 0 0 99.44 1.39 0.00
fixstart_7 c__Bacilli (UID285) 586 324 180 2 319 3 0 0 98.89 1.39 0.00
fixstart_98 c__Bacilli (UID285) 586 324 180 3 318 3 0 0 98.33 1.39 0.00


mikolmogorov commented 1 year ago

Thanks. One possibility: do these references have unique sequence headers? If the headers are similar, Sibelia will treat them as a single long sequence.
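
A quick way to check header uniqueness across all genomes in a recipe is to collect the sequence IDs and report any that appear more than once. The sketch below assumes the ID is the first whitespace-delimited token after `>`, which is a common FASTA convention; whether Sibelia compares that token or the whole header line is not confirmed here. The helper names are hypothetical.

```python
from collections import defaultdict


def sequence_ids(fasta_lines):
    """Yield the ID (first token after '>') of every FASTA record."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def shared_ids(genomes):
    """Map each ID seen more than once to the genomes in which it occurs.

    `genomes` maps a genome name to an iterable of FASTA lines,
    for example an open file handle.
    """
    seen = defaultdict(list)
    for name, lines in genomes.items():
        for seq_id in sequence_ids(lines):
            seen[seq_id].append(name)
    return {seq_id: names for seq_id, names in seen.items() if len(names) > 1}


# Hypothetical in-memory example: both references call their chromosome "chr1".
example = {
    "ref_a": [">chr1 strain A", "ACGT"],
    "ref_b": [">chr1 strain B", "ACGT", ">plasmid1", "AC"],
}
print(shared_ids(example))
```

An empty result means every ID is unique. Any entry in the result names the genomes whose headers would need renaming before rerunning.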

If the headers are different, can you try reducing the number of references to 1 or 2? If this fails, can you share the Ragout output directory?
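
Trimming a recipe down to one or two references by hand is easy to get wrong when there are many entries, so here is a small sketch that rewrites the recipe text. It assumes the recipe layout shown in this thread (a `.references = a,b,c` line plus `name.fasta = path` lines); `reduce_references` is a hypothetical helper, not a Ragout feature.

```python
def reduce_references(recipe_text, keep):
    """Return recipe text restricted to the reference genomes in `keep`."""
    lines = recipe_text.splitlines()

    # First pass: find the declared reference names.
    references = []
    for line in lines:
        if line.strip().startswith(".references"):
            references = [r.strip() for r in line.split("=", 1)[1].split(",")]

    missing = [name for name in keep if name not in references]
    if missing:
        raise ValueError("not listed in .references: " + ", ".join(missing))
    dropped = set(references) - set(keep)

    # Second pass: rewrite the .references line and skip per-genome
    # entries (e.g. "name.fasta = path") of dropped references.
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(".references"):
            kept = [r for r in references if r in keep]
            out.append(".references = " + ",".join(kept))
            continue
        if "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            genome = key.rsplit(".", 1)[0]
            if genome in dropped:
                continue
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened version of the recipe posted above.
recipe = """.references = fixstart_99,fixstart_98,fixstart_93
.target = assembly_ragout

fixstart_99.fasta = /nobackup/fbsmbaw/fixstart/fixstart_99/fixstart_99.fasta
fixstart_98.fasta = /nobackup/fbsmbaw/fixstart/fixstart_98/fixstart_98.fasta
fixstart_93.fasta = /nobackup/fbsmbaw/fixstart/fixstart_93/fixstart_93.fasta

assembly_ragout.fasta = assembly_ragout.fa

.blocks = small
"""
print(reduce_references(recipe, ["fixstart_99", "fixstart_98"]))
```

Writing the result to a new recipe file and running Ragout in a fresh output directory avoids reusing earlier Sibelia results, which the first log in this thread warns about.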

mikolmogorov commented 1 year ago

Closed due to inactivity, feel free to reopen if you are still having this issue.