rvicedomini / strainberry

Automated strain separation of low-complexity metagenomes
MIT License

Output assemblies are shorter than input #6

Open ilnamkang opened 3 years ago

ilnamkang commented 3 years ago

Hi,

I'm trying to apply strainberry to my nanopore data obtained from the sequencing of a bacteriophage culture.

metaFlye generated a single contig of ~80 kbp from the nanopore data. But, based on read mapping and the SPAdes assembly of short reads (and hybrid), I'm sure that multiple highly similar strains exist in the culture. That's why I'm trying to apply strainberry to this data.

Strainberry seemed to work fine, but the seven sequences in assembly.scaffolds.fa are shorter than the input sequence, ranging from 667 bp to 65,084 bp. I attach a graphic summary of Blastn between the input (query) and the strainberry output sequences.

[Attached image "strainberry": graphic summary of the Blastn alignment]
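For readers who want to put numbers on such a comparison, here is a minimal sketch, not part of Strainberry. It assumes the Blastn result was saved in the standard 12-column tabular format (`-outfmt 6`), with the input contig as query and the Strainberry scaffolds as subjects. The function names `merge_intervals` and `query_coverage_by_subject` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def query_coverage_by_subject(rows: Iterable[str]) -> Dict[str, int]:
    """Number of query bases covered by each subject sequence.

    Assumes the default BLAST tabular column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits: Dict[str, List[Interval]] = defaultdict(list)
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        sseqid = fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        hits[sseqid].append((min(qstart, qend), max(qstart, qend)))
    return {
        subject: sum(end - start + 1 for start, end in merge_intervals(ivs))
        for subject, ivs in hits.items()
    }
```

Passing the lines of the tabular report to `query_coverage_by_subject` gives, for each output scaffold, how many bases of the 80 kbp input contig it aligns to, which makes it easier to see which regions are represented once, twice, or not at all.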

Below is the log of strainberry.

[2021-06-15 22:19:59] Starting Strainberry v1.1-a90d3b3
[2021-06-15 22:19:59] ### performing 2-strain separation
[2021-06-15 22:19:59] SNP calling and phasing
[2021-06-15 22:34:57] average Hamming ratio improved to 0.1926
[2021-06-15 22:34:57] separating reads
[2021-06-15 22:41:00] assembling strain haplotypes
[2021-06-16 06:10:36] scaffolding
[2021-06-16 06:10:59] mapping reads to strain-separated scaffolds
[2021-06-16 06:11:16] ### performing 3-strain separation
[2021-06-16 06:11:16] SNP calling and phasing
[2021-06-16 06:18:34] average Hamming ratio did not improve enough: 0.2098
[2021-06-16 06:18:34] Output strain-separated assembly available at: contig3.Sberry/assembly.scaffolds.fa
[2021-06-16 06:18:34] Strainberry finished successfully

# Assembly statistics:

- Input assembly [contig3.fasta] Sequences: 1 Length: 80321 bp N50: 80321 bp

- Final iteration (2-strain separation) [contig3.Sberry/assembly.scaffolds.fa] Sequences: 7 Length: 151290 bp N50: 30447 bp
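The statistics above (sequence count, total length, N50) can be recomputed independently from any FASTA file. The following is a minimal sketch, not Strainberry's own code; the helper names `read_fasta_lengths`, `n50`, and `summarize` are hypothetical, and N50 is taken in its usual sense: the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of all bases.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted text lines."""
    lengths: List[int] = []
    current: Optional[int] = None  # None until the first header is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Bundle the three figures reported in the log."""
    return {"sequences": len(lengths), "length": sum(lengths), "n50": n50(lengths)}
```

Opening `contig3.Sberry/assembly.scaffolds.fa` and passing the file handle to `read_fasta_lengths` also yields the individual scaffold lengths, which is useful for spotting very short fragments such as the 667 bp sequence.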

I'm wondering whether this result is usual or not. If not, how would I be able to handle the problem?

Thanks.

Ilnam

rvicedomini commented 3 years ago

Dear Ilnam,

First of all thanks for trying Strainberry with your data.

From the log you provided I don't see anything unusual. Strainberry seems able to identify two strains. This would mean that either only two strains were collapsed by metaFlye in your input contig, or there are more strains with very high sequence identity that Strainberry was not able to separate. It is also possible to obtain a fragmented assembly as output and, due to the higher error rate and the simple scaffolding procedure, I expect this to happen more frequently with Nanopore data than with PacBio data.
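One rough way to read the numbers in the log is to compare the total output length with an idealized expectation. The sketch below rests on a labeled assumption that does not come from the Strainberry documentation: if each of k strains were recovered at full length, the output would total about k times the input length. The function name `separation_ratio` is hypothetical.

```python
def separation_ratio(input_length: int, output_length: int, strains: int) -> float:
    """Fraction of the idealized total (strains * input_length) present in the output.

    Assumption for illustration only: a complete separation yields one
    full-length copy of the input contig per strain.
    """
    if input_length <= 0 or strains <= 0:
        raise ValueError("input_length and strains must be positive")
    return output_length / (strains * input_length)


# Values taken from the log in this thread: 80321 bp in, 151290 bp out, 2 strains.
ratio = separation_ratio(80321, 151290, 2)
print(f"{ratio:.2f}")
```

Under that assumption the seven scaffolds together account for roughly 94% of two full-length copies, which fits the picture of two separated strains whose sequences are fragmented rather than missing.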

I would also like to mention that you are testing Strainberry with a bacteriophage culture, which is not the use case of our tool (which was tested only with bacteria). In any case I don't think I can tell you more than this without having a look at the input data (the fasta and bam files you used).

If you can let me have access to such data I could have a deeper look to assess if there are some anomalies/bugs related to the tool. In this case, you can directly contact me at riccardo.vicedomini [at] pasteur.fr

To have a better understanding of what is happening, I would also suggest:

If you are willing to share any of the above, you can contact me at the same e-mail address mentioned above :) Except for the vcf file (which contains SNV nucleotides/positions), none of the other files contains any nucleotide sequence.

Best, Riccardo

ilnamkang commented 3 years ago

Dear Vicedomini,

Thank you for your kind reply.

Unfortunately, it would not be possible to share the input fasta and bam files, because these data are not my own.

Following your suggestion, I applied Strainberry to the assembly polished by Medaka and Pilon. (The overall procedures: Guppy basecalling using super-accurate model -> metaFlye -> one round of Medaka -> one round of Pilon -> Strainberry)

The result is a little different from the previous run, but the output scaffolds are still shorter than the input sequence.

I'll send you, via e-mail, a tabular version of the Blast result and the Strainberry files you specified in your reply.

Thanks.

Ilnam

gitraffica commented 1 year ago

Did you solve this problem? I also haven't seen any improvement, even after retrying more than 1000 times over a month.