Genome size greater and high fragmentation in genome assembly

LeoVincenzi commented 1 year ago

Dear authors, i refer to issue #135. I have tried to perform the assembly with verrko using a 50x PacBi HiFI sequencing data (> 10kb) and all the ONT reads I got (39x covergae) as you suggested. Unfortunately, I wasn't able to process PacBio data with DeepConsensus since the raw sequencing data weren't available.
The genome I obtained is much bigger than my expected size (600 bp) genome and it's really fragmented (table below).

Total assembly size (bp) | 1,551,045,287 -- | -- Num. Contigs | 13,960 Contigs average length (bp) | 111,106 N50 (bp) | 276,057 N90 (bp) | 39,510 Longest contig (bp) | 23,543,748

For the genome size, the plant is polyploid, so I expect that a purging processing would help in collapse the haplotypes. But I don't know what to think about the high fragmentation.

The command I run was:
/opt/verkko/bin/verkko -d verkko_assembly --hifi Pacbio.hifi_reads_10kb.fasta --nano all_pass.fastq.gz

Is there maybe some parameter that should I consider to modify?
I just read about `--screen` option to remove mito and chloroplast genome as in #137 and I will apply it, but since the total dimension of both organelles is not bigger than 400 kb, I don't think it will change to much, particularly for the fragmentation. Thanks again,
Leonardo

Ps. no error results in the output.

skoren commented 1 year ago

Verkko only outputs completely phased assemblies, based on the HiFi but mostly ONT data since it is much longer. Do you know how high the diversity between the haplotypes is for this genome? If it is relatively similar or has large homozygous stretches, then the output of the contigs will be limited by the inability to phase across those long regions. I expect this is the reason for your assembly stats. You can confirm this by loading the noseq.gfa output file with Bandage and seeing the structure (you can also post a screenshot or the noseq.gfa file here). We typically use Hi-C or trio information to increase the phasing blocks and separate the haplotypes.

LeoVincenzi commented 1 year ago

Hi @skoren, about the diversity between the haplotypes we do not have information about it. Anyway, we have tried to observe the structure on Bandage: we have observed that the graph tend to converge in a small contig (7 kb) as shown in fig1 and more detailed in fig2. Probably, we are in presence of a centromeric or telomeric contig.

Fig.1 Fig2

Moreover, I report also two examples (fig3 and fig4) of how the long regions are represented multiple times, fragmenting the assembly: how would you interpret these situation? Fig3 Fig4

Would you suggest to modify the way that we run the pipeline or do you think that perform purging (maybe with purge_haplotigs) would be the best way to improve the assembly?

skoren commented 1 year ago

Those pictures look mostly diploid, though not always. So, they look like larger unphased regions that aren't spanned and break the traversal. For the multi-way regions, those could be cases where more than two haplotypes are similar enough to be merged in the graph.

How much ONT coverage > 100kb did you have for this assembly? The best way to improve the assembly would likely be more ONT coverage to resolve these homozygous regions and/or adding something like Hi-C for phasing. Without that, I'd say verkko isn't well-designed for your use case since it cannot output pseudo-haplotypes since even with purge_dups, you wouldn't be increasing the continuity of the assembly

LeoVincenzi commented 1 year ago

We have 3X of ONT reads > 100kb. For this reason we employed the whole dataset encompassing 38X of ONT reads (N50 44 Kbp) but probably, as you said, this is not enought to resolve homozygous regions. If i understood correcly, verrko breaks the path traversal (and thus the contigs) each time it is not able to phase haplotypes, is that correct? Many thanks for your suggestion and insights, it was much appreciated

pxxiao-hz commented 1 year ago

Hi, authors. How does the "--screen" option work? Does it work on raw reads, or on assembled genomes? I have used verkko to run the hifi + ont, and used blastn to align the plastid genome to assembly.fasta. I found that many contigs contain plastid genome sequences with different lengths.

skoren commented 1 year ago

The screen option works on the assembly. Typically circular sequences will end up with varying length due to differences in the overlap around the circle. Screen handles this for you and circularizes the sequences.

pxxiao-hz commented 1 year ago

Does it only affect ”7-consensus“？I want to re-run the verkko with this option.

skoren commented 1 year ago

Yes, you might also need to remove the final outputs (assembly.*) to make snakemake do the right thing. You can test is by adding the --snakeopts --dry-run option to verkko to see what it will run.

pxxiao-hz commented 1 year ago

Ok, thanks. I will try.

pxxiao-hz commented 1 year ago

Unfortunately, neither worked. First run, I didn't add --snakeopts --dry-run, the result was same as before. Second run, with the --snakeopts --dry-run, it didn't run. This was a dry-run (flag -n). The order of jobs does not reflect the order of execution. The run involves checkpoint jobs, which will result in alteration of the DAG of jobs (e.g. adding more jobs) after their completion.

skoren commented 1 year ago

A dry run is just that, it's not supposed to run, it lists the steps it will run. I suggested using it to confirm all it will re-run is the consensus step before actually running the command w/the filtering step to make sure it won't unnecessarily redo any compute.

pxxiao-hz commented 1 year ago

Thanks, I understand now! These jobs will be run:

Launching bioconda verkko bioconda 1.3.1`
Using snakemake 7.25.0.
Building DAG of jobs...
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
buildPackages           1              8              8
combineConsensus        1              1              1
extractONT              1              8              8
verkko                  1              1              1
total                   4              1              8
...

I found no steps to use the mito sequences.

skoren commented 1 year ago

The step which performs filtering is combineConsensus so that looks correct.

pxxiao-hz commented 1 year ago

In this case, I suspect that the mitochondrial genome was too short and the contig sequence was too long to filter. I will continue to study other functions of verkko, thanks.

skoren commented 1 year ago

I'm confused, did you actually run without the dry-run flag and with the screen options to see what was output? What does it report in the *exemplar.fasta files?

pxxiao-hz commented 1 year ago

Yes, I got the same result as before, then I delet it, and run with --dry-run. Maybe I will run it again. I will report my results later.

skoren commented 1 year ago

It should produce additional files in the 7-consensus folder when run with screen, including a listing of all the hits and the files listing matches in fasta. Here's an example from human:

assembly.ebv.exemplar.fasta
assembly.ebv.fasta
assembly.ebv.ids
assembly.ebv.mashmap.err
assembly.ebv.mashmap.out
assembly.mito.exemplar.fasta
assembly.mito.fasta
assembly.mito.ids
assembly.mito.mashmap.err
assembly.mito.mashmap.out
assembly.rdna.exemplar.fasta
assembly.rdna.fasta
assembly.rdna.ids
assembly.rdna.mashmap.err
assembly.rdna.mashmap.out

if you don't have a set of files like that w/your input mito (it uses whatever name you give screen) it must have not been run correctly. If the matches are too small a fraction of a contig it is also possible those are real NUMT and not mito.

pxxiao-hz commented 1 year ago

Sorry, I only checked the assembly.fasta file. Looking forward to new run.

pxxiao-hz commented 1 year ago

These files were under 7-consensus:

assembly.disconnected.fasta   combineConsensus.err    packages.finished
assembly.disconnected.ids     combineConsensus.out    packages.readName_to_ID.map
assembly.fasta                combineConsensus.sh     packages.report
assembly.ids                  combined.fasta          packages.tigName_to_ID.map
assembly.mito.exemplar.fasta  combined.fasta.lengths  screen-assembly.err
assembly.mito.fasta           extractONT.err          screen-assembly.out
assembly.mito.ids             extractONT.sh           unitig-popped.fasta
assembly.mito.mashmap.err     ont_subset.extract      unitig-popped.haplotype1.fasta
assembly.mito.mashmap.out     ont_subset.fasta.gz     unitig-popped.haplotype2.fasta
buildPackages.err             ont_subset.id           unitig-popped.unassigned.fasta
buildPackages.sh              packages

But assembly.mito.exemplar.fasta was empty. Below is screen-assembly.out file. screen-assembly.txt

skoren commented 1 year ago

Yeah, this looks like there are no hits to the mito you supplied to verkko in the assembly. The mashmap logs (mashmap.err and mashmap.out) files will have more info on the run and why the sequences were not considered contaminants.

pxxiao-hz commented 1 year ago

Sorry for the late reply. ./7-consensus/assembly.mito.mashmap.err:

[mashmap] Reference = [combined.fasta]
[mashmap] Query = [/home/pxxiao/project/07_potato/01_data/00_other-genome/03_potato-plastid/potato.plastid.fasta]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 20
[mashmap] Segment length = 10000 (read split allowed)
[mashmap] Block length min = 10000
[mashmap] Chaining gap max = 10000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 95%
[mashmap] Do not skip self mappings
[mashmap] No hypergeometric filter
[mashmap] Mapping output file = assembly.mito.mashmap.out
[mashmap] Filter mode = 3 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 1
[mashmap::skch::Sketch::build] minmer windows picked from reference = 4104282
[mashmap::skch::Sketch::index] unique minmers = 1093877
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 437352) ... (11516, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3102 times during lookup.
[mashmap::map] time spent computing the reference index: 224.878 sec
[mashmap::skch::Map::mapQuery] WARNING, no .fai index found for /home/pxxiao/project/07_potato/01_data/00_other-genome/03_potato-plastid/potato.plastid.fasta, reading the file to sum sequence length (slow)
^M[mashmap::skch::Map::mapQuery] mapped 12.34% @ 6.37e+07 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00^M[mashmap::skch::Map::mapQuery] mapped 12.34% @ 5.92e+07 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00^M[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.26e+06 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 0, reads qualified for mapping = 4, total input reads = 4, total input bp = 630009
[mashmap::map] time spent mapping the query: 5.24e-01 sec
[mashmap::map] mapping results saved in: assembly.mito.mashmap.out

And ./7-consensus/assembly.mito.mashmap.out was empty.

skoren commented 1 year ago

Original issue was answered.

As for screen, it seems the provided plastid genome was too far diverged to be recruited from the assembly. By default, verkko requires at least 98% identity to the assembly and it seems the identity here was too low. You could try mapping manually to see what the identity of the hits are.

marbl / verkko

Genome size greater and high fragmentation in genome assembly #138