Closed LeoVincenzi closed 1 year ago
Verkko only outputs completely phased assemblies, based on the HiFi data but mostly the ONT data since it is much longer. Do you know how high the diversity between the haplotypes is for this genome? If it is relatively similar or has large homozygous stretches, then the contig output will be limited by the inability to phase across those long regions. I expect this is the reason for your assembly stats. You can confirm this by loading the noseq.gfa output file with Bandage and looking at the structure (you can also post a screenshot or the noseq.gfa file here). We typically use Hi-C or trio information to increase the phasing blocks and separate the haplotypes.
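As an aside, for readers without a GUI, the structure of the noseq.gfa can also be summarized from the command line before opening Bandage. This is a minimal sketch, assuming GFA 1 conventions where a noseq graph stores `*` for the sequence and carries the length in an `LN:i:` tag; the node names and lengths below are made up:

```python
# Minimal sketch: summarize a noseq GFA (GFA 1) without Bandage.
# S lines are segments (nodes), L lines are links (edges); in a noseq
# graph the sequence field is "*" and the length lives in an LN:i: tag.
def summarize_gfa(lines):
    n_segments = n_links = total_len = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            n_segments += 1
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    total_len += int(tag[5:])
        elif fields[0] == "L":
            n_links += 1
    return n_segments, n_links, total_len

# Toy graph: two nodes joined by one link (names are hypothetical).
example = [
    "S\tutig1-1\t*\tLN:i:7000",
    "S\tutig1-2\t*\tLN:i:150000",
    "L\tutig1-1\t+\tutig1-2\t+\t0M",
]
print(summarize_gfa(example))  # (2, 1, 157000)
```

A high link-to-segment ratio is a quick hint that the graph is tangled rather than a set of clean linear paths.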
Hi @skoren, we don't have information about the diversity between the haplotypes. Anyway, we have tried to inspect the structure in Bandage: we observed that the graph tends to converge into a small contig (7 kb), as shown in fig1 and in more detail in fig2. We are probably looking at a centromeric or telomeric contig.
Fig.1 Fig2
Moreover, I also report two examples (fig3 and fig4) of how long regions are represented multiple times, fragmenting the assembly: how would you interpret these situations? Fig3 Fig4
Would you suggest modifying the way we run the pipeline, or do you think that purging (maybe with purge_haplotigs) would be the best way to improve the assembly?
Those pictures look mostly diploid, though not always. So, they look like larger unphased regions that aren't spanned and break the traversal. For the multi-way regions, those could be cases where more than two haplotypes are similar enough to be merged in the graph.
How much ONT coverage > 100 kb did you have for this assembly? The best way to improve the assembly would likely be more ONT coverage to resolve these homozygous regions and/or adding something like Hi-C for phasing. Without that, I'd say verkko isn't well suited to your use case since it cannot output pseudo-haplotypes; even with purge_dups, you wouldn't be increasing the continuity of the assembly.
We have 3x of ONT reads > 100 kb. For this reason we employed the whole dataset, encompassing 38x of ONT reads (N50 44 kbp), but probably, as you said, this is not enough to resolve homozygous regions. If I understood correctly, verkko breaks the path traversal (and thus the contigs) each time it is not able to phase the haplotypes, is that correct? Many thanks for your suggestions and insights, much appreciated.
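The coverage figures quoted above are just total read bases (optionally restricted to reads over some length cutoff) divided by genome size. A toy sketch of that arithmetic, with made-up read lengths and a deliberately tiny genome size so the numbers stay readable:

```python
# Sketch: coverage contributed by reads above a length cutoff.
# The read lengths and genome size here are illustrative, not the real data.
def coverage(read_lengths, genome_size, min_len=0):
    return sum(l for l in read_lengths if l >= min_len) / genome_size

reads = [150_000, 120_000, 80_000, 40_000, 40_000]  # hypothetical ONT reads
g = 600_000  # toy genome size

print(f"total: {coverage(reads, g):.2f}x, >100kb: {coverage(reads, g, 100_000):.2f}x")
# total: 0.72x, >100kb: 0.45x
```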
Hi, authors. How does the "--screen" option work? Does it work on raw reads, or on assembled genomes? I have used verkko to run the hifi + ont, and used blastn to align the plastid genome to assembly.fasta. I found that many contigs contain plastid genome sequences with different lengths.
The screen option works on the assembly. Typically circular sequences will end up with varying length due to differences in the overlap around the circle. Screen handles this for you and circularizes the sequences.
Does it only affect "7-consensus"? I want to re-run verkko with this option.
Yes, you might also need to remove the final outputs (assembly.*) to make snakemake do the right thing. You can test this by adding the --snakeopts --dry-run option to verkko to see what it will run.
Ok, thanks. I will try.
Unfortunately, neither worked.
First run, without --snakeopts --dry-run, the result was the same as before.
Second run, with --snakeopts --dry-run, it didn't run:
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The run involves checkpoint jobs, which will result in alteration of the DAG of jobs (e.g. adding more jobs) after their completion.
A dry run is just that, it's not supposed to run, it lists the steps it will run. I suggested using it to confirm all it will re-run is the consensus step before actually running the command w/the filtering step to make sure it won't unnecessarily redo any compute.
Thanks, I understand now! These jobs will be run:
Launching verkko bioconda 1.3.1
Using snakemake 7.25.0.
Building DAG of jobs...
Job stats:
job count min threads max threads
---------------- ------- ------------- -------------
buildPackages 1 8 8
combineConsensus 1 1 1
extractONT 1 8 8
verkko 1 1 1
total 4 1 8
...
I found no steps that use the mito sequences.
The step which performs filtering is combineConsensus so that looks correct.
In this case, I suspect that the mitochondrial genome was too short and the contig sequence was too long to filter. I will continue to study other functions of verkko, thanks.
I'm confused, did you actually run without the dry-run flag and with the screen options to see what was output? What does it report in the *exemplar.fasta files?
Yes, I got the same result as before, then I deleted it and ran with --dry-run.
Maybe I will run it again. I will report my results later.
It should produce additional files in the 7-consensus folder when run with screen, including a listing of all the hits and the files listing matches in fasta. Here's an example from human:
assembly.ebv.exemplar.fasta
assembly.ebv.fasta
assembly.ebv.ids
assembly.ebv.mashmap.err
assembly.ebv.mashmap.out
assembly.mito.exemplar.fasta
assembly.mito.fasta
assembly.mito.ids
assembly.mito.mashmap.err
assembly.mito.mashmap.out
assembly.rdna.exemplar.fasta
assembly.rdna.fasta
assembly.rdna.ids
assembly.rdna.mashmap.err
assembly.rdna.mashmap.out
If you don't have a set of files like that with your input mito (it uses whatever name you give screen), it must not have been run correctly. If the matches are too small a fraction of a contig, it is also possible those are real NUMTs and not mito.
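One way to act on that last point is to compute what fraction of each assembly contig the organelle hits cover: a contig that is mostly covered is plausibly the organelle itself, while a short isolated hit on a long contig looks like a NUMT. A hypothetical sketch that parses mashmap-style output; the column order (query, qlen, qstart, qend, strand, ref, rlen, rstart, rend, identity) is an assumption to verify against your mashmap version, and the hit lines are fabricated:

```python
# Sketch: fraction of each reference contig covered by organelle hits,
# merging overlapping intervals. Column layout is an assumption (mashmap
# default: query qlen qstart qend strand ref rlen rstart rend identity).
def ref_coverage_fraction(mashmap_lines):
    spans = {}
    for line in mashmap_lines:
        f = line.split()
        ref, rlen = f[5], int(f[6])
        spans.setdefault((ref, rlen), []).append((int(f[7]), int(f[8])))
    result = {}
    for (ref, rlen), ivals in spans.items():
        ivals.sort()
        covered, end = 0, 0
        for s, e in ivals:
            s = max(s, end)          # skip the already-counted overlap
            if e > s:
                covered += e - s
                end = e
        result[ref] = covered / rlen
    return result

# Fabricated hits: contig_12 is half organelle; contig_987 has a 3 kb
# hit on a 3 Mb contig, which looks like a NUMT insertion.
hits = [
    "plastid 160000 0 10000 + contig_12 20000 5000 15000 99.1",
    "plastid 160000 0 3000 + contig_987 3000000 100000 103000 98.5",
]
print(ref_coverage_fraction(hits))
```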
Sorry, I only checked the assembly.fasta file. Looking forward to the new run.
These files were under 7-consensus:
assembly.disconnected.fasta combineConsensus.err packages.finished
assembly.disconnected.ids combineConsensus.out packages.readName_to_ID.map
assembly.fasta combineConsensus.sh packages.report
assembly.ids combined.fasta packages.tigName_to_ID.map
assembly.mito.exemplar.fasta combined.fasta.lengths screen-assembly.err
assembly.mito.fasta extractONT.err screen-assembly.out
assembly.mito.ids extractONT.sh unitig-popped.fasta
assembly.mito.mashmap.err ont_subset.extract unitig-popped.haplotype1.fasta
assembly.mito.mashmap.out ont_subset.fasta.gz unitig-popped.haplotype2.fasta
buildPackages.err ont_subset.id unitig-popped.unassigned.fasta
buildPackages.sh packages
But assembly.mito.exemplar.fasta was empty.
Below is the screen-assembly.out file:
screen-assembly.txt
Yeah, this looks like there are no hits to the mito you supplied to verkko in the assembly. The mashmap logs (mashmap.err and mashmap.out) files will have more info on the run and why the sequences were not considered contaminants.
Sorry for the late reply.
./7-consensus/assembly.mito.mashmap.err:
[mashmap] Reference = [combined.fasta]
[mashmap] Query = [/home/pxxiao/project/07_potato/01_data/00_other-genome/03_potato-plastid/potato.plastid.fasta]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 20
[mashmap] Segment length = 10000 (read split allowed)
[mashmap] Block length min = 10000
[mashmap] Chaining gap max = 10000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 95%
[mashmap] Do not skip self mappings
[mashmap] No hypergeometric filter
[mashmap] Mapping output file = assembly.mito.mashmap.out
[mashmap] Filter mode = 3 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads = 1
[mashmap::skch::Sketch::build] minmer windows picked from reference = 4104282
[mashmap::skch::Sketch::index] unique minmers = 1093877
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 437352) ... (11516, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3102 times during lookup.
[mashmap::map] time spent computing the reference index: 224.878 sec
[mashmap::skch::Map::mapQuery] WARNING, no .fai index found for /home/pxxiao/project/07_potato/01_data/00_other-genome/03_potato-plastid/potato.plastid.fasta, reading the file to sum sequence length (slow)
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.26e+06 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 0, reads qualified for mapping = 4, total input reads = 4, total input bp = 630009
[mashmap::map] time spent mapping the query: 5.24e-01 sec
[mashmap::map] mapping results saved in: assembly.mito.mashmap.out
And ./7-consensus/assembly.mito.mashmap.out was empty.
Original issue was answered.
As for screen, it seems the provided plastid genome was too far diverged to be recruited from the assembly. By default, verkko requires at least 98% identity to the assembly and it seems the identity here was too low. You could try mapping manually to see what the identity of the hits are.
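For that manual check, one option is to re-run mashmap with a lower identity threshold and then filter its output by the reported identity to see how close the hits actually get to the cutoff. A sketch of the filtering step, assuming identity is the last whitespace-separated field (as in mashmap 2.x percent-identity output; verify for your version) and using fabricated hit lines:

```python
# Sketch: keep only mashmap hits at or above an identity cutoff.
# Assumes identity is the last whitespace-separated field of each line.
def hits_above(mashmap_lines, min_identity):
    return [l for l in mashmap_lines if float(l.split()[-1]) >= min_identity]

# Fabricated hits: one at 96.2% identity, one at 98.7%.
lines = [
    "plastid 160000 0 10000 + contig_1 500000 0 10000 96.2",
    "plastid 160000 0 10000 + contig_2 500000 0 10000 98.7",
]
print(len(hits_above(lines, 98.0)))  # 1: only the 98.7% hit survives
```

If everything falls in the 95-98% band, that would explain hits appearing in a permissive blastn search but not passing verkko's screen.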
Dear authors, I refer to issue #135. I have tried to perform the assembly with verkko using 50x PacBio HiFi sequencing data (> 10 kb) and all the ONT reads I got (39x coverage), as you suggested. Unfortunately, I wasn't able to process the PacBio data with DeepConsensus since the raw sequencing data weren't available.
The genome I obtained is much bigger than my expected genome size (600 Mb) and it's really fragmented (table below).

| Metric | Value |
| -- | -- |
| Total assembly size (bp) | 1,551,045,287 |
| Num. contigs | 13,960 |
| Contigs average length (bp) | 111,106 |
| N50 (bp) | 276,057 |
| N90 (bp) | 39,510 |
| Longest contig (bp) | 23,543,748 |
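For reference, N50 and N90 are the contig lengths at which the sorted contigs accumulate 50% and 90% of the total assembly size. A toy sketch of the computation, with illustrative contig lengths rather than the real assembly:

```python
# Sketch: Nx statistic from a list of contig lengths (toy values).
# Nx is the length of the contig at which the cumulative sum of
# descending-sorted lengths first reaches x% of the total.
def nx(lengths, x):
    lengths = sorted(lengths, reverse=True)
    target = sum(lengths) * x / 100
    running = 0
    for l in lengths:
        running += l
        if running >= target:
            return l

contigs = [500, 400, 300, 200, 100]  # toy lengths, total 1500
print(nx(contigs, 50), nx(contigs, 90))  # 400 200
```

A large gap between N50 and N90, as in the table above, is typical of an assembly with a long tail of short fragments.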
For the genome size: the plant is polyploid, so I expect that purging would help collapse the haplotypes. But I don't know what to think about the high fragmentation.
The command I run was:
/opt/verkko/bin/verkko -d verkko_assembly --hifi Pacbio.hifi_reads_10kb.fasta --nano all_pass.fastq.gz
Is there maybe some parameter I should consider modifying?
I just read about the `--screen` option to remove the mito and chloroplast genomes, as in #137, and I will apply it, but since the total size of both organelles is not bigger than 400 kb, I don't think it will change much, particularly for the fragmentation. Thanks again,
Leonardo
Ps. no error results in the output.