paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/
Other
66 stars 9 forks source link

Telomere assembly #8

Closed paoloshasta closed 1 year ago

paoloshasta commented 1 year ago

@rob234king posted the following in #1

Another question on different data set: FYI: I'm only testing on a very small subset of reads at telomeres.

I have a telomere containing subset of reads and appears the default config Nanopore-R10-Slow-Nov2022 with 0.11.1 works well and I don't see errors in the assembly but ~101bp of the 3 prime end is not assembled and only part of the telomere is assembled. So two part question. Is there a parameter to get it to assemble to the length of the reads as coverage is good 101 bp to 3' end of sequence pileup? And to get more repeating sequence to assemble on the 5' end is there a parameter. As I know telomere continues at 5' I could just right something to extend it to that length but good if option to assemble more of it.

paoloshasta commented 1 year ago

Shasta haploid assembly prunes dangling ends of the marker graph because they are usually unreliable due to low coverage. This is controlled by command line option --MarkerGraph.pruneIterationCount. This has default value 6 markers, and is not modified by any of the built-in assembly configurations. If you are using one of the R10 assembly configurations, which don't use Run-Length-Encoding (RLE), one marker corresponds to 10 bases on average, and so this means that on average 60 bases would be pruned at each telomere. However this number is subject to high variability (more on this below).

You could try and reduce --MarkerGraph.pruneIterationCount to 0. This would eliminate the pruning, but it could also have the undesired effect of introducing unreliable sequence at the end of some assembled segments elsewhere in the assembly. This might not be too much of a problem with R10 reads.

The amount of telomeric sequence lost can be highly variable due to the low complexity of telomeric sequence. In an extreme situation, if you are out of luck and telomeric sequence contains no k-mers which have been selected to be used as markers, you could lose the entire telomere, plus a bit of flanking sequence. To reduce this effect you could increase marker density, for example from the typical 0.1 to 0.15 or 0.20. Marker density is controlled by command line option --Kmers.probability. However, increasing marker density can significantly increase the compute time and amount of memory required by the assembly.

No matter what you do here, you would lose all sequence before (or after) the first (or last) marker present in the telomere. This would be 5 bases on average (with --Kmers.probability set to the typical 0.1) but again is subject to high variability. The only way to completely eliminate this effect is to set --Kmers.probability to 1, but that would make assembly cost prohibitive, unless used for a very small assembly (e. g. by separating out telomeric reads).

Another problem you may encounter is the low coverage that will probably occur at the end of the telomere, although from your post you seem confident that this is not a problem. If after eliminating the pruning step you still get less sequence than you would like, we can talk about ways to reduce or eliminate this effect to optimize the assembly for lower coverage. But that could take a bit of experimentation.

rob234king commented 1 year ago

Thanks for that explanation. It worked very well. It now gives the desired effect by altering those parameters, its only a subset of telomere reads that I am playing with so can afford the speed adjustment although completes in just a few seconds. The accuracy looks good too.

rob234king commented 1 year ago

Trying the same thing on another data set and I get no output. Any ideas, what else I'm missing? worked well for another but more coverage this time. --threads 20 --Reads.minReadLength 4000 --input chr3_test1.fasta --assemblyDirectory shasta_08J --config Nanopore-R10-Slow-Nov2022 --MarkerGraph.pruneIterationCount 1 --Kmers.probability 0.80

For options in use for this assembly, see shasta.conf in the assembly directory. This run uses options "--memoryBacking 4K --memoryMode anonymous". This could result in performance degradation. For full performance, use "--memoryBacking 2M --memoryMode filesystem" (root privilege via sudo required). Therefore the results of this run should not be used for benchmarking purposes. This assembly will use 20 threads. Setting up consensus caller Modal Discarded read statistics for file /mmfs1/groups/apps/active/rking/scripts/chr3_PATERNAL_reverse.bam.fasta: Discarded 0 reads containing invalid bases for a total 0 valid bases. Discarded 9 reads shorter than 6000 bases for a total 43211 bases. Discarded 0 reads containing repeat counts 256 or more for a total 0 bases. Discarded read statistics for all input files: Discarded 0 reads containing invalid bases for a total 0 valid bases. Discarded 9 short reads for a total 43211 bases. Discarded 0 reads containing repeat counts 256 or more for a total 0 bases. Read statistics for reads that will be used in this assembly: Total number of reads is 422. Total number of raw bases is 4711327. Average read length is 11164.3 bases. N50 for read length is 11377 bases. Total number of k-mers of length 14 is 268435456 Of those, 187900606 will be used as markers. Fraction of k-mers used as markers: requested 0.7, actual 0.699984. Flagged 0 reads as palindromic out of 422 total. Palindromic fraction is 0 LowHash0 algorithm will use 2^22 = 4194304 buckets. Alignment candidates after lowhash iteration 0: high frequency 5, total 276, capacity 276. Alignment candidates after lowhash iteration 1: high frequency 15, total 3217, capacity 3218. Alignment candidates after lowhash iteration 2: high frequency 69, total 3604, capacity 4207. Alignment candidates after lowhash iteration 3: high frequency 178, total 3824, capacity 4570. Alignment candidates after lowhash iteration 4: high frequency 188, total 3824, capacity 4570. Alignment candidates after lowhash iteration 5: high frequency 213, total 4040, capacity 4846. Alignment candidates after lowhash iteration 6: high frequency 699, total 4529, capacity 5416. Alignment candidates after lowhash iteration 7: high frequency 706, total 4548, capacity 5571. Alignment candidates after lowhash iteration 8: high frequency 1040, total 5400, capacity 6422. Alignment candidates after lowhash iteration 9: high frequency 1063, total 5470, capacity 7051. Alignment candidates after lowhash iteration 10: high frequency 1066, total 5685, capacity 7343. Alignment candidates after lowhash iteration 11: high frequency 1069, total 5878, capacity 7631. Alignment candidates after lowhash iteration 12: high frequency 1081, total 5913, capacity 8003. Alignment candidates after lowhash iteration 13: high frequency 1119, total 6140, capacity 8394. Alignment candidates after lowhash iteration 14: high frequency 1384, total 6140, capacity 8394. Alignment candidates after lowhash iteration 15: high frequency 1448, total 6179, capacity 8501. Alignment candidates after lowhash iteration 16: high frequency 1501, total 6333, capacity 8621. Alignment candidates after lowhash iteration 17: high frequency 1501, total 6333, capacity 8621. Alignment candidates after lowhash iteration 18: high frequency 2614, total 6362, capacity 8631. Alignment candidates after lowhash iteration 19: high frequency 2679, total 6362, capacity 8631. Alignment candidates after lowhash iteration 20: high frequency 2827, total 6438, capacity 8894. Alignment candidates after lowhash iteration 21: high frequency 2965, total 6506, capacity 9078. Alignment candidates after lowhash iteration 22: high frequency 3197, total 6587, capacity 9214. Alignment candidates after lowhash iteration 23: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 24: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 25: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 26: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 27: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 28: high frequency 3201, total 6590, capacity 9226. Alignment candidates after lowhash iteration 29: high frequency 3209, total 6590, capacity 9226. Alignment candidates after lowhash iteration 30: high frequency 3314, total 6590, capacity 9226. Alignment candidates after lowhash iteration 31: high frequency 3437, total 6590, capacity 9226. Alignment candidates after lowhash iteration 32: high frequency 3451, total 6699, capacity 9345. Alignment candidates after lowhash iteration 33: high frequency 3561, total 6778, capacity 9706. Alignment candidates after lowhash iteration 34: high frequency 3562, total 6955, capacity 10054. Alignment candidates after lowhash iteration 35: high frequency 3562, total 6955, capacity 10054. Alignment candidates after lowhash iteration 36: high frequency 3592, total 6972, capacity 10088. Alignment candidates after lowhash iteration 37: high frequency 3593, total 7151, capacity 10239. Alignment candidates after lowhash iteration 38: high frequency 3593, total 7215, capacity 10489. Alignment candidates after lowhash iteration 39: high frequency 3593, total 7215, capacity 10489. Alignment candidates after lowhash iteration 40: high frequency 3593, total 7215, capacity 10489. Alignment candidates after lowhash iteration 41: high frequency 3593, total 7215, capacity 10489. Alignment candidates after lowhash iteration 42: high frequency 3593, total 7215, capacity 10489. Alignment candidates after lowhash iteration 43: high frequency 3596, total 7215, capacity 10489. Alignment candidates after lowhash iteration 44: high frequency 3600, total 7231, capacity 10489. Alignment candidates after lowhash iteration 45: high frequency 3661, total 7231, capacity 10489. Alignment candidates after lowhash iteration 46: high frequency 3817, total 7231, capacity 10489. Alignment candidates after lowhash iteration 47: high frequency 3871, total 7231, capacity 10489. Alignment candidates after lowhash iteration 48: high frequency 3871, total 7231, capacity 10489. Alignment candidates after lowhash iteration 49: high frequency 3871, total 7231, capacity 10489. Alignment candidates after lowhash iteration 50: high frequency 3882, total 7231, capacity 10489. Alignment candidates after lowhash iteration 51: high frequency 3889, total 7231, capacity 10489. Alignment candidates after lowhash iteration 52: high frequency 3889, total 7231, capacity 10489. Alignment candidates after lowhash iteration 53: high frequency 3891, total 7231, capacity 10489. Alignment candidates after lowhash iteration 54: high frequency 3891, total 7231, capacity 10489. Alignment candidates after lowhash iteration 55: high frequency 3891, total 7231, capacity 10489. Alignment candidates after lowhash iteration 56: high frequency 3891, total 7231, capacity 10489. Alignment candidates after lowhash iteration 57: high frequency 3959, total 7231, capacity 10489. Alignment candidates after lowhash iteration 58: high frequency 3966, total 7231, capacity 10489. Alignment candidates after lowhash iteration 59: high frequency 3966, total 7265, capacity 10497. Alignment candidates after lowhash iteration 60: high frequency 4148, total 7265, capacity 10497. Alignment candidates after lowhash iteration 61: high frequency 4148, total 7265, capacity 10497. Alignment candidates after lowhash iteration 62: high frequency 4165, total 7447, capacity 10651. Alignment candidates after lowhash iteration 63: high frequency 4165, total 7447, capacity 10651. Alignment candidates after lowhash iteration 64: high frequency 4165, total 7447, capacity 10651. Alignment candidates after lowhash iteration 65: high frequency 4165, total 7447, capacity 10651. Alignment candidates after lowhash iteration 66: high frequency 4245, total 7447, capacity 10651. Alignment candidates after lowhash iteration 67: high frequency 4245, total 7447, capacity 10651. Alignment candidates after lowhash iteration 68: high frequency 4251, total 7598, capacity 10760. Alignment candidates after lowhash iteration 69: high frequency 4251, total 7598, capacity 10760. Alignment candidates after lowhash iteration 70: high frequency 4382, total 7598, capacity 10760. Alignment candidates after lowhash iteration 71: high frequency 4636, total 7598, capacity 10760. Alignment candidates after lowhash iteration 72: high frequency 4648, total 7598, capacity 10760. Alignment candidates after lowhash iteration 73: high frequency 4648, total 7598, capacity 10760. Alignment candidates after lowhash iteration 74: high frequency 4680, total 7598, capacity 10760. Alignment candidates after lowhash iteration 75: high frequency 4684, total 7598, capacity 10760. Alignment candidates after lowhash iteration 76: high frequency 4684, total 7598, capacity 10760. Alignment candidates after lowhash iteration 77: high frequency 4688, total 7598, capacity 10760. Alignment candidates after lowhash iteration 78: high frequency 4688, total 7598, capacity 10760. Alignment candidates after lowhash iteration 79: high frequency 4688, total 7598, capacity 10760. Alignment candidates after lowhash iteration 80: high frequency 4826, total 7648, capacity 10784. Alignment candidates after lowhash iteration 81: high frequency 4827, total 7648, capacity 10784. Alignment candidates after lowhash iteration 82: high frequency 4827, total 7648, capacity 10784. Alignment candidates after lowhash iteration 83: high frequency 4848, total 7648, capacity 10784. Alignment candidates after lowhash iteration 84: high frequency 4862, total 7648, capacity 10784. Alignment candidates after lowhash iteration 85: high frequency 4862, total 7648, capacity 10784. Alignment candidates after lowhash iteration 86: high frequency 4863, total 7648, capacity 10784. Alignment candidates after lowhash iteration 87: high frequency 4863, total 7648, capacity 10784. Alignment candidates after lowhash iteration 88: high frequency 4863, total 7648, capacity 10784. Alignment candidates after lowhash iteration 89: high frequency 4864, total 7648, capacity 10784. Alignment candidates after lowhash iteration 90: high frequency 4864, total 7705, capacity 10829. Alignment candidates after lowhash iteration 91: high frequency 4864, total 7705, capacity 10829. Alignment candidates after lowhash iteration 92: high frequency 4932, total 7705, capacity 10829. Alignment candidates after lowhash iteration 93: high frequency 4946, total 7705, capacity 10829. Alignment candidates after lowhash iteration 94: high frequency 4946, total 7705, capacity 10829. Alignment candidates after lowhash iteration 95: high frequency 4946, total 7705, capacity 10829. Alignment candidates after lowhash iteration 96: high frequency 5144, total 7705, capacity 10829. Alignment candidates after lowhash iteration 97: high frequency 5145, total 7705, capacity 10829. Alignment candidates after lowhash iteration 98: high frequency 5145, total 7705, capacity 10829. Alignment candidates after lowhash iteration 99: high frequency 5145, total 7705, capacity 10829. Found 5145 alignment candidates. Average number of alignment candidates per oriented read is 12.1919. Number of alignment candidates before suppression is 5145 Suppressed 0 alignment candidates. Number of alignment candidates after suppression is 5145 Found and stored 0 good alignments. Keeping 0 alignments of 0 Of 844 vertices in the read graph, 0 are within distance 6 of their reverse complement. Found 0 strand jump regions. Marked 0 read graph edges out of 0 total as cross-strand. Flagged 0 reads as chimeric out of 422 total. Chimera rate is 0 The read graph has 844 connected components. Unable to automatically select MarkerGraph.minCoverage. No significant cutoff found in disjoint sets size distribution. Observed peak has percent total area of 0 minPercentArea is 0.08 See DisjointSetsHistogram.csv.Using MarkerGraph.minCoverage = 5 Kept 0 disjoint sets with coverage in the requested range. Found 0 disjoint sets with more than one marker on a single oriented read or with less than 0 supporting oriented reads on each strand. Found 0 edges for 0 vertices. The marker graph has 0 vertices and 0 edges. Flagged as weak 0 edges with coverage 1 and marker skip greater than 100 Transitive reduction removed 0 marker graph edges out of 0 total. The marker graph has 0 vertices and 0 strong edges. Pruned 0 edges at prune iteration 0. The original marker graph had 0 vertices and 0 edges. The number of surviving edges is 0. Begin simplifyMarkerGraph iteration 0 with maxLength = 10 Before iteration 0 part 1, the assembly graph has 0 vertices and 0 edges. Before iteration 0 part 2, the assembly graph has 0 vertices and 0 edges. Begin simplifyMarkerGraph iteration 1 with maxLength = 100 Before iteration 1 part 1, the assembly graph has 0 vertices and 0 edges. Before iteration 1 part 2, the assembly graph has 0 vertices and 0 edges. Begin simplifyMarkerGraph iteration 2 with maxLength = 1000 Before iteration 2 part 1, the assembly graph has 0 vertices and 0 edges. Before iteration 2 part 2, the assembly graph has 0 vertices and 0 edges. Begin simplifyMarkerGraph iteration 3 with maxLength = 10000 Before iteration 3 part 1, the assembly graph has 0 vertices and 0 edges. Before iteration 3 part 2, the assembly graph has 0 vertices and 0 edges. Begin simplifyMarkerGraph iteration 4 with maxLength = 100000 Before iteration 4 part 1, the assembly graph has 0 vertices and 0 edges. Before iteration 4 part 2, the assembly graph has 0 vertices and 0 edges. Removed 0 low coverage cross edges of the assembly graph and 0 corresponding marker graph edges. Before detangling, the assembly graph has 0 vertices and 0 edges. Found 0 tangles. The detangled assembly graph has 0 vertices. The detangled assembly graph has 0 edges. Removed 0 low coverage cross edges of the assembly graph and 0 corresponding marker graph edges. Using 20 threads. Assembly begins for 0 edges of the assembly graph. Assembled a total 0 bases for 0 assembly graph edges of which 0 were assembled. The assembly graph has 0 vertices and 0 edges of which 0 were assembled. Total length of assembled sequence is 0 This run used options "--memoryBacking 4K --memoryMode anonymous". This could have resulted in performance degradation. For full performance, use "--memoryBacking 2M --memoryMode filesystem" (root privilege via sudo required). Therefore the results of this run should not be used for benchmarking purposes. Shasta Release 0.11.1 2022-Dec-05 20:45:25.039605 Assembly ends.

paoloshasta commented 1 year ago

The code that automatically adjusts --MarkerGraph.minCoverage had trouble finding a good cutoff in the coverage distribution of marker graph vertices:

Unable to automatically select MarkerGraph.minCoverage. No significant cutoff found in disjoint sets size distribution. Observed peak has percent total area of 0

At a marker density of 80%, 8 times higher than in the unmodified Nanopore-R10-Slow-Nov2022, you will also need to increase by the same factor the alignment thresholds that are expressed in markers, in particular:

(Sorry that I forgot to mention this in my previous post).

Also, Nanopore-R10-Slow-Nov2022 is optimized for relatively low coverage around 40x, so it is possible that some adjustments are needed if you are at higher coverage.

Please attach the following to allow me to diagnose better:

rob234king commented 1 year ago

https://drive.google.com/drive/folders/12nrM38VnBwZGWLF_FZEq8ppmcatjjhDr?usp=share_link

Different data set and setting higher read length to reduce coverage but same story.

I'll try adjusting other parameters, thanks

rob234king commented 1 year ago

making adjustments by 8 fold and raising minimum read length to reduce coverage seems to do the trick

rob234king commented 1 year ago

But I lose the telomere sequence now. --config Nanopore-R10-Slow-Nov2022 --MarkerGraph.pruneIterationCount 1 --Kmers.probability 0.8 --Align.minAlignedMarkerCount 800 --Align.maxSkip 240 --Align.maxDrift 240 --Align.maxTrim 240

I'll play with the parameters, maybe too strict now.

paoloshasta commented 1 year ago

I just noticed that you are working with a very, very small number of reads. Because of that, you could also try --MinHash.allPairs. That will cause all pairs of reads to be considered as alignment candidates, instead of using the MinHash algorithm. Normally this is prohibitive from a performance point of view.

Independently of that, look at the number of "good alignments" in AssemblySummary.html. For a healthy assembly that is usually around 5 to 10 times the number of reads. If lower, you need to relax alignment criteria. I never experimented much with marker density that high, so I have little intuition for it. The fraction of isolated reads (also reported in AssemblySummary.html) is also a useful indicator and should be below 20-30% for a healthy assembly.

If you want to dig more, you can also use the Shasta http server functionality to look at assembly details - for example, alignments, marker graph details, etc. See here for more information on that.

rob234king commented 1 year ago

ok thanks, I'll have a play about a bit. Thanks for your time. Looks like solvable.

rob234king commented 1 year ago

I'm trying to explore the assembly but can't get the localhost to load up. Just wondering if you have seen this error before. I'm asking IT if a firewall problem. Chrome "This site can’t be reached" but firefox tries to load it but just keeps working and doesn't seem to progress. Below is my command and output. I assume this is the problem "glxtest: libEGL no display"

./shasta-Linux-0.11.1 --threads 20 --Reads.minReadLength 8000 --input chr3_PATERNAL_reverse.bam.fasta --assemblyDirectory shasta_08 --memoryBacking disk --memoryMode filesystem --config Nanopore-R10-Slow-Nov2022 --MarkerGraph.pruneIterationCount 1 --Kmers.probability 0.8 --Align.minAlignedMarkerCount 800 --Align.maxSkip 120 --Align.maxDrift 120 --Align.maxTrim 120 --command explore
Shasta Release 0.11.1
Setting up consensus caller Modal
Documentation is not available.
Listening for http requests on port 17100
To connect, point your browser to http://localhost:17100
Only accepting local connections originating from a process owned by the same user running the server.
[GFX1-]: glxtest: libEGL no display
[GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary
rob234king commented 1 year ago

ok managed to get it working

rob234king commented 1 year ago

My marker graph is empty basically but how should I investigate why that is? I have a read graph.

Markers

Total number of markers on all reads, one strand | 3865013 -- | -- Total number of markers on all reads, both strands | 7730026 Average number of markers per base | 0.8395 Average base offset between markers | 1.191 Average base gap between markers | -12.81
paoloshasta commented 1 year ago

That summary shows the markers that were used during the assembly. The marker graph summary appears later in AssemblySummary.html.

The numbers you posted are normal for the options you are using. They say that the marker density (number of markers per base) is 0.84. This should be close to the value you specified using --Kmers.probability. This means that the average spacing between markers is 1.19 bases. Since you are using 14-base markers, the average overlap between successive markers is 14-1.91=12.81 bases, that is a gap of -12.81 bases.

If you want to look at details of the marker graph at a selected location, there is a section for that in the top menu. One of the options in the menu is to show details of the marker graph around a given vertex. You can specify the marker graph vertex by number, but there are two more useful ways to get there:

paoloshasta commented 1 year ago

For full functionality of the Shasta http server you will need to install Graphviz and Gnuplot. If you are on Ubuntu you can use shasta/scripts/InstallPrerequisites-Ubuntu.sh, but that installs a lot more software that you don't need, and that is only needed if you want to build Shasta.

rob234king commented 1 year ago

I'm still not getting in some cases the telomere sequence itself assembled. It seems to be cut off and I can't see why, any ideas? Of this part "Assembled a total 16123 bases for 2 assembly graph edges of which 1 were assembled." I'm hoping the other edge is my telomere. Is there a control to assemble all remaining edges?

Automatically selected value of MarkerGraph.minCoverage is 67 Kept 24284 disjoint sets with coverage in the requested range. Found 114 disjoint sets with more than one marker on a single oriented read or with less than 0 supporting oriented reads on each strand. Found 28718 edges for 24170 vertices. The marker graph has 24170 vertices and 28718 edges. Flagged as weak 20 edges with coverage 1 and marker skip greater than 100 Flagged as weak 3874 edges with coverage 1 out of 3958 total. Flagged as weak 400 edges with coverage 2 out of 400 total. Flagged as weak 104 edges with coverage 3 out of 104 total. Flagged as weak 38 edges with coverage 4 out of 38 total. Flagged as weak 18 edges with coverage 5 out of 18 total. Flagged as weak 4 edges with coverage 6 out of 4 total. Flagged as weak 10 edges with coverage 7 out of 10 total. Flagged as weak 6 edges with coverage 8 out of 6 total. Flagged as weak 6 edges with coverage 9 out of 6 total. Flagged as weak 2 edges with coverage 10 out of 2 total. Flagged as weak 2 edges with coverage 11 out of 2 total. Flagged as weak 2 edges with coverage 15 out of 2 total. Transitive reduction removed 4486 marker graph edges out of 28718 total. The marker graph has 24170 vertices and 24232 strong edges. Pruned 4 edges at prune iteration 0. The original marker graph had 24170 vertices and 28718 edges. The number of surviving edges is 24228. Begin simplifyMarkerGraph iteration 0 with maxLength = 10 Before iteration 0 part 1, the assembly graph has 132 vertices and 194 edges. Before iteration 0 part 2, the assembly graph has 132 vertices and 194 edges. Begin simplifyMarkerGraph iteration 1 with maxLength = 100 Before iteration 1 part 1, the assembly graph has 132 vertices and 194 edges. Before iteration 1 part 2, the assembly graph has 32 vertices and 44 edges. Begin simplifyMarkerGraph iteration 2 with maxLength = 1000 Before iteration 2 part 1, the assembly graph has 12 vertices and 14 edges. Before iteration 2 part 2, the assembly graph has 4 vertices and 2 edges. Begin simplifyMarkerGraph iteration 3 with maxLength = 10000 Before iteration 3 part 1, the assembly graph has 4 vertices and 2 edges. Before iteration 3 part 2, the assembly graph has 4 vertices and 2 edges. Begin simplifyMarkerGraph iteration 4 with maxLength = 100000 Before iteration 4 part 1, the assembly graph has 4 vertices and 2 edges. Before iteration 4 part 2, the assembly graph has 4 vertices and 2 edges. Removed 0 low coverage cross edges of the assembly graph and 0 corresponding marker graph edges. Before detangling, the assembly graph has 4 vertices and 2 edges. Found 0 tangles. The detangled assembly graph has 4 vertices. The detangled assembly graph has 2 edges. Removed 0 low coverage cross edges of the assembly graph and 0 corresponding marker graph edges. Using 20 threads. Assembly begins for 2 edges of the assembly graph. Assembled a total 16123 bases for 2 assembly graph edges of which 1 were assembled. The assembly graph has 4 vertices and 2 edges of which 1 were assembled.

paoloshasta commented 1 year ago

It's probably due to this:

Automatically selected value of MarkerGraph.minCoverage is 67

This is the coverage threshold for a marker graph vertex to be generated. It is controlled by command line option --MarkerGraph.minCoverage. If that is set to 0, like in your case, it triggers an algorithm to determine the threshold automatically based on the coverage histogram of the "disjoint sets" obtained during marker graph creation. That histogram is visible in file DisjointSetsHistogram.csv, and in a large assembly typically looks like this:

image

This is similar to k-mer histogram coverage plots and has a similar intepretation. In a case like that, the automatically selected value would be about 9, to avoid the peak due to errors on the left.

When the histogram is that clean, the automatic determination of the threshold usually works well. However, in your case, with a very small number of reads, the histogram will be very noisy, and so the automatic determination of the threshold could be completely off. A value of 67 sets a very stringent criterion to generate a marker graph vertex - that is, you are requiring the first assembled marker to appear in at least 67 reads. This explains why you are missing sequence. With this in mind, my recommendation is the following:

  1. Look at the histogram in DisjointSetsHistogram.csv and decide on a reasonable value for --MarkerGraph.minCoverage. This should be a value that excluded the error peak on the left, but not too high and most likely lower than 67. The higher the value, the more conservative the generation of marker graph vertices.
  2. After running with the value you chose, look at the resulting histograms of marker graph vertex and edge coverage (in MarkerGraphVertexCoverageHistogram.csv and MarkerGraphEdgeCoverageHistogram.csv) for a sanity check.

If you use the Shasta http server to look at details of the marker graph, you will see that in your current assembly the first vertex in the graph (assuming you turned off pruning) has high coverage, at least 67. Anything on the left will not be assembled.

Of this part "Assembled a total 16123 bases for 2 assembly graph edges of which 1 were assembled." I'm hoping the other edge is my telomere. Is there a control to assemble all remaining edges?

The marker graph contains both strands, and only one is assembled. There are two copies of each segment, reverse complemented relative to each other, and only one is assembled (the sequence of the one that is not assembled is obtained by reverse complementing). Assembly.fasta and Assembly.gfa contains only the copies that were actually assembled. Assembly-BothStrands.gfa contains both copies. AssemblySummary.csv contains a row for each segment that was assembled, and it also gives the id of the reverse complemented segment.

rob234king commented 1 year ago

Thanks I had a look at the histogram is fine, 67 is value that is suitable. Lowering it to 2 I can get telomere assembled as separate contig but the histogram is clean in terms of two peaks, one at the start which is a described.

It seems like some other parameter but I'll keep looking.

paoloshasta commented 1 year ago

You should be able to track the details using the Shasta http server functionality. Look at details of the marker graph near the beginning of the first contig (you can locate that using the assembly graph display). Then you can see what is happening to the reads in that region. Are marker graph vertices being generated? Are they being connected with edges? Under Marker graph there is an option to "follow" a read in the marker graph which could be useful for this.

rob234king commented 1 year ago

ok thanks, i'll have a look at that.

paoloshasta commented 1 year ago

I am closing this due to lack of recent discussion. Feel free to reopen it or create a new issue if additional topics emerge.