marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.
294 stars 29 forks source link

AssertionError in processONT step with plant genome #37

Closed baozg closed 2 years ago

baozg commented 2 years ago

Hi,

I am using Arabidopsis thaliana data for testing with default parameter. I can run the Ecoli data successfully. The public data is 140x HiFi (22Gb, reads N50 15kb) and 115x ONT (26G, N50 82kb) . But verkko stop after aligning ONT data to the graph.

Here is the log of building graph

# buildGraph.err
22222 unitigs after resolving
Building unitig sequences
Reading sequences from ../0-correction/hifi-corrected.fasta
Writing graph to ../1-buildGraph/hifi-resolved.gfa
Writing paths to ../1-buildGraph/paths.gaf
selecting k-mers and building graph topology took 544,192 s
unitigifying took 16,372 s
filtering unitigs took 0,567 s
getting read paths took 31,765 s
resolving unitigs took 52310,996 s
building unitig sequences took 529,209 s
forcing edge consistency took 2,672 s
writing the graph and calculating stats took 4,590 s
writing sequence paths took 15,649 s
nodes: 22222
edges: 16388
assembly size 153868711 bp, N50 100780
approximate number of k-mers ~ 131624489

# processONT.err

Step 1a

Step 2a
inserted 198 gaps

Step 2b

Step 2c
Traceback (most recent call last):
  File "/public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/scripts/calculate_coverage.py", line 41, in <module>
    assert nodename in node_sizes
AssertionError
skoren commented 2 years ago

This should be fixed in 83e67a2bf73650f5793783f8ec0fff48720eaa7a. You can either apply the patch to the condo-installed scripts (the file /public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/Snakefiles/4-processONT.sm) or you can download the latest from the GitHub repo and compile following the instructions on the README.

baozg commented 2 years ago

Thanks for promptly response. It work well in 4-processONT.sm, but it throw error in the step 6-layoutContigs.

Traceback (most recent call last):
  File "/public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/scripts/get_layout_from_mbg.py", line 276, in <module>
    matches = get_matches(path, node_poses, contig_nodeseqs, raw_node_lens, edge_overlaps, pathleftclip, pathrightclip, readleftclip,
  File "/public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/scripts/get_layout_from_mbg.py", line 112, in get_matches
    assert read_end_pos > read_start_pos
AssertionError
skoren commented 2 years ago

Ah that'd be the next bug fix: 24e20b6fde5647843868e6bc3be865efa1143c82. Again you can edit but I might suggest getting the latest from repo because there are a couple of other bug fixes since the beta there.

baozg commented 2 years ago

OK, I will replace the all the code of conda environment. Thanks!

baozg commented 2 years ago

After replace the scripts, I can run the verkko pipeline with the Arabidopsis thaliana public HiFi and ONT data. But the 7-consensus/unitig-popped.fasta size is very strange, it has 65 contigs (12,296,409 only).

The data I use is the https://ngdc.cncb.ac.cn/gsa/browse/CRA004538 I filter out the ONT reads less than <50kb. All other parameters are default of verkko.

skoren commented 2 years ago

Interesting, I'll try launching the data locally to see what happens. What are the sizes of the gfa files you have (in 4-processONT and 5-untip)?

baozg commented 2 years ago
# 4-processONT
-rw-rw-r-- 1 sci0009 sci0009 8.1M Jan 24 23:51 combined-edges-uniques.gfa
-rw-rw-r-- 1 sci0009 sci0009 151M Jan 24 23:52 connected.gfa
-rw-rw-r-- 1 sci0009 sci0009 151M Jan 24 23:51 gapped-unitig-unrolled-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 151M Jan 24 23:52 normal-connected.gfa
-rw-rw-r-- 1 sci0009 sci0009  94M Jan 24 23:52 ont-resolved-graph.gfa
-rw-rw-r-- 1 sci0009 sci0009  94M Jan 24 23:52 unitig-unrolled-ont-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009  94M Jan 24 23:52 unrolled-ont-resolved.gfa

# 5-untip
-rw-rw-r-- 1 sci0009 sci0009 11M Jan 24 23:52 combined-edges-1.gfa
-rw-rw-r-- 1 sci0009 sci0009 11M Jan 24 23:52 combined-edges-final.gfa
-rw-rw-r-- 1 sci0009 sci0009 93M Jan 24 23:52 connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009 93M Jan 24 23:52 popped-unitig-normal-connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009 93M Jan 24 23:52 unitig-normal-connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009 93M Jan 24 23:52 unitig-popped-unitig-normal-connected-tip.gfa
skoren commented 2 years ago

Based on that the assembly should be about 130 mbp so the issue is in the consensus step. Perhaps snakemake had an issue with the code changes mid-run. Can you remove the 6-layoutContigs and 7-consensus folders and re-run. If it's still too small, can you share the 6-layoutContigs/uniting-popped.layout file here?

baozg commented 2 years ago

Actually, when I replace the scripts, I did remove the those two folders before snakemake running. I try it again. The 7-consensus/unitig-popped.fast is still 12Mb.

Here is the 6-layoutContigs/uniting-popped.layout unitig-popped.layout.txt

skoren commented 2 years ago

OK, that file is the issue, it's too small and has too few contigs. I suspect the bug fix to address your crash in 6-layoutContigs has another issue causing the file truncation. I've downloaded the data so I'll test locally and see if I can reproduce it.

baozg commented 2 years ago

The MBG is much slow than human and consume the most time for this data (15 hours with k=1001 and w=100 ),maybe you can try other parameters.

skoren commented 2 years ago

I ran this assembly and got a final fasta >200mb in size. I didn't change any parameters and used the full dataset. So the question is why is your 6-layoutContigs result truncated. Can you run ls -lha */6-layout*/, ls -lha */7-consensus/, and wc -l */6-layout*/*. Post the 6-layoutContigs/createLayouts.err file as well. Here are the results of the above on my assembly:

% ls -lha beta_asm/6-layoutContigs/
total 246M
drwxrwxr-x 2 korens Phillippy 4.0K Jan 25 23:44 .
drwxrwxr-x 2 korens Phillippy 4.0K Jan 26 06:09 ..
-rw-rw-r-- 1 korens Phillippy 162M Jan 25 23:41 combined-alignments.gaf
-rw-rw-r-- 1 korens Phillippy  12M Jan 25 23:41 combined-edges.gfa
-rw-rw-r-- 1 korens Phillippy 887K Jan 25 23:41 combined-nodemap.txt
-rw-rw-r-- 1 korens Phillippy  29K Jan 25 23:41 consensus_paths.txt
-rw-rw-r-- 1 korens Phillippy 131K Jan 25 23:44 createLayout.err
-rwxrwxr-x 1 korens Phillippy 3.4K Jan 25 23:41 createLayout.sh
-rw-rw-r-- 1 korens Phillippy   34 Jan 25 23:45 gaps.txt
-rw-rw-r-- 1 korens Phillippy 3.0M Jan 25 23:41 nodelens.txt
-rw-rw-r-- 1 korens Phillippy  68M Jan 25 23:45 unitig-popped.layout
% ls -lha beta_asm/7-consensus/
total 393M
drwxrwxr-x 2 korens Phillippy 4.0K Jan 26 06:09 .
drwxrwxr-x 2 korens Phillippy 4.0K Jan 26 06:09 ..
-rw-rw-r-- 1 korens Phillippy 465K Jan 26 00:22 buildPackages.err
-rwxrwxr-x 1 korens Phillippy  397 Jan 25 23:47 buildPackages.sh
-rw-rw-r-- 1 korens Phillippy  304 Jan 26 06:09 combineConsensus.err
-rwxrwxr-x 1 korens Phillippy  357 Jan 26 06:09 combineConsensus.sh
-rw-rw-r-- 1 korens Phillippy 1008 Jan 25 23:40 extractONT.err
-rwxrwxr-x 1 korens Phillippy 1.8K Jan 25 23:38 extractONT.sh
-rw-rw-r-- 1 korens Phillippy 1.8K Jan 25 23:41 ont_subset.extract
-rw-rw-r-- 1 korens Phillippy 105M Jan 25 23:41 ont_subset.fasta.gz
-rw-rw-r-- 1 korens Phillippy 160K Jan 25 23:41 ont_subset.id
drwxrwxr-x 2 korens Phillippy 4.0K Jan 26 00:27 packages
-rw-rw-r-- 1 korens Phillippy    0 Jan 26 00:22 packages.finished
-rw-rw-r-- 1 korens Phillippy  80M Jan 26 00:12 packages.readName_to_ID.map
-rw-rw-r-- 1 korens Phillippy  24K Jan 26 00:22 packages.tigName_to_ID.map
-rw-rw-r-- 1 korens Phillippy 209M Jan 26 06:09 unitig-popped.fasta
% wc -l beta_asm/6-layoutContigs/*
  1660064 beta_asm/6-layoutContigs/combined-alignments.gaf
   348248 beta_asm/6-layoutContigs/combined-edges.gfa
    27146 beta_asm/6-layoutContigs/combined-nodemap.txt
     1369 beta_asm/6-layoutContigs/consensus_paths.txt
     1369 beta_asm/6-layoutContigs/createLayout.err
       64 beta_asm/6-layoutContigs/createLayout.sh
        1 beta_asm/6-layoutContigs/gaps.txt
   211548 beta_asm/6-layoutContigs/nodelens.txt
  1430506 beta_asm/6-layoutContigs/unitig-popped.layout
  3680315 total
baozg commented 2 years ago

$ ls -lha 6-layoutContigs/*
-rw-rw-r-- 1 sci0009 sci0009 161M Jan 25 01:49 6-layoutContigs/combined-alignments.gaf
-rw-rw-r-- 1 sci0009 sci0009  11M Jan 25 01:49 6-layoutContigs/combined-edges.gfa
-rw-rw-r-- 1 sci0009 sci0009 836K Jan 25 01:49 6-layoutContigs/combined-nodemap.txt
-rw-rw-r-- 1 sci0009 sci0009 8.0K Jan 25 01:49 6-layoutContigs/consensus_paths.txt
-rw-rw-r-- 1 sci0009 sci0009  78K Jan 25 01:50 6-layoutContigs/createLayout.err
-rwxrwxr-x 1 sci0009 sci0009 3.4K Jan 25 01:49 6-layoutContigs/createLayout.sh
-rw-rw-r-- 1 sci0009 sci0009  365 Jan 25 01:50 6-layoutContigs/gaps.txt
-rw-rw-r-- 1 sci0009 sci0009 2.9M Jan 25 01:50 6-layoutContigs/nodelens.txt
-rw-rw-r-- 1 sci0009 sci0009  40K Jan 25 01:50 6-layoutContigs/unitig-popped.layout

$ ls -lha 7-consensus/
total 46M
drwxrwxr-x  3 sci0009 sci0009 4.0K Jan 25 01:57 .
drwxrwxr-x 11 sci0009 sci0009 4.0K Jan 25 01:49 ..
-rw-rw-r--  1 sci0009 sci0009 289K Jan 25 01:55 buildPackages.err
-rwxrwxr-x  1 sci0009 sci0009  411 Jan 25 01:50 buildPackages.sh
-rw-rw-r--  1 sci0009 sci0009  342 Jan 25 01:57 combineConsensus.err
-rwxrwxr-x  1 sci0009 sci0009  403 Jan 25 01:57 combineConsensus.sh
-rw-rw-r--  1 sci0009 sci0009  432 Jan 25 01:50 extractONT.err
-rwxrwxr-x  1 sci0009 sci0009 1.3K Jan 25 01:49 extractONT.sh
-rw-rw-r--  1 sci0009 sci0009  774 Jan 25 01:50 ont_subset.extract
-rw-rw-r--  1 sci0009 sci0009  34M Jan 25 01:50 ont_subset.fasta.gz
-rw-rw-r--  1 sci0009 sci0009  45K Jan 25 01:50 ont_subset.id
drwxrwxr-x  2 sci0009 sci0009 4.0K Jan 25 01:56 packages
-rw-rw-r--  1 sci0009 sci0009    0 Jan 25 01:55 packages.finished
-rw-rw-r--  1 sci0009 sci0009  45K Jan 25 01:55 packages.readName_to_ID.map
-rw-rw-r--  1 sci0009 sci0009 1.5K Jan 25 01:55 packages.tigName_to_ID.map
-rw-rw-r--  1 sci0009 sci0009  12M Jan 25 01:57 unitig-popped.fasta

$ wc -l 6-layoutContigs/*
  1656796 6-layoutContigs/combined-alignments.gaf
   308801 6-layoutContigs/combined-edges.gfa
    22556 6-layoutContigs/combined-nodemap.txt
      400 6-layoutContigs/consensus_paths.txt
      400 6-layoutContigs/createLayout.err
       64 6-layoutContigs/createLayout.sh
        9 6-layoutContigs/gaps.txt
   197148 6-layoutContigs/nodelens.txt
     1011 6-layoutContigs/unitig-popped.layout
  2187185 total

createLayout.err.txt

baozg commented 2 years ago

By the way, what's the size and N50 of your Col-0 assembly ? It seems much large than the Col-0 reference.

skoren commented 2 years ago

The N50 is pretty meaningless on this assembly, it's the same as the reference. Total BP is 218 mb mostly because there are two tangles, the mitochondria which is very high coverage and probably has variation within the sample and a repeat at the end of Chromosome 4. Ignoring unitigs <1mb the assembly is 134 Mbp and all chromosomes are in a single uniting except Chromosome 5 which is split into two pieces.

The consensus_paths.txt are too small, these come from the 5-untip step. Can you post the file sizes and wc on the files in that directory along with untip.err? Did you ever run out of space during this run?

baozg commented 2 years ago

Wow, verkko resolve the 45S rDNA cluster in the Chr2. Organelle always confusing the nuclear genome assembly in the plant genome. I did not run out of space during this run. Maybe I can restart the whole pipeline before the ONT read mapping.

$ ls -lha 5-untip/*
-rw-rw-r-- 1 sci0009 sci0009  11M Jan 24 23:52 5-untip/combined-edges-1.gfa
-rw-rw-r-- 1 sci0009 sci0009  11M Jan 24 23:52 5-untip/combined-edges-final.gfa
-rw-rw-r-- 1 sci0009 sci0009 1.5M Jan 24 23:52 5-untip/combined-nodemap-1.txt
-rw-rw-r-- 1 sci0009 sci0009 1.5M Jan 24 23:52 5-untip/combined-nodemap-final.txt
-rw-rw-r-- 1 sci0009 sci0009  93M Jan 24 23:52 5-untip/connected-tip.gfa
-rwxrwxr-x 1 sci0009 sci0009  741 Jan 24 23:52 5-untip/getCoverage.unitig-popped-unitig-normal-connected-tip.sh
-rwxrwxr-x 1 sci0009 sci0009  721 Jan 24 23:52 5-untip/getCoverage.unitig-unrolled-hifi-resolved.sh
-rwxrwxr-x 1 sci0009 sci0009  709 Jan 24 23:52 5-untip/getCoverage.unitig-unrolled-ont-resolved.sh
-rw-rw-r-- 1 sci0009 sci0009 7.8K Jan 24 23:52 5-untip/nodecov_hifi_fix.csv
-rw-rw-r-- 1 sci0009 sci0009 2.9M Jan 24 23:52 5-untip/nodelens-1.txt
-rw-rw-r-- 1 sci0009 sci0009 2.9M Jan 24 23:52 5-untip/nodelens-final.txt
-rw-rw-r-- 1 sci0009 sci0009 1.8K Jan 24 23:52 5-untip/popinfo.err
-rw-rw-r-- 1 sci0009 sci0009  93M Jan 24 23:52 5-untip/popped-unitig-normal-connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009    0 Jan 24 23:52 5-untip/prepCoverage.err
-rwxrwxr-x 1 sci0009 sci0009 3.0K Jan 24 23:52 5-untip/prepCoverage.sh
-rw-rw-r-- 1 sci0009 sci0009  14K Jan 24 23:52 5-untip/unitig-mapping-3.txt
-rw-rw-r-- 1 sci0009 sci0009  11K Jan 24 23:52 5-untip/unitig-mapping-4.txt
-rw-rw-r-- 1 sci0009 sci0009  93M Jan 24 23:52 5-untip/unitig-normal-connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009    0 Jan 24 23:52 5-untip/unitig-popped-unitig-normal-connected-tip.getCoverage.err
-rw-rw-r-- 1 sci0009 sci0009  93M Jan 24 23:52 5-untip/unitig-popped-unitig-normal-connected-tip.gfa
-rw-rw-r-- 1 sci0009 sci0009 5.0K Jan 24 23:52 5-untip/unitig-popped-unitig-normal-connected-tip.hifi-coverage.csv
-rw-rw-r-- 1 sci0009 sci0009 7.4K Jan 24 23:52 5-untip/unitig-popped-unitig-normal-connected-tip.ont-coverage.csv
-rw-rw-r-- 1 sci0009 sci0009  125 Jan 24 23:52 5-untip/untip.err
-rwxrwxr-x 1 sci0009 sci0009 3.0K Jan 24 23:52 5-untip/untip.sh

$ wc -l 5-untip/*
   308179 5-untip/combined-edges-1.gfa
   308801 5-untip/combined-edges-final.gfa
    42264 5-untip/combined-nodemap-1.txt
    42664 5-untip/combined-nodemap-final.txt
     1215 5-untip/connected-tip.gfa
        5 5-untip/getCoverage.unitig-popped-unitig-normal-connected-tip.sh
        5 5-untip/getCoverage.unitig-unrolled-hifi-resolved.sh
        5 5-untip/getCoverage.unitig-unrolled-ont-resolved.sh
      501 5-untip/nodecov_hifi_fix.csv
   196241 5-untip/nodelens-1.txt
   197148 5-untip/nodelens-final.txt
      109 5-untip/popinfo.err
      925 5-untip/popped-unitig-normal-connected-tip.gfa
        0 5-untip/prepCoverage.err
       46 5-untip/prepCoverage.sh
      562 5-untip/unitig-mapping-3.txt
      400 5-untip/unitig-mapping-4.txt
     1200 5-untip/unitig-normal-connected-tip.gfa
        0 5-untip/unitig-popped-unitig-normal-connected-tip.getCoverage.err
      604 5-untip/unitig-popped-unitig-normal-connected-tip.gfa
      339 5-untip/unitig-popped-unitig-normal-connected-tip.hifi-coverage.csv
      397 5-untip/unitig-popped-unitig-normal-connected-tip.ont-coverage.csv
        8 5-untip/untip.err
       59 5-untip/untip.sh
  1101677 total

## untip.err
UntipRelative
Unitigify 3
Combine mappings
Combine edges
Find lengths
Fix coverage
Pop bubbles based on coverage
Unitigify 4
skoren commented 2 years ago

It's possible the cluster on Chr2 is disconnected in the graph but the unitig mapping to Chr2 is longer than the reference. Chr5 should not have been disconnected either, that looks like a bug so thanks for sharing this data. I'll make a new issue to look into that. I attached the final graph.

Screen Shot 2022-01-26 at 11 47 30 AM

Restarting part of the pipeline is a good idea. My guess is snakemake thought your failed 4-processONT step was complete and everything downstream of it got corrupted. I'd suggest removing the folders from 4-* onwards as well as the assembly.* files in the top level and restart.

baozg commented 2 years ago

The graph looks awesome since the MBG gfa is very fragmented (153Mb, 100kb N50). Sorry, which reference do you use ? There are two Col-0 asssemly (Col-XJTU and Col-CEN), the data I download is from the Col-XJTU, although the two assembly should is same.

I restarted the pipeline from the ONT reads mapping.

skoren commented 2 years ago

I used TAIR10 as the reference. The MBG graph is expected to be fragmented since it's before HiFi or ONT read resolution.

baozg commented 2 years ago

I meet the same problem as the #40 when I restarted the pipeline from aligning ONT reads.

skoren commented 2 years ago

That seems very strange if it already ran this step before, nothing should have changed. Is the hifi-nodecov-gapped-unitig-unrolled-hifi-resolved.csv file also very small as in #40? Try running the script by hand:

/public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/scripts/get_original_coverage.py \
  gapped-unitig-unrolled-hifi-resolved.gfa \
  combined-nodemap-uniques.txt \
  combined-edges-uniques.gfa \
  nodelens-uniques.txt \
  ../1-buildGraph/hifi_nodecov.csv \
> hifi-nodecov-gapped-unitig-unrolled-hifi-resolved.csv

/public10/home/sci0009/miniconda3/envs/snakemake/lib/verkko/scripts/estimate_unique_local.py \
  gapped-unitig-unrolled-hifi-resolved.gfa \
  hifi-nodecov-gapped-unitig-unrolled-hifi-resolved.csv \
  alns-ont-filter-trim.gaf 100000 30 0.8 \
> unique_nodes_ont_coverage.txt
baozg commented 2 years ago

Yes. The hifi-nodecov-gapped-unitig-unrolled-hifi-resolved.csv only have the header when I run by hand.

baozg commented 2 years ago

I find the major difference of two runs is the combined-nodemap-uniques.txt.

40216 combined-nodemap-uniques.txt (run1) 20108 ../../Arab_new/4-processONT/combined-nodemap-uniques.txt (run2)

The command difference is below. Which one is correct ?

# run1
cat ../2-processGraph/unitig-mapping-1.txt ../2-processGraph/unitig-mapping-hifigap.txt ../2-processGraph/unroll_mapping_1.txt  \
  > combined-nodemap-uniques.txt

# run2
#  cat *mapping* > ...
cat ../2-processGraph/unitig-mapping-1.txt ../2-processGraph/unroll_mapping_1.txt  \
  > combined-nodemap-uniques.txt
baozg commented 2 years ago

Here is the size of step1 and step2.

 $ ls -lha 1-buildGraph/*
-rw-rw-r-- 1 sci0009 sci0009 724K Jan 24 07:00 1-buildGraph/buildGraph.err
-rwxrwxr-x 1 sci0009 sci0009 1.5K Jan 23 16:09 1-buildGraph/buildGraph.sh
-rw-rw-r-- 1 sci0009 sci0009 292K Jan 24 07:00 1-buildGraph/hifi_nodecov.csv
-rw-rw-r-- 1 sci0009 sci0009 149M Jan 24 07:00 1-buildGraph/hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 161M Jan 24 07:00 1-buildGraph/paths.gaf

$ ls -lha 2-processGraph/*
-rw-rw-r-- 1 sci0009 sci0009 153M Jan 24 07:01 2-processGraph/gapped-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 149M Jan 24 07:00 2-processGraph/gapped-once-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 149M Jan 24 07:01 2-processGraph/gapped-twice-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009  773 Jan 24 07:00 2-processGraph/gaps-hifi-1.gaf
-rw-rw-r-- 1 sci0009 sci0009    0 Jan 24 07:01 2-processGraph/gaps-hifi-2.gaf
-rw-rw-r-- 1 sci0009 sci0009 308K Jan 24 07:01 2-processGraph/gaps-hifi-3.gaf
-rw-rw-r-- 1 sci0009 sci0009 161M Jan 24 07:00 2-processGraph/paths-nogap-1.gaf
-rw-rw-r-- 1 sci0009 sci0009 161M Jan 24 07:01 2-processGraph/paths-nogap-2.gaf
-rw-rw-r-- 1 sci0009 sci0009 160M Jan 24 07:01 2-processGraph/paths-nogap-3.gaf
-rw-rw-r-- 1 sci0009 sci0009  190 Jan 24 07:01 2-processGraph/process.err
-rwxrwxr-x 1 sci0009 sci0009 1.7K Jan 24 07:00 2-processGraph/processGraph.sh
-rw-rw-r-- 1 sci0009 sci0009 150M Jan 24 07:01 2-processGraph/unitig-gapped-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 686K Jan 24 07:01 2-processGraph/unitig-mapping-1.txt
-rw-rw-r-- 1 sci0009 sci0009 662K Jan 24 07:01 2-processGraph/unitig-mapping-hifigap.txt
-rw-rw-r-- 1 sci0009 sci0009    0 Jan 24 23:52 2-processGraph/unitig-unrolled-hifi-resolved.getCoverage.err
-rw-rw-r-- 1 sci0009 sci0009 149M Jan 24 07:01 2-processGraph/unitig-unrolled-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009 317K Jan 24 23:52 2-processGraph/unitig-unrolled-hifi-resolved.hifi-coverage.csv
-rw-rw-r-- 1 sci0009 sci0009 321K Jan 24 23:52 2-processGraph/unitig-unrolled-hifi-resolved.ont-coverage.csv
-rw-rw-r-- 1 sci0009 sci0009 150M Jan 24 07:01 2-processGraph/unrolled-hifi-resolved.gfa
-rw-rw-r-- 1 sci0009 sci0009    0 Jan 24 07:01 2-processGraph/unroll_mapping_1.txt
skoren commented 2 years ago

The second one is correct, this was one of the updates since the beta release but the unroll_mapping_1.txt file still shouldn't be empty. I think the issue is that Snakemake doesn't handle updating code in the middle of the run well so now it's expecting output from an updated 2-processGraph run but it thinks this step is already complete.

You may have to go all the way back to remove everything except the 0-correction and 1-buildGraph steps. I don't think there have been changes to those two parts of the code since the beta.

baozg commented 2 years ago

Yep. I remove the 2-processGraph and run with snakemake. The pipeline output is different with the first run.

skoren commented 2 years ago

Was your crash in #40 also a restart without removing the 2-processGraph folder?

baozg commented 2 years ago

Yep. It was the same. When it crash, then I download the Arabidopsis thalian data. I will close it if this run is successful.

baozg commented 2 years ago

After removing the 2-proceesONT folder, I can produce a normal assembly, but it seems more frgmented than yours. The 9 contigs (>1Mb) consists of 133Mb.

# second column is the size
utig4-438     32652233        34007608        32652233        32652234
utig4-448       22416905        129365070       22416905        22416906
utig4-408       17649930        97249093        17649930        17649931
utig4-406       17323780        11      17323780        17323781
utig4-407       13624567        17622829        13624567        13624568
utig4-400       12842428        78636984        12842428        12842429
utig4-424       10349929        118019802       10349929        10349930
utig4-418       4278631 91767898        4278631 4278632
utig4-401       2231285 115430493       2231285 2231286
utig4-465       299014  17323803        299014  299015

Do you show the graph of 5-untip/unitig-popped-unitig-normal-connected-tip.gfa ? image

skoren commented 2 years ago

Good that you're getting an assembly now. There is some randomness in MBG still, the asm is similar to mine except yours h as a few unresolved bubbles which led to shorter contigs. Are you still using only ONT reads >50kb? That could explain the difference, we've seen improvements in human genome when we use the shorter ONT reads, mostly likely because of some sub-optimal coverage thresholds that verkko has.

baozg commented 2 years ago

Yes. I use the reads > 50kb. I try to use the all ONT reads, but the result is similiar with larger genome size (160Mb) than using reads >50kb (140 Mb). It seems MBG randomness is more contributed to this shorter contigs.

## > 1Mb contigs
utig4-867       32555598        34674956        32555598        32555599
utig4-895       26161544        142057026       26161544        26161545
utig4-865       17649930        116148922       17649930        17649931
utig4-863       17323326        11      17323326        17323327
utig4-898       15876846        17323349        15876846        15876847
utig4-866       12842225        69678118        12842225        12842226
utig4-881       3962735 82615346        3962735 3962736
utig4-715       3906210 137641692       3906210 3906211
utig4-873       2691859 133798864       2691859 2691860

image

skoren commented 2 years ago

We have an issue to track instability in MBG (#24) so I'm going to close this since you can run Verkko now and we'll use your dataset to test variation from run to run.