mdcao / npScarf

25 stars 3 forks source link

running program with .bam and scaffold #12

Open devonorourke opened 6 years ago

devonorourke commented 6 years ago

Hi,

I'm curious if it's possible to run npScarf with ONT reads aligned to a reference genome (consisting of unique contigs). I've aligned the long reads with minimap2 to generate an indexed and sorted .bam file.

It was my thought that all that was needed to accomplish this was to supply the .bam file along with the original contigs.fasta file which the reads were aligned to. Perhaps there is more? This was my command:

BAM=/path/to/my/sample.bam
CONTIGS=/path/to/my/contigs.fa

jsa.np.npscarf --input $BAM --format bam --seqFile $CONTIGS > log.out 2>&1

This generated a series of files (out.fin.fasta, out.fin.japsa, and the .log file), yet the number of unique scaffods (and total nucleotides) was orders of magnitude less than what was present in the original contigs.fa file.

In addition, there was also a series of warnings and errors in the log file:

[main] WARN japsa.tools.bio.np.NPScarfCmd - Not found any legal SPAdes output folder, assembly graph thus not included!
#Sort list of bridges
========================== START =============================
  contig   0  ======>     0   57033 contig_0 
Size = 1 sequence
============================ END ===========================
...
many more of these 
...
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.base/java.lang.System.arraycopy(Native Method)
    at java.base/java.util.Arrays.copyOfRange(Arrays.java:4030)
    at japsa.seq.Sequence.subSequence(Sequence.java:272)
    at japsa.bio.hts.scaffold.ContigBridge$Connection.filling(ContigBridge.java:1201)
    at japsa.bio.hts.scaffold.Scaffold.viewSequence(Scaffold.java:438)
    at japsa.bio.hts.scaffold.ScaffoldGraph.printSequences(ScaffoldGraph.java:965)
    at japsa.tools.bio.np.NPScarfCmd.main(NPScarfCmd.java:291)

I'm wondering what to make of the errors. Are each of these exceptions a product of me not providing a graph assembly? Are there additional arguments I should be entering with the jsa.np.npscarf command?

hsnguyen commented 6 years ago

Hi, Can you please provide the version used and the full log file? The input you provided is all it need to perform a fast scaffolding; assembly graph is optional to improve its accuracy.

devonorourke commented 6 years ago

Thanks for the reply.

version

Japsa: A Java Package for Statistical Sequence Analysis
Version 1.7-05b, Built on Mon Jul 09 15:42:09 EDT 2018 with javac 1.8.0_144

Log file from stdout:

[main] WARN japsa.tools.bio.np.NPScarfCmd - Not found any legal SPAdes output folder, assembly graph thus not included!
#Sort list of bridges
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.base/java.lang.System.arraycopy(Native Method)
    at java.base/java.util.Arrays.copyOfRange(Arrays.java:4030)
    at japsa.seq.Sequence.subSequence(Sequence.java:272)
    at japsa.bio.hts.scaffold.ContigBridge$Connection.filling(ContigBridge.java:1201)
    at japsa.bio.hts.scaffold.Scaffold.viewSequence(Scaffold.java:438)
    at japsa.bio.hts.scaffold.ScaffoldGraph.printSequences(ScaffoldGraph.java:965)
    at japsa.tools.bio.np.NPScarfCmd.main(NPScarfCmd.java:291)

and the main log file from the program:

========================== START =============================
  contig   0  ======>     0   57033 contig_0 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   1  ======>     0   11664 contig_1 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   2  ======>     0    3924 contig_2 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   3  ======>     0    2163 contig_3 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   4  ======>     0   11109 contig_4 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   5  ======>     0    3597 contig_5 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   6  ======>     0   46348 contig_6 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   7  ======>     0    9902 contig_7 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   8  ======>     0   28300 contig_8 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig   9  ======>     0   19290 contig_9 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  10  ======>     0   35361 contig_10 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  11  ======>     0    2978 contig_11 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  12  ======>     0   13273 contig_12 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  13  ======>     0   64590 contig_13 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  14  ======>     0   22464 contig_14 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  15  ======>     0   21664 contig_15 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  16  ======>     0    3191 contig_16 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  17  ======>     0   46376 contig_17 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  18  ======>     0   15823 contig_18 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  19  ======>     0   36682 contig_19 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  20  ======>     0    2301 contig_20 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  21  ======>     0   38428 contig_21 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  22  ======>     0  112504 contig_22 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  23  ======>     0   18737 contig_23 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  24  ======>     0    2478 contig_24 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  25  ======>     0    2096 contig_25 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  26  ======>     0   21758 contig_26 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  27  ======>     0    1767 contig_27 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  28  ======>     0    1488 contig_28 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  29  ======>     0  148562 contig_29 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  30  ======>     0   13734 contig_30 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  31  ======>     0   15223 contig_31 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  32  ======>     0   31993 contig_32 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  33  ======>     0   35876 contig_33 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  34  ======>     0    1782 contig_34 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  35  ======>     0    2909 contig_35 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  36  ======>     0    3489 contig_36 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  37  ======>     0  108005 contig_37 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  38  ======>     0   23854 contig_38 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  39  ======>     0    2960 contig_39 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  40  ======>     0    1400 contig_40 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  41  ======>     0    1683 contig_41 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  42  ======>     0    1637 contig_42 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  43  ======>     0    1529 contig_43 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  44  ======>     0    2810 contig_44 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  45  ======>     0    1187 contig_45 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  46  ======>     0  103828 contig_46 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  47  ======>     0   58065 contig_47 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  48  ======>     0   41333 contig_48 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  49  ======>     0   12025 contig_49 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  50  ======>     0    3734 contig_50 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  51  ======>     0  214484 contig_51 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  52  ======>     0  101119 contig_52 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  53  ======>     0    2790 contig_53 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  54  ======>     0   12789 contig_54 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  55  ======>     0    3603 contig_55 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  56  ======>     0    7372 contig_56 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  57  ======>     0   26805 contig_57 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  58  ======>     0   40508 contig_58 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  59  ======>     0   68746 contig_59 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  60  ======>     0   43533 contig_60 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  61  ======>     0    7689 contig_61 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  62  ======>     0   30740 contig_62 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  63  ======>     0   76787 contig_63 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  64  ======>     0    2552 contig_64 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  65  ======>     0  159498 contig_65 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  66  ======>     0   94585 contig_66 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  67  ======>     0   26973 contig_67 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  68  ======>     0   98250 contig_68 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  69  ======>     0   20657 contig_69 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  70  ======>     0    9166 contig_70 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  71  ======>     0   56353 contig_71 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  72  ======>     0   18272 contig_72 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  73  ======>     0   76385 contig_73 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  74  ======>     0    2163 contig_74 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  75  ======>     0   19157 contig_75 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  76  ======>     0   64880 contig_76 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  77  ======>     0    7124 contig_77 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  78  ======>     0   17870 contig_78 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  79  ======>     0   25924 contig_79 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  80  ======>     0    3119 contig_80 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  81  ======>     0    3007 contig_81 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  82  ======>     0  108357 contig_82 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  83  ======>     0  137239 contig_83 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  84  ======>     0   18846 contig_84 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  85  ======>     0   83176 contig_85 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  86  ======>     0    8538 contig_86 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  87  ======>     0  102171 contig_87 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  88  ======>     0  102584 contig_88 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  89  ======>     0   68312 contig_89 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  90  ======>     0    4491 contig_90 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  91  ======>     0   76669 contig_91 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  92  ======>     0    1976 contig_92 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  93  ======>     0   14133 contig_93 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  94  ======>     0   42122 contig_94 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  95  ======>     0    1865 contig_95 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  96  ======>     0   71343 contig_96 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  97  ======>     0    3452 contig_97 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  98  ======>     0    3142 contig_98 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig  99  ======>     0    9282 contig_99 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 100  ======>     0   13018 contig_100 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 101  ======>     0  123637 contig_101 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 102  ======>     0   19738 contig_102 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 103  ======>     0   19721 contig_103 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 104  ======>     0   15656 contig_104 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 105  ======>     0   93369 contig_105 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 106  ======>     0   66779 contig_106 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 107  ======>     0   62539 contig_107 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 108  ======>     0   23076 contig_108 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 109  ======>     0   44246 contig_109 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 110  ======>     0    9789 contig_110 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 111  ======>     0   19394 contig_111 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 112  ======>     0   37531 contig_112 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 113  ======>     0   88731 contig_113 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 114  ======>     0    3254 contig_114 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 115  ======>     0   95948 contig_115 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 116  ======>     0  113561 contig_116 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 117  ======>     0  123301 contig_117 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 118  ======>     0   42704 contig_118 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 119  ======>     0  172348 contig_119 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 120  ======>     0    3029 contig_120 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 121  ======>     0   29671 contig_121 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 122  ======>     0   39772 contig_122 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 123  ======>     0   50605 contig_123 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 124  ======>     0   34707 contig_124 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 125  ======>     0   75815 contig_125 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 126  ======>     0    6169 contig_126 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 127  ======>     0   28989 contig_127 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 128  ======>     0   51429 contig_128 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 129  ======>     0    1645 contig_129 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 130  ======>     0     817 contig_130 
Size = 1 sequence
============================ END ===========================
========================== START =============================
  contig 131  ======>-74178    -115 contig_131 gaps =  115
  contig 132  ======>     0  128995 contig_132 
Size = 2 sequence
devonorourke commented 6 years ago

One thing to note: the .bam file has 1587 contigs, while the resulting scaffold file that npScarf produced has just 131 unique sequences.

Maybe I'm just making an oversight. Is the resulting out.fin.fasta file exclusively the regions which were joined, excluding all other sequences in the original reference fasta which were not merged?

hsnguyen commented 6 years ago

Since your contigs don't have information about the coverage, npScarf just relied on the length of contigs to determine anchor (significant single-copy contigs) vs artefact/repeatitive contigs. So the short contigs were considered insignificant thus not reported here.

Also looingk at the log file, only one bridge was created, meaning that there is only one useful alignment fo npScarf from your BAM - is it normal in this case? I think it's not so much useful being able to only join 2 contigs together. Anw, can you plz try clone the newest git version of japsa and run again with --verbose option and send the log file to me.

Another tip is to play around with parameters of npScarf since your genome might need different settings; especially the --maxRepeat is important since your data is lack of coverage information.

devonorourke commented 6 years ago

Thanks - can we back up a step? Where does npScarf pull the coverage info; could I supply a file that does that into the current command as a parameter? There should be about 17x coverage (estimated by a recent canu assembly) for the genome in question.

I'll try rerunning shortly and send you an update.

Cheers

hsnguyen commented 6 years ago

npScarf will try to pull the coverage info from the contigs' name output by short-read assemblers. For example, if contigs come from SPAdes/Velvet, they will have format like

(ID)_length_(y)_cov_(z)

; or

(ID) (length) (k-mer sum)

from AbySS...

If I understand it right, you're trying to use npScarf to finish Canu assembly, no? You can do that eventhough the task is not designed for the tool to work with. npScarf's main purpose is to complete the assembly from a short-read assembler (input Illumina data) by using the long-read data (nanopore reads). The idea is that due to repetitive elements, short-read assemblers are usually suffered from fragmentations; and since long reads can span long regions, they'd help to bridge those gaps. Canu itself takes long read data as input, so the bridging information is already used in its pipeline. So I'm afraid reusing the long reads to scaffold Canu's contigs wouldn't give significant improvement. Correct me if I'm wrong. Cheers,

roblanf commented 5 years ago

Hi there,

just weighing in here because we're also considering using npscarf to scaffold a long-read assembly from CANU. Here's one reason I think it might be useful:

CANU is not diploid aware, so in our assembly we get a partially phased genome. We can deal with the partial phasing by deleting one copy of each of the phased regions and then re-polishing the final contigs with short reads. But the result is still a highly fragmented assembly, for which the long reads should still contain good scaffolding information.

So, in this use case I think npscarf might be worth trying. My question, can you specify the exact syntax required for the contig names?

Given our fasta files of contigs, I can easily re-map long or short reads, and rename contigs to include coverage information in the way that npScarf expects it.

Also, I'm assuming that npScarf is justing relying on relative coverage of contigs (in the sense that collapsed repeats will have higher coverage). Is that right?

Cheers,

Rob

hsnguyen commented 5 years ago

The contig name follow the syntax from SPAdes/Velvet or AbySS (dev) assembler; the default one based on SPAdes ouput when you'll have for example >NODE_1_length_839658_cov_89.484 npScarf will try to extract the coverage information from that name and convert to read coverage.

It also relies on the length of the sequence to help guessing if it's repetitive or not. You can change this from --maxRepeat parameter.

It's around 8Kb by default for our bacteria datasets of interest; but you need to increase this figure if you're working with more complex genomes. It depends on the genomes you're trying to assembly, or just set it very long if you are not sure.

Good luck with your pipeline, even though last time when checking the run log, there weren't many useful alignments had been used to scaffold further your draft Canu assembly. I think it might be worth trying more sensitive aligner for more alignments too (e.g. i normally run minimap2 with -k15 -w5)

Cheers,