lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License

Raven is not performing so well on highly heterozygous regions as Ra #7

Closed nadegeguiglielmoni closed 4 years ago

nadegeguiglielmoni commented 5 years ago

Hi,

Thank you for Ra and Raven.

Ra v0.2.1 Raven v0.0.3

I have a diploid genome for which other assemblers juxtapose the two haplotypes instead of "crushing" them. I tested Ra and was very happy to find that, when run with the longest reads, it performed much better in this respect than the other long-read assemblers I know. However, Ra shortened repeated regions, so I tested Raven, hoping that it might improve on this. Sadly, when using Raven with the longest reads, as I did with Ra, I found juxtaposed haplotypes like I had with other assemblers.

rvaser commented 5 years ago

Hello Nadège, the main differences between Ra and Raven are that nothing is stored to disk anymore and that Raven employs a new heuristic for graph untangling. We also changed the alignment parameters of Racon from -m 5 -x -4 -g -8 to -m 3 -x -5 -g -4. Are there bigger structural differences between the assemblies, or does the difference manifest in SNPs and indels?
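
To make the effect of the two Racon parameter sets concrete, here is a minimal sketch that scores the same small difference in two ways. It assumes -m, -x and -g are the match, mismatch and gap scores and that gaps are penalised linearly; the helper `linear_score` and the 100-base window are made up for illustration and are not part of Racon.

```python
def linear_score(matches, mismatches, gaps, m, x, g):
    """Alignment score under a linear gap model (assumed, for illustration)."""
    return matches * m + mismatches * x + gaps * g


OLD = dict(m=5, x=-4, g=-8)  # parameters quoted for Ra
NEW = dict(m=3, x=-5, g=-4)  # parameters quoted for Raven

# One differing base in a 100-base window can be aligned either as a
# substitution (99 matches + 1 mismatch) or as an insertion plus a
# deletion (99 matches + 2 gap positions).
for name, params in (("old", OLD), ("new", NEW)):
    as_mismatch = linear_score(99, 1, 0, **params)
    as_gap_pair = linear_score(99, 0, 2, **params)
    print(name, as_mismatch, as_gap_pair, as_mismatch - as_gap_pair)
```

With the old set the substitution wins by 12 points, with the new set by only 3, and a single gap (-4) is now cheaper than a mismatch (-5). So the newer parameters are noticeably more tolerant of indels, which mainly matters for how SNPs and indels end up in the consensus.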

Best regards, Robert

nadegeguiglielmoni commented 5 years ago

Thank you for your answer.

One big difference for sure between the two assemblies is that the Ra assembly is 93 Mb and the Raven assembly is 109 Mb, using the same reads.

Best regards, Nadège

rvaser commented 5 years ago

What is the expected size of the genome you are assembling?

nadegeguiglielmoni commented 5 years ago

~91-92 Mb

rvaser commented 5 years ago

How about the contig number in both assemblies?

ocho commented 5 years ago

Hello. May I hijack this issue? Just some notes.

Currently I am doing several draft assemblies of highly heterozygous genomes as well, which are similar in size (~65 Mb). Unfortunately, the sequencing coverage might be a little low given the degree of heterozygosity (>1.5% het, ~60x depth). So I attempted to include self-error-correction with Canu, trying to collapse the heterogeneity of the reads a bit. Does error correction affect Ra/Raven in any way, and should it be avoided? (Edit: I noticed there are high-copy-number plasmids/regions, but apart from the mitochondria I didn't look at them too much.) The reason I am trying this is, basically, that I get very low coverage when I align the raw reads back to the assembly (only 30x? ...utter confusion).

Previously I used a hybrid approach with MaSuRCA, which has given good results so far and even recovers telomeric ends. Now with Raven the telomeric ends seem to be collapsed (my guess is that they are not correctly annotated as 'repeats'). This is something I would like to avoid. Probably for this reason, the contig numbers with Raven are amazingly low (about 60-100) in comparison to the MaSuRCA hybrid assemblies, which result in about 100-150 contigs (and I think the telomeres are split correctly).

All in all, an amazing assembler that works for the most crucial parts of our genome. The long repetitive PKS-related genes of interest (10kbp with 5k repetitive elements) are correctly assembled using Flye, MaSuRCA, Canu and here also Raven, something that SPAdes (hybrid) and wtdbg2 (PacBio-only) have not been able to handle as well. Now I am trying to tweak the contiguity, and I just learned about Raven last week via: https://github.com/rrwick/Long-read-assembler-comparison#assemblers-and-commands

I also suspect the PacBio reads are not able to span the repetitive regions in the genome, so the contiguity might be limited.

See mapping statistics from qualimap in the attachment: qualimapReport.pdf

rvaser commented 5 years ago

Hi Martin, self-corrected reads should not affect the outcome by much, although we did not investigate that in detail. We are not treating telomeres differently at the moment, but have plans to try and deal with it in the future. Not sure what to tell you about the lower percent of raw reads aligned to the reference, maybe the alignment parameters in Racon should be changed (current are 3 -5 -4). If you want to do so, run Raven with -p 0 which will not run Racon, and afterwards run Racon manually with different alignment parameters. You can also polish the genome with Illumina reads and Racon, if you have them.

Sorry for the wait! Best regards, Robert

ghost commented 5 years ago

Hello, how exactly does the new heuristic for graph untangling differ? I suspect the problem might be there (I am working with the original author of this thread).

rvaser commented 5 years ago

At the end of layout, instead of removing edges by overlap length from junction nodes, we implemented a new approach based on graph drawings. We still remove edges, but now this procedure is guided by vertex distances in the drawing. I do not think this is the cause of the mixed haplotypes.

Nadège will try with different alignment parameters for polishing, maybe this is the culprit. If not, I'll investigate further.

Best regards, Robert

ghost commented 5 years ago

Thank you. Just to be clear and to make sure we are on the same page: the problem we have is that Raven outputs both haplotypes A' and A'', instead of outputting A (the crushed version of A' and A''). Hence we need to remove one of the paths in the graph. How do the Racon alignment parameters influence this? My understanding was that Racon operates after the graph paths have been "resolved", no? I hope I am making sense. Thanks for your patience with my many questions and requests; it is greatly appreciated.

rvaser commented 5 years ago

I guess I misunderstood. How many haplotigs are in the assembly that are not crushed? Do you have some estimate how large they are? Maximal or average length?

ghost commented 5 years ago

We have noticed in previous assemblies, for instance with Flye, that some regions had half the expected coverage. Inspecting the gfa revealed uncollapsed bubbles (and the genome we are working with is notoriously hard to assemble)(hence why I think adding gfa output to Raven would be really useful ^^ as I find myself using gfa quite a lot).

Here are some stats of those contigs, where regions of at least 10 Kb of half coverage are found

summary(data$V2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  102758   156237   288834  1174058   847838 12329754

Note, however, that those contigs are not necessarily affected by the coverage loss over their entire length. In total, among those contigs, there are around 20 Mb with half the expected coverage.
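
For anyone wanting to reproduce this kind of check, here is a rough sketch of how runs of roughly half coverage could be located from a per-base depth vector. The function name, the 0.25-0.75 band around half coverage and the toy numbers are assumptions for illustration; this is not the exact procedure used for the numbers in this comment.

```python
def low_coverage_runs(depth, expected, min_len=10_000, lo=0.25, hi=0.75):
    """Return half-open (start, end) intervals where per-base depth stays
    within [lo, hi] * expected for at least min_len consecutive bases.

    The band limits are arbitrary illustrative thresholds around 0.5x.
    """
    runs = []
    start = None
    for i, d in enumerate(depth):
        in_band = lo * expected <= d <= hi * expected
        if in_band and start is None:
            start = i  # a candidate run begins here
        elif not in_band and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the contig.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs


# Toy contig: 60x everywhere except a 12-base and a 2-base dip to 30x.
toy_depth = [60] * 5 + [30] * 12 + [60] * 3 + [30] * 2
print(low_coverage_runs(toy_depth, expected=60, min_len=10))
```

On real data the depth vector would typically come from something like per-base depth output of a read mapper, and the summed run lengths give the total amount of sequence at half coverage.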

EDIT: more accurate answer

rvaser commented 5 years ago

I have fixed a bug in the bubble popping of Raven, which might fix the problem. Could you try running v0.0.5? However, I also saw that the number of surviving reads used for graph construction differs between Raven and Ra (Ra has ~2500 fewer), which might also be the culprit.

ghost commented 5 years ago

The assembly is still 107 Mb long, so too long.

ghost commented 5 years ago

Could you give me a step-by-step procedure, first running minimap2 and then Rala? Seeing the GFA would be extremely helpful, I think.

rvaser commented 5 years ago

Thanks for testing! You can just run Ra, but first remove (it->outdegree() == 0 && it->indegree() == 0) from here, file found in vendor/rala in your Ra folder. Recompile, run and you will get the full GFA this time.

If you want to skip Racon, put return here and run cmake + make.

ghost commented 5 years ago

That's done already :D We tried to clean the GFA, but that gave a very, very strange result. We removed some paths in the graph with Bandage and saved the "positive nodes", and that results in scaffolds having a coverage of 1.

So, is there something that rala does when untangling its graph that we are not doing when manually editing the gfa with bandage?

rvaser commented 5 years ago

Hmm, the GFA is printed at the end of layout, so it should be equal to the final assembly, just unpolished. All nodes in the graph declared as unitigs are passed to Racon. Maybe the low accuracy is the problem?

ghost commented 5 years ago

I polished our manually curated final assembly with Racon, and that doesn't solve the issue. Of note, here are the statistics of the Ra output and of the GFA, after removing the code as suggested (it->outdegree() == 0 && it->indegree() == 0):

assemblyMetrics AtumRa_longONT.vaga.fa
#-------------------- GLOBAL STATISTICS -------------------#
N50 size= 5013827  number= 7
N80 size= 776127  number= 22
N90 size= 286147  number= 45
Assembly size= 91975795 number= 101 minSize= 39546 maxSize= 9523105 averageSize= 910651
#----------------------------------------------------------#
#-------------------- SIZE REPARTITION --------------------#
Size= >= 5000000    Number= 7          (6.93)    CumulativeSize= 47874282   (52.05)
Size= >= 1000000    Number= 17         (16.83)    CumulativeSize= 69888098    (75.99)
Size= >= 100000     Number= 89         (88.12)    CumulativeSize= 91060841    (99.01)
Size= >= 50000      Number= 100        (99.01)    CumulativeSize= 91936249    (99.96)
Size= >= 10000      Number= 101        (100.00)    CumulativeSize= 91975795     (100.00)
#----------------------------------------------------------#

>Ctg0 LN:i:5638592 RC:i:2526 XC:f:0.997428
>Ctg1 LN:i:70020 RC:i:51 XC:f:1.000000
>Ctg2 LN:i:6304696 RC:i:2908 XC:f:0.999762
>Ctg3 LN:i:6563805 RC:i:3116 XC:f:1.000000
>Ctg4 LN:i:373335 RC:i:179 XC:f:1.000000
>Ctg5 LN:i:119541 RC:i:39 XC:f:0.979079
>Ctg6 LN:i:392923 RC:i:154 XC:f:1.000000
>Ctg7 LN:i:4479504 RC:i:2055 XC:f:1.000000
>Ctg8 LN:i:9523105 RC:i:4317 XC:f:0.999317
>Ctg9 LN:i:229097 RC:i:118 XC:f:1.000000
>Ctg10 LN:i:1919320 RC:i:974 XC:f:1.000000
>Ctg11 LN:i:69151 RC:i:32 XC:f:1.000000
>Ctg12 LN:i:611767 RC:i:318 XC:f:1.000000
>Ctg13 LN:i:313795 RC:i:132 XC:f:1.000000
>Ctg14 LN:i:268057 RC:i:127 XC:f:0.998127
>Ctg15 LN:i:5354821 RC:i:2625 XC:f:1.000000
>Ctg16 LN:i:301520 RC:i:113 XC:f:1.000000
>Ctg17 LN:i:5013827 RC:i:2440 XC:f:0.999102
>Ctg18 LN:i:2014119 RC:i:952 XC:f:1.000000
>Ctg19 LN:i:412980 RC:i:313 XC:f:1.000000
>Ctg20 LN:i:270167 RC:i:140 XC:f:1.000000
>Ctg21 LN:i:9475436 RC:i:4575 XC:f:0.999419
>Ctg22 LN:i:142749 RC:i:69 XC:f:1.000000
>Ctg23 LN:i:172637 RC:i:97 XC:f:1.000000
>Ctg24 LN:i:448823 RC:i:233 XC:f:1.000000
>Ctg25 LN:i:286147 RC:i:166 XC:f:1.000000
>Ctg26 LN:i:197611 RC:i:120 XC:f:1.000000
>Ctg27 LN:i:161380 RC:i:89 XC:f:0.987500
>Ctg28 LN:i:1461487 RC:i:718 XC:f:1.000000
>Ctg29 LN:i:1170851 RC:i:532 XC:f:1.000000
>Ctg30 LN:i:126655 RC:i:57 XC:f:1.000000
>Ctg31 LN:i:138331 RC:i:56 XC:f:1.000000
>Ctg32 LN:i:124860 RC:i:64 XC:f:1.000000
>Ctg33 LN:i:856973 RC:i:443 XC:f:1.000000
>Ctg34 LN:i:3534828 RC:i:1583 XC:f:1.000000
>Ctg35 LN:i:274507 RC:i:123 XC:f:1.000000
>Ctg36 LN:i:1130814 RC:i:571 XC:f:1.000000
>Ctg37 LN:i:4129422 RC:i:1955 XC:f:0.999879
>Ctg38 LN:i:503305 RC:i:211 XC:f:1.000000
>Ctg39 LN:i:193693 RC:i:88 XC:f:0.982005
>Ctg40 LN:i:269709 RC:i:129 XC:f:1.000000
>Ctg41 LN:i:820144 RC:i:410 XC:f:1.000000
>Ctg42 LN:i:51742 RC:i:29 XC:f:1.000000
>Ctg43 LN:i:221274 RC:i:78 XC:f:1.000000
>Ctg44 LN:i:138415 RC:i:52 XC:f:0.996403
>Ctg45 LN:i:413622 RC:i:242 XC:f:1.000000
>Ctg46 LN:i:207401 RC:i:105 XC:f:0.997619
>Ctg47 LN:i:139703 RC:i:103 XC:f:1.000000
>Ctg48 LN:i:776127 RC:i:362 XC:f:0.999354
>Ctg49 LN:i:303925 RC:i:93 XC:f:1.000000
>Ctg50 LN:i:316126 RC:i:159 XC:f:1.000000
>Ctg51 LN:i:297472 RC:i:151 XC:f:1.000000
>Ctg52 LN:i:184521 RC:i:102 XC:f:1.000000
>Ctg53 LN:i:306927 RC:i:121 XC:f:1.000000
>Ctg54 LN:i:1151566 RC:i:553 XC:f:1.000000
>Ctg55 LN:i:111607 RC:i:57 XC:f:1.000000
>Ctg56 LN:i:148688 RC:i:185 XC:f:1.000000
>Ctg57 LN:i:257135 RC:i:144 XC:f:1.000000
>Ctg58 LN:i:918058 RC:i:502 XC:f:1.000000
>Ctg59 LN:i:80759 RC:i:66 XC:f:1.000000
>Ctg60 LN:i:237287 RC:i:147 XC:f:1.000000
>Ctg61 LN:i:158460 RC:i:114 XC:f:1.000000
>Ctg62 LN:i:466534 RC:i:253 XC:f:1.000000
>Ctg63 LN:i:158967 RC:i:101 XC:f:1.000000
>Ctg64 LN:i:286697 RC:i:196 XC:f:1.000000
>Ctg65 LN:i:834810 RC:i:368 XC:f:1.000000
>Ctg66 LN:i:211910 RC:i:93 XC:f:1.000000
>Ctg67 LN:i:450778 RC:i:224 XC:f:1.000000
>Ctg68 LN:i:177777 RC:i:98 XC:f:0.997214
>Ctg69 LN:i:197143 RC:i:120 XC:f:1.000000
>Ctg70 LN:i:164444 RC:i:97 XC:f:1.000000
>Ctg71 LN:i:211925 RC:i:119 XC:f:1.000000
>Ctg72 LN:i:1021905 RC:i:487 XC:f:0.994126
>Ctg73 LN:i:551019 RC:i:301 XC:f:1.000000
>Ctg74 LN:i:165782 RC:i:102 XC:f:1.000000
>Ctg75 LN:i:304185 RC:i:155 XC:f:1.000000
>Ctg76 LN:i:97480 RC:i:57 XC:f:1.000000
>Ctg77 LN:i:98052 RC:i:42 XC:f:1.000000
>Ctg78 LN:i:162220 RC:i:82 XC:f:1.000000
>Ctg79 LN:i:257157 RC:i:114 XC:f:1.000000
>Ctg80 LN:i:59508 RC:i:49 XC:f:1.000000
>Ctg81 LN:i:94758 RC:i:22 XC:f:1.000000
>Ctg82 LN:i:316552 RC:i:160 XC:f:1.000000
>Ctg83 LN:i:86210 RC:i:62 XC:f:1.000000
>Ctg84 LN:i:218525 RC:i:146 XC:f:0.997712
>Ctg85 LN:i:131393 RC:i:108 XC:f:1.000000
>Ctg86 LN:i:127402 RC:i:74 XC:f:1.000000
>Ctg87 LN:i:171301 RC:i:72 XC:f:1.000000
>Ctg88 LN:i:183238 RC:i:94 XC:f:1.000000
>Ctg89 LN:i:193594 RC:i:105 XC:f:1.000000
>Ctg90 LN:i:206238 RC:i:110 XC:f:1.000000
>Ctg91 LN:i:137318 RC:i:98 XC:f:1.000000
>Ctg92 LN:i:178419 RC:i:61 XC:f:1.000000
>Ctg93 LN:i:297386 RC:i:186 XC:f:1.000000
>Ctg94 LN:i:540780 RC:i:308 XC:f:1.000000
>Ctg95 LN:i:39546 RC:i:24 XC:f:1.000000
>Ctg96 LN:i:330805 RC:i:203 XC:f:1.000000
>Ctg97 LN:i:75586 RC:i:97 XC:f:0.993421
>Ctg98 LN:i:208066 RC:i:97 XC:f:0.995215
>Ctg99 LN:i:92142 RC:i:45 XC:f:1.000000
>Ctg100 LN:i:182924 RC:i:68 XC:f:1.000000

On the other hand, here are the statistics for the contents of the GFA that you sent at the same time:

GFA
#-------------------- GLOBAL STATISTICS -------------------#
N50 size= 4089153  number= 9
N80 size= 180346  number= 68
N90 size= 67653  number= 173
Assembly size= 108734829 number= 627 minSize= 1270 maxSize= 9430185 averageSize= 173421
#----------------------------------------------------------#
#-------------------- SIZE REPARTITION --------------------#
Size= >= 5000000    Number= 6          (0.96)    CumulativeSize= 42459887   (39.05)
Size= >= 1000000    Number= 17         (2.71)    CumulativeSize= 69207928   (63.65)
Size= >= 100000     Number= 110        (17.54)    CumulativeSize= 92719009    (85.27)
Size= >= 50000      Number= 222        (35.41)    CumulativeSize= 100698443    (92.61)
Size= >= 10000      Number= 515        (82.14)    CumulativeSize= 108061633    (99.38)
Size= >= 5000       Number= 587        (93.62)    CumulativeSize= 108607516    (99.88)
Size= >= 1500       Number= 624        (99.52)    CumulativeSize= 108730659    (100.00)
Size= >= 1000       Number= 627        (100.00)    CumulativeSize= 108734829     (100.00)
#----------------------------------------------------------#

Anything fishy? If I am correct, it's not an issue per se that the gfa is bigger than the final assembly, right?

rvaser commented 5 years ago

The GFA is usually bigger than the assembly, so everything looks fine with the displayed output.

ghost commented 5 years ago

What's the exact reason? For example, if I do awk '/^S/{print ">"$2"\n"$3}' rala_assembly_graph.gfa|fold > test.fa I have

Total size: 108734829
N50: 4089153        L50: 9
N75: 282365     L75: 44
N90: 67653      L90: 173
N99: 12705      L99: 478
Average: 173420

while from the fasta output by Ra

#contigs: 101
Total size: 91975795
N50: 5013827        L50: 7
N75: 1021905        L75: 17
N90: 286147     L90: 45
N99: 111607     L99: 89
Average: 910651
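
For readers comparing the two stat blocks, this is a minimal sketch of how N50/L50-style values are commonly computed from a list of sequence lengths. The function name `nx` and the toy lengths are illustrative only; the tools used in this thread may differ in details such as tie handling.

```python
def nx(lengths, fraction=0.5):
    """Return (Nx, Lx): the length of the shortest sequence needed, and the
    number of sequences needed, to cover `fraction` of the total size when
    sequences are taken from longest to shortest."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no lengths given")


toy_lengths = [10, 8, 6, 4, 2]      # total size 30
print(nx(toy_lengths, 0.5))         # N50 and L50
print(nx(toy_lengths, 0.9))         # N90 and L90
```

Because these values depend on the total size, adding many short sequences (as in the GFA dump) lowers N75/N90 sharply even when the longest sequences are unchanged.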

What does Ra do that my awk one-liner doesn't? That's important to know; otherwise, when editing the GFA manually, I don't get the expected output.

Thanks for the clarification. And I would like to emphasise we greatly appreciate your constant feedback.

rvaser commented 5 years ago

Only unitigs which are at least 10kbp long and consist of at least 6 reads are used as the final assembly. Each sequence header in the gfa has sam tags LN:i:<int> and RC:i:<int>. You can find unitigs by searching headers that start with "Utg", and those that have LN >= 9999 and RC > 5 are kept for the resulting fasta.
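
As a rough sketch of that filter applied to a GFA file, something like the following could replace the awk one-liner. It assumes tab-separated S lines with the sequence in the third column and LN/RC tags after it; the function names and the toy records are made up for illustration and are not taken from the Ra code.

```python
def parse_tags(fields):
    """Parse SAM-style TAG:TYPE:VALUE fields into a dict (integers for type i)."""
    tags = {}
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) == 3:
            tag, typ, value = parts
            tags[tag] = int(value) if typ == "i" else value
    return tags


def kept_unitigs(gfa_lines, min_len=9999, min_reads=6):
    """Yield (name, sequence) for S lines whose name starts with "Utg",
    with LN >= min_len and RC >= min_reads (i.e. RC > 5)."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip links and other record types
        cols = line.rstrip("\n").split("\t")
        name, seq = cols[1], cols[2]
        tags = parse_tags(cols[3:])
        if (name.startswith("Utg")
                and tags.get("LN", len(seq)) >= min_len
                and tags.get("RC", 0) >= min_reads):
            yield name, seq


# Toy GFA records (not real data).
toy_gfa = [
    "S\tUtg1\t" + "A" * 10000 + "\tLN:i:10000\tRC:i:6",
    "S\tUtg2\t" + "C" * 500 + "\tLN:i:500\tRC:i:40",      # too short
    "S\tUtg3\t" + "G" * 12000 + "\tLN:i:12000\tRC:i:5",   # too few reads
    "S\t1234\t" + "T" * 20000 + "\tLN:i:20000\tRC:i:1",   # not a unitig
    "L\tUtg1\t+\tUtg2\t+\t0M",
]
for name, seq in kept_unitigs(toy_gfa):
    print(">" + name)
    print(seq[:60] + "...")
```

Running the filter on a manually edited GFA should then give a FASTA that is comparable to what Ra writes out, before polishing.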

Could you please run Raven once more without graph postprocessing so we can rule that out? You have to comment out lines here and recompile. I want to see if the assembly size remains much greater than that of Ra.

Thanks a lot for the effort put in evaluating Ra/Raven!:)

ghost commented 5 years ago

Hello, I think it doesn't do what was expected, here is the output stats

N50 assembly_noUnitigFilter.fa 
#contigs: 138
Total size: 92802999
N50: 1526535            L50: 16
N75: 589585             L75: 41
N90: 230087             L90: 80
N99: 124719             L99: 129
Average: 672485

I commented out the following in graph.cpp:

//    create_unitigs(42);
//    for (std::uint32_t i = 0; i < 16; ++i) {
//        create_force_directed_layout();
//        remove_long_edges();
//        remove_tips();
//    }

So it seems somehow the graph is getting trimmed anyway.

rvaser commented 5 years ago

Thanks for running Raven again. Given that the default Raven assembly is longer by 16Mbp, I probably did not properly clean removed edges. I found a similar increase in size on one Drosophila dataset and will investigate.

ghost commented 5 years ago

Hello,

did you make any progress on the issue? Just curious, I am also willing to invest time in testing if it can be of any help (well, you obviously have your own test dataset but just in case).

rvaser commented 5 years ago

I am still trying to figure out what is the reason for the increased size. I tried the same heuristic directly on the Ra assembly. On one set it decreases the total size by a little, while on the other it increases it significantly. I'll get back to you soon.

ghost commented 4 years ago

May I suggest checking whether the increased size is due to non-collapsed allelic pairs? In the case of my organism this seems to be the case (sorry, the suggestion might come oddly late). If this is not the case, I would like to know, if possible.

rvaser commented 4 years ago

That could be the problem. Can you please run Raven again with some modifications? I have been playing with this for a while now. The first modification should be put here and checks whether some contigs are contained in others (it will not change the current assembly, just print a log).

auto me = ram::createMinimizerEngine(15, 5, thread_pool);
me->minimize(contigs.begin(), contigs.end());
me->filter(0.001);

std::uint64_t len = 0;
for (const auto& it: contigs) len += it->data.size();
std::cerr << "Contigs = " << contigs.size() << std::endl;
std::cerr << "Assembly length = " << len << std::endl;

std::uint64_t cc = 0;
len = 0;
for (const auto& it: contigs) {
    auto overlaps = me->map(it, true, false);
    bool skip = false;
    for (const auto& jt: overlaps) {
        if (jt.q_end - jt.q_begin > 0.90 * it->data.size()) { skip = true; break; }
    }
    if (!skip) { len += it->data.size(); ++cc; }
}
std::cerr << "Contigs = " << cc << std::endl;
std::cerr << "Assembly length = " << len << std::endl;
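
The containment test in the snippet above can be summarised independently of the ram API as follows. This is only a sketch of the logic: the overlaps are assumed to be precomputed (query begin, query end) pairs against other contigs, and the function name and toy values are hypothetical.

```python
def filter_contained(contig_lengths, overlaps, min_fraction=0.90):
    """Return names of contigs that are NOT covered by a single overlap
    spanning more than min_fraction of their length.

    contig_lengths: dict mapping contig name -> length
    overlaps: dict mapping contig name -> list of (q_begin, q_end) on that
              contig, assumed to come from mapping it against other contigs
    """
    kept = []
    for name, length in contig_lengths.items():
        contained = any(
            q_end - q_begin > min_fraction * length
            for q_begin, q_end in overlaps.get(name, [])
        )
        if not contained:
            kept.append(name)
    return kept


toy_lengths = {"a": 1000, "b": 200, "c": 500}
toy_overlaps = {"b": [(5, 195)], "c": [(0, 100), (300, 450)]}
kept = filter_contained(toy_lengths, toy_overlaps)
print("Contigs =", len(kept))
print("Assembly length =", sum(toy_lengths[name] for name in kept))
```

Note that, as in the C++ snippet, several partial overlaps are not added together; only a single overlap covering more than 90% of a contig marks it as contained.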

The second modification is to reduce unitig ends here. Replace these lines with the ones below. You should run the first modification alone, and later add the second. You should turn off polishing with -p 0 to decrease execution time.

    std::uint32_t start = node->inedges.empty() ? 0 : node->inedges[0]->begin->data.size() - node->inedges[0]->length;
    std::uint32_t length = node->outedges.empty() ? node->data.size() - start : node->outedges[0]->length - start;
    std::string data = node->data.substr(start, length);

    std::string name = "Ctg" + std::to_string(contig_id);
    name += " RC:i:" + std::to_string(node->sequences.size());
    name += " LN:i:" + std::to_string(data.size());

Thanks in advance! Robert

ghost commented 4 years ago

Sure. However, your code doesn't compile:

/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp: In function ‘int main(int, char**)’:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:122:20: error: ‘createMinimizerEngine’ is not a member of ‘ram’
     auto me = ram::createMinimizerEngine(15, 5, thread_pool);
                    ^~~~~~~~~~~~~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:122:20: note: suggested alternative: ‘MinimizerEngine’
     auto me = ram::createMinimizerEngine(15, 5, thread_pool);
                    ^~~~~~~~~~~~~~~~~~~~~
                    MinimizerEngine
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:136:26: error: unable to deduce ‘auto&&’ from ‘overlaps’
     for (const auto& jt: overlaps) {
                          ^~~~~~~~
make[2]: *** [CMakeFiles/raven.dir/build.make:76: CMakeFiles/raven.dir/src/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:79: CMakeFiles/raven.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

I am not familiar at all with C++, is there anything related to indentation I should be particularly cautious with? Otherwise I think I pasted correctly the code snippet.

rvaser commented 4 years ago

I forgot that you have to add includes at the beginning of the main.cpp file (lines below). Indentation in C++ does not matter :)

#include "ram/overlap.hpp"
#include "ram/minimizer_engine.hpp"

ghost commented 4 years ago

still not compiling

In file included from /mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:2:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: error: ‘ThreadPool’ is not a member of ‘thread_pool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: note: suggested alternative: ‘Threadpool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
                                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: note: suggested alternative: ‘Threadpool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
                                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:60: error: template argument 1 is invalid
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                            ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: error: ‘ThreadPool’ is not a member of ‘thread_pool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: note: suggested alternative: ‘Threadpool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
                                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: note: suggested alternative: ‘Threadpool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
                                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:64: error: template argument 1 is invalid
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                                ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: error: ‘ThreadPool’ is not a member of ‘thread_pool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: note: suggested alternative: ‘Threadpool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: note: suggested alternative: ‘Threadpool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:48: error: template argument 1 is invalid
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: error: ‘ThreadPool’ is not a member of ‘thread_pool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: note: suggested alternative: ‘Threadpool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: note: suggested alternative: ‘Threadpool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:44: error: template argument 1 is invalid
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                            ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp: In function ‘int main(int, char**)’:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:124:49: error: cannot convert ‘std::shared_ptr<thread_pool::ThreadPool>’ to ‘int’
     auto me = ram::createMinimizerEngine(15, 5, thread_pool);
                                                 ^~~~~~~~~~~
In file included from /mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:2:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:62: note:   initializing argument 3 of ‘std::unique_ptr<ram::MinimizerEngine> ram::createMinimizerEngine(uint8_t, uint8_t, int)’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:138:26: error: unable to deduce ‘auto&&’ from ‘overlaps’
     for (const auto& jt: overlaps) {
                          ^~~~~~~~
make[2]: *** [CMakeFiles/raven.dir/build.make:76: CMakeFiles/raven.dir/src/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:79: CMakeFiles/raven.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

rvaser commented 4 years ago

Which Raven version do you have? Please run git log and paste the last commit.

ghost commented 4 years ago

Here it is commit 7f9d72c2cebdb77da01422bd5106d778d6ac254d

rvaser commented 4 years ago

Try running git submodule update and recompile.

ghost commented 4 years ago

still the same

rvaser commented 4 years ago

Try a fresh install with git clone --recursive https://github.com/lbcb-sci/raven raven_fresh.

ghost commented 4 years ago

still not working

[ 94%] Building CXX object CMakeFiles/raven.dir/src/main.cpp.o
In file included from /mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:2:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: error: ‘ThreadPool’ is not a member of ‘thread_pool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: note: suggested alternative: ‘Threadpool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
                                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:50: note: suggested alternative: ‘Threadpool’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                  ^~~~~~~~~~
                                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:60: error: template argument 1 is invalid
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                            ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: error: ‘ThreadPool’ is not a member of ‘thread_pool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: note: suggested alternative: ‘Threadpool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
                                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:54: note: suggested alternative: ‘Threadpool’
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                      ^~~~~~~~~~
                                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:61:64: error: template argument 1 is invalid
         std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                                ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: error: ‘ThreadPool’ is not a member of ‘thread_pool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: note: suggested alternative: ‘Threadpool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:38: note: suggested alternative: ‘Threadpool’
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                      ^~~~~~~~~~
                                      Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:64:48: error: template argument 1 is invalid
         std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                                                ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: error: ‘ThreadPool’ is not a member of ‘thread_pool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: note: suggested alternative: ‘Threadpool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: error: ‘ThreadPool’ is not a member of ‘thread_pool’
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:34: note: suggested alternative: ‘Threadpool’
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                  ^~~~~~~~~~
                                  Threadpool
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:77:44: error: template argument 1 is invalid
     std::shared_ptr<thread_pool::ThreadPool> thread_pool_;
                                            ^
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp: In function ‘int main(int, char**)’:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:124:45: error: cannot convert ‘std::shared_ptr<thread_pool::ThreadPool>’ to ‘int’
 auto me = ram::createMinimizerEngine(15, 5, thread_pool);
                                             ^~~~~~~~~~~
In file included from /mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:2:
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/vendor/ram/include/ram/minimizer_engine.hpp:29:62: note:   initializing argument 3 of ‘std::unique_ptr<ram::MinimizerEngine> ram::createMinimizerEngine(uint8_t, uint8_t, int)’
     std::uint8_t w, std::shared_ptr<thread_pool::ThreadPool> thread_pool);
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
/mnt/sda1/Alessandro/18-10-2019_Raven_update/raven/src/main.cpp:138:26: error: unable to deduce ‘auto&&’ from ‘overlaps’
     for (const auto& jt: overlaps) {
                          ^~~~~~~~
make[2]: *** [CMakeFiles/raven.dir/build.make:76: CMakeFiles/raven.dir/src/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:79: CMakeFiles/raven.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

rvaser commented 4 years ago

Okay, try replacing this with #include "thread_pool/thread_pool.hpp".

rvaser commented 4 years ago

If you still can't resolve this, send me your compiler version please.

ghost commented 4 years ago

Okay just tried it and it compiled!

ghost commented 4 years ago

N50 assembly.fa
#contigs: 103
Total size: 105982982
N50: 6649164            L50: 5
N75: 1003025            L75: 20
N90: 372748             L90: 48
N99: 131153             L99: 93
Average: 1028960

That doesn't seem to solve it: the size should be around 90 Mb, so the assembly is still about 15 Mb too long.
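For anyone comparing these numbers: N50 and L50 in reports like the one above are conventionally computed by sorting the contigs longest first and walking down the list until the running sum reaches 50% of the total size. A minimal C++ sketch of that convention follows. The function name `nx_lx` is made up for illustration, and this is not the script that produced the report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Returns {Nx, Lx} for a set of contig lengths; fraction = 0.5 gives N50/L50.
// Nx is the length of the contig at which the running sum of lengths
// (longest first) reaches `fraction` of the total size; Lx is the number
// of contigs needed to get there. (Hypothetical helper, for illustration.)
std::pair<std::uint64_t, std::size_t> nx_lx(std::vector<std::uint64_t> lengths,
                                            double fraction) {
  assert(fraction > 0.0 && fraction <= 1.0);
  if (lengths.empty()) return {0, 0};

  // Longest contigs first.
  std::sort(lengths.begin(), lengths.end(), std::greater<std::uint64_t>());

  const std::uint64_t total =
      std::accumulate(lengths.begin(), lengths.end(), std::uint64_t{0});
  const double threshold = fraction * static_cast<double>(total);

  std::uint64_t running = 0;
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    running += lengths[i];
    if (static_cast<double>(running) >= threshold) {
      return {lengths[i], i + 1};
    }
  }
  // Only reachable through floating-point rounding at fraction == 1.0.
  return {lengths.back(), lengths.size()};
}
```

Calling `nx_lx(lengths, 0.5)` on the contig lengths of an assembly gives the N50/L50 pair, and 0.75, 0.9 or 0.99 give the other rows of such a report.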

rvaser commented 4 years ago

Yeah, the output is not modified as I stated above. Please paste the log that says Contigs = and Assembly length =.

ghost commented 4 years ago

Isn't that what I posted, or do you need the full log? Here it is:

[raven::] loaded sequences 6.252488 s
[raven::Graph::construct] minimized 0 - 18885 / 45244 30.821391 s
[raven::Graph::construct] mapped sequences 23.054006 s
[raven::Graph::construct] minimized 18885 - 35303 / 45244 27.966116 s
[raven::Graph::construct] mapped sequences 63.099570 s
[raven::Graph::construct] minimized 35303 - 45244 / 45244 16.563061 s
[raven::Graph::construct] mapped sequences 60.254048 s
[raven::Graph::construct] annotated piles 0.328887 s
[raven::Graph::construct] removed contained sequences 0.101544 s
[raven::Graph::construct] removed chimeric sequences 0.940186 s
[raven::Graph::construct] cleared piles 0.068669 s
[raven::Graph::construct] rearranged sequences 0.085530 s
[raven::Graph::construct] minimized 0 - 7460 / 7460 14.194045 s
[raven::Graph::construct] mapped valid sequences 10.355638 s
[raven::Graph::construct] mapped invalid sequences 70.982857 s
[raven::Graph::construct] updated piles 0.091207 s
[raven::Graph::construct] updated overlaps 0.003082 s
[raven::Graph::construct] rearranged sequences 0.015495 s
[raven::Graph::construct] removed false overlaps 0.356982 s
[raven::Graph::construct] stored nodes 7.818315 s
[raven::Graph::construct] stored edges 0.021052 s
[raven::Graph::construct] 327.157710 s
[raven::Graph::assemble] removed transitive edges 0.028465 s
[raven::Graph::assemble] removed tips and bubbles 37.648326 s
[raven::Graph::assemble] removed long edges 49.543947 s
[raven::Graph::assemble] 88.034824 s
Contigs = 103
Assembly length = 105982982
Contigs = 103
Assembly length = 105982982

thank you

rvaser commented 4 years ago

Thanks for the full log; the last 4 lines were the ones I wanted. The same happens on one of my datasets. I am now investigating whether there is a problem with the overlap step.

rvaser commented 4 years ago

@aderzelle, can you please add the line below here, compile and rerun raven? I think this should do the trick; it decreases the D. melanogaster assembly by 10 Mbp.

num_overlaps[i] = std::min(overlaps[i].size(), static_cast<std::size_t>(16));

ghost commented 4 years ago

Hello, the size is indeed reduced from 105 to 98 Mb. What exactly does the code snippet change?

rvaser commented 4 years ago

This changes the creation of the pile-o-grams, which are used to filter out low-quality regions and even whole reads. I did not handle this properly for reads that have only a few overlaps, and their coverage was artificially increased. As a result, they were not removed and somehow increased the assembly length.
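To make the effect of the one-line change quoted earlier concrete, here is a rough standalone sketch of what the expression computes: a per-read overlap count that is never larger than 16. The `Overlap` struct and the `capped_overlap_counts` wrapper are hypothetical stand-ins, and how `num_overlaps` is then used when building the pile-o-grams is not shown in this thread. The `static_cast` in the original line is needed only because `std::min` requires both arguments to have the same type, and the literal `16` is an `int`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an overlap record; Raven's real type is not
// shown in the thread.
struct Overlap {
  std::size_t begin;
  std::size_t end;
};

// For each read i, report how many overlaps it has, but never more than
// `cap`. The default of 16 is the value used in the quoted line.
std::vector<std::size_t> capped_overlap_counts(
    const std::vector<std::vector<Overlap>>& overlaps, std::size_t cap = 16) {
  assert(cap > 0 && "a cap of zero would discard every overlap count");
  std::vector<std::size_t> num_overlaps(overlaps.size());
  for (std::size_t i = 0; i < overlaps.size(); ++i) {
    // Same clamp as in the quoted line; no cast is needed here because
    // `cap` is already a std::size_t.
    num_overlaps[i] = std::min(overlaps[i].size(), cap);
  }
  return num_overlaps;
}
```

Under these assumptions, a read with 5 overlaps keeps a count of 5, while a read with 40 overlaps is reported as 16.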

Can you also try and change 0.001 to 0.0002 here, compile and run again? This could decrease the size a bit more, maybe :)

ghost commented 4 years ago

I tried the change, but that did not reduce it further. Anyway, raven is producing the best assembly so far out of all the assemblers we tried (and we tried a lot). I think I already asked, but it would be nice if in the future raven could print the GFA as well, as inspecting the GFA can be useful, especially in "hard to assemble" cases. Thank you

rvaser commented 4 years ago

I will add the feature soon! :)

ghost commented 4 years ago

Hello, just to give you an update: the assembly looks fine, with the exception of the KAT k-mer completeness, which is at 47%, while we get 49.99% with NextDeNovo. I mean, Raven runs fine, but I thought you would like to know.

cheers