vgteam / GetBlunted

For bluntifying overlapped GFAs

ERROR: overlap length is > parent node/path length #37

Closed Sebastien-Raguideau closed 2 years ago

Sebastien-Raguideau commented 2 years ago

Hello,

I am still working with the HiFi assembly graph. I am encountering an issue with this subgraph.

As in #36, I am doing a CIGAR correction step beforehand. From the error message I thought the problem might be a CIGAR defining an overlap longer than the unitig length, but after checking for myself, that is not the case. There is, however, one overlap between two unitigs where the overlap length is only 1 nucleotide shorter than the unitig itself. When I run those two unitigs by themselves, get_blunted doesn't throw any error.
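A check of this kind can be sketched in Python (this is only an illustration: the GFA parsing is simplified, and comparing the CIGAR's reference-consuming length against both segments is a rough heuristic, not GetBlunted's exact rule):

```python
import re

def cigar_len(cigar):
    """Sum the reference-consuming operations (M, D, N, =, X) of a CIGAR."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")

def find_oversized_overlaps(gfa_path):
    """Report L-line overlaps whose CIGAR length exceeds either segment."""
    seg_len, links = {}, []
    with open(gfa_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S":                    # segment: name, sequence
                seg_len[fields[1]] = len(fields[2])
            elif fields[0] == "L":                  # link: from, to, overlap CIGAR
                links.append((fields[1], fields[3], fields[5]))
    return [(a, b, cigar) for a, b, cigar in links
            if cigar_len(cigar) > min(seg_len.get(a, 0), seg_len.get(b, 0))]
```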

Can you help me with that?

Best, Seb

P.S. I also have another issue: a subgraph of only 101 edges which has so far run for a week without finishing.

rlorigro commented 2 years ago

Interesting. I will check it out. Regarding the speed issues, we are currently working on that.

rlorigro commented 2 years ago

Hi Seb,

I found the issue with this and made a quick fix. Can you verify that this is working on your whole genome?

We also have encountered a new issue with the banded aligner after reviewing the output graph, and we are working on resolving that. See #40

Sebastien-Raguideau commented 2 years ago

Hi Ryan, I did so, and the same issue cropped up. I cut things into bits again to identify a small example, and surprisingly it happened on the large subgraph I shared with you in #36, slow_1223_edges.gfa. That one was working before, and it now throws an error after about 10 minutes: ERROR: overlap length is > parent node/path length by 1529

Also, I have this example where two independent subgraphs in the same GFA file trigger the issue, but when running get_blunted on each of them independently, everything works fine.

I saw the hundreds and thousands of nodes as in #40 and thought it was my CIGARs being wrong, but I haven't gone through them yet.

rlorigro commented 2 years ago

OK I will see about these new cases then.

rlorigro commented 2 years ago

Hi @Sebastien-Raguideau

The latest PR should resolve this, but let me know if anything comes up again. I also added an executable alongside get_blunted that can help with extracting subgraphs from the original GFA.
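For reference, the core of such a subgraph extraction is a connected-component walk over the link lines. A minimal sketch (an illustration only, not the actual executable; orientations are ignored and the links are treated as undirected):

```python
from collections import defaultdict

def connected_component(gfa_lines, seed):
    """Return the S and L lines of the component containing segment `seed`."""
    adj = defaultdict(set)
    for line in gfa_lines:
        f = line.split("\t")
        if f[0] == "L":                 # link: from-segment f[1], to-segment f[3]
            adj[f[1]].add(f[3])
            adj[f[3]].add(f[1])
    # Depth-first search from the seed segment
    seen, stack = {seed}, [seed]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return [l for l in gfa_lines
            if l.split("\t")[0] in ("S", "L") and l.split("\t")[1] in seen]
```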

@jeizenga is still looking into solutions for #40. We checked the alignments manually using ClustalW, and we think your overlaps are not incorrect. The issue is just the huge difference in length among the sequences in the global alignment.

Sebastien-Raguideau commented 2 years ago

After the last iteration of changes, I tried once again on the full graph; it now takes 7.8 hours instead of the previous 4.5, and this issue is back. As before, the issue appears on the full graph but not on any of its constituent subgraphs.

Here is the log:

WARNING: skipping overlap for which sum of cigar operations is > SINK node length: s23223.utg047301c->s23223.utg047301c

[get_blunted : 7.0 s elapsed] Computing adjacency components...
[get_blunted : 8.0 s elapsed] Total adjacency components: 73177
[get_blunted : 8.0 s elapsed] Computing biclique covers...
[get_blunted : 8.0 s elapsed] Total biclique covers: 22396
[get_blunted : 8.0 s elapsed] Duplicating node termini...
[get_blunted : 10.0 s elapsed] Harmonizing biclique edge orientations...
[get_blunted : 10.0 s elapsed] Aligning overlaps...

[get_blunted : 7.8 h elapsed] Splicing 22396 subgraphs...
[get_blunted : 7.8 h elapsed] Splicing overlapping overlap nodes...
terminate called after throwing an instance of 'std::runtime_error'
  what():  ERROR: overlap length is > parent node/path length by 18446744073709540433

I'll try to find a minimal example, but that may take some time. Let me know if you can do without it.
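Side note: if I compute it right, that enormous number looks like unsigned 64-bit underflow, i.e. a small negative difference (here -11183) wrapped around 2^64:

```python
# Reproducing the wrapped value from the log: a difference of -11183 stored
# in an unsigned 64-bit integer comes out as 18446744073709540433.
MASK64 = 0xFFFFFFFFFFFFFFFF

def as_uint64(x):
    return x & MASK64  # reduce modulo 2**64, like a C++ uint64_t

print(as_uint64(-11183))              # 18446744073709540433, as in the log
print(18446744073709540433 - 2**64)   # -11183, the underlying signed difference
```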

rlorigro commented 2 years ago

Ok, I can take a look if you are able to share the input. If you would prefer, you can email me a temporary link at my GitHub username @ucsc.edu.

If my hunch is correct, then it may be possible to catch this before the alignment step.

rlorigro commented 2 years ago

Unless this is the assembly you are referring to?

https://github.com/vgteam/GetBlunted/files/8079330/example.tar.gz

I can try it out and see

rlorigro commented 2 years ago

This doesn't appear to be the file you are using, so please send it when you get the chance.

rlorigro commented 2 years ago

We were able to reproduce this independently... should be resolved now

Sebastien-Raguideau commented 2 years ago

Cool. I had to ask for permission to share the dataset and never got an answer. I tried the latest version on my full graph and it finished in 8.8 h without issue. :thumbsup:

rlorigro commented 2 years ago

Thanks for testing it out. Do you happen to know what the peak RAM usage was?

Since 8.8 h still seems too long, I implemented multithreading in the latest PR #46, which seems to be working well. Please keep us updated with any issues you run into, especially regarding #40.

Sebastien-Raguideau commented 2 years ago

I ran it twice: once with 50 cores, which took 1.3 h and about 50 GB of RAM, and once with 10 cores (log below), which took 1.6 h and about 20 GB of RAM. The graph itself is about 1.5 GB.

[get_blunted : 7.0 s elapsed] Computing adjacency components...
[get_blunted : 7.0 s elapsed] Total adjacency components: 73177
[get_blunted : 7.0 s elapsed] Computing biclique covers...
[get_blunted : 7.0 s elapsed] Total biclique covers: 22397
[get_blunted : 7.0 s elapsed] Duplicating node termini...
[get_blunted : 9.0 s elapsed] Harmonizing biclique edge orientations...
[get_blunted : 9.0 s elapsed] Aligning overlaps...
[get_blunted : 1.6 h elapsed] Splicing 22397 subgraphs...
[get_blunted : 1.6 h elapsed] Splicing overlapping overlap nodes...
[get_blunted : 1.6 h elapsed] Inferring provenance...
[get_blunted : 1.6 h elapsed] Writing provenance to file: test_map.tsv
[get_blunted : 1.6 h elapsed] Destroying duplicated nodes...
[get_blunted : 1.6 h elapsed] Writing bluntified GFA to file to STDOUT

        Command being timed: "get_blunted -i graph_correct_cigs.gfa -V -t 10 -p test_map.tsv"
        User time (seconds): 16945.02
        System time (seconds): 20951.41
        Percent of CPU this job got: 671%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:34:03
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20334392
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3926786958
        Voluntary context switches: 649737
        Involuntary context switches: 63382
        Swaps: 0
        File system inputs: 0
        File system outputs: 3456752
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
rlorigro commented 2 years ago

Ok, thanks for the info. Could you tell whether CPU usage was consistently high? If there is a file you can share, I can do more testing.

Sebastien-Raguideau commented 2 years ago

In terms of CPU consistency, I can't tell much more than what the log says: averaged over the execution time, the CPU load was 6.71 when given 10 cores, if I understand the output of /usr/bin/time -v correctly. Also, I got authorization to share the graph, so here it is.
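For what it's worth, the 671% figure is just (user + system) CPU time divided by wall-clock time; plugging in the numbers from the log above:

```python
user_s = 16945.02                # "User time (seconds)" from the log
sys_s = 20951.41                 # "System time (seconds)"
wall_s = 1 * 3600 + 34 * 60 + 3  # "1:34:03" wall clock, in seconds

cpu_percent = (user_s + sys_s) / wall_s * 100
print(int(cpu_percent))          # 671, matching "Percent of CPU this job got"
```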