pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
191 stars 39 forks source link

odgi extract error #516

Open GeorgeBGM opened 1 year ago

GeorgeBGM commented 1 year ago

Hi,I got the following error with the command(odgi extract -i GRAPH -b BED -o - -c 0 -d 10 --full-range -t 30 -P ), how should I fix it?

image

Besides, how can I keep more additional information (not just location information) in the BED file to the sequence ID in using the B command?

AndreaGuarracino commented 1 year ago

Hi @George-du, can you share the input graph and BED files to reproduce the issue?

Which kind of information do you want to keep and where?

GeorgeBGM commented 1 year ago
  Hi, the **input graph file** comes from project **_HPRC_**(https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz),**the BED files** can be downloaded via this link(https://www.aliyundrive.com/s/2zvSR8iXaUH).

The operational process is as follows:

Step1: vg convert -gfW hprc-v1.0-mc-grch38.gfa(input graph file) -t 40 >hprc-v1.0-mc-grch38_Non-W.gfa odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og --threads 40 -P Step2: odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b BED(the BED files) -o - -c 0 -d 10 --full-range -t 30 -P|odgi paths -t 30 -i - -f >need.fa

The first question, the program produces the above error, how should I solve it?

The second question, how can I keep the fourth column of annotation information(Bolded parts) in the BED file into the Reads ID of the finally extracted FASTA sequence?

The BED file format is as follows: GRCh38.chr1 38256 40294 GRCh38.1.chr1:38256-40294 GRCh38.chr1 41394 42273 GRCh38.2.chr1:41394-42273

The finally extracted FASTA sequence:

sample:1-1000 GRCh38.1.chr1:38256-40294 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I'm looking forward for your reply. Thanks.

GeorgeBGM commented 1 year ago

Hi, I'm curious if I described the problem clearly and if there are some suggestions about the solution to this problem?

GeorgeBGM commented 1 year ago

Hi, I'm concerned about a question and would like to know if there are any suggestions? Is this error being reported because of a haplotype breakpoint region?

Thank you in advance.

subwaystation commented 1 year ago

Hi @George-du,

you have a mc graph as input, I am not sure it will work with our tools, but I am investigating. Regarding your 2nd question, we don't support this functionality at the moment. odgi extract will format the path names according to your BED query. What you want as annotation should already be the final path name, if I am not mistaken. So no need to worry.

GeorgeBGM commented 1 year ago

Hi, I'm glad to hear from you. I also used the PGGB graph as input(https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/hprc-v1.0-pggb.gfa.gz),but it‘s outputting similar error message. Is it true that maybe this error reporting is related to some haplotype breakpoint region and how should we fix this issue, thanks a lot for some attempts being made from your side. As for the 2nd question, I'm aiming for sequential de-redundancy for specific regions, so I want to keep some annotation information. For example, for chr1:1-200 and chr1:300-500 regions, I would like to perform de-redundancy on the pan-genomic sequences extracted from chr1:1-200 and chr1:300-500 regions of the genome respectively, but the extraction of each region is very slow, and if all the regions are extracted together, I can't distinguish which pan-genomic sequences belong to the chr1:1-200 region and those sequences belong to chr1:300-500 region.

Thank you very much for your efforts in trying to solve this problem. Best,Du

GeorgeBGM commented 1 year ago

hi @subwaystation @AndreaGuarracino ,

Dear Developer, I would like to ask if there is some progress on this issue as of now? Or is there another solution suggested?

subwaystation commented 1 year ago

Hi @George-du,

sorry for the long wait. What I was able to do so far:

wget https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz
gunzip hprc-v1.0-mc-grch38.gfa.gz
vg convert -gfW hprc-v1.0-mc-grch38.gfa -t 28 > hprc-v1.0-mc-grch38_Non-W.gfa
odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og -t 28 -P

However, I can't download your BED file, since the font is an Asian language and don't know where to click.

Regarding your 2nd question: I still can't follow. Could you please prove a tiny step-by-step example of what you want to do and what your expectations are? Thanks!

GeorgeBGM commented 1 year ago

Hi @subwaystation @AndreaGuarracino,

I'm glad to hear from you.

As for the first question, the BED file can be downloaded via this link (https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed).It is worth noting that some changes were made to the chromosome names in the BED file for matching with the GRAPH file.

As for the second question, my expected results are shown in the following steps.

Step1: head -n 1 Tissue_All_Non-Coding-region.bed(https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed)

  The BED file format is as follows:
   GRCh38.chr1  38256   40294   GRCh38.1.chr1:38256-40294
   GRCh38.chr1  41394   42273   GRCh38.2.chr1:41394-42273

Step2: odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o - -c 0 -d 10 --full-range -t 30 -P |odgi paths -t 30 -i - -f >need.fa

   **The finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

    For the subsequent subinterval de-redundant sequences, I hope the output format as follows(keep annotation information beyond the third column in the BED file to the Fasta file).

   **The expected finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Thank you in advance.

Best,Du

GeorgeBGM commented 1 year ago

Hi, developers! @subwaystation

Has there been any progress on this issue so far? Or some suggestions?

I'm looking forward for your reply. Thanks!

subwaystation commented 1 year ago

Hi @George-du, I am sorry you have to wait so long. Vacation time.... I will come back to you next week!

GeorgeBGM commented 1 year ago

Hi, developers! @subwaystation, Nice to hear from you! Thanks!

subwaystation commented 1 year ago

Dear @George-du,

First question.

When running odgi extract one runs into a Segmentation fault (core dumped):

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.34e+01 bp/s elapsed: 00:00:11:21 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.66e+02 bp/s elapsed: 00:00:00:58 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
Segmentation fault (core dumped)

I am re-running with GDB now. @AndreaGuarracino Do you have any idea why? Would a duplicate entry in the BED file cause such a problem? @George-du Since you are extracting from 2 chromosomes only, you can try odgi explode and then work with smaller graphs.

I finally understood your 2nd question. You need this because you have two BED entries which target the same region in the reference, but you annotated them from different samples.

grep 77707167 Tissue_All_Non-Coding-region.bed
GRCh38.chr1     77707167        77707582        GRCh38.9835.chr1:77707167-77707582
GRCh38.chr1     77707167        77707582        GRCh38.9836.chr1:77707167-77707582

However, odgi does only store the path name which usually corresponds to the FASTA identifier. We don't store additional meta information. I am not sure how odgi extract behaves when you enter the same identifier twice. Maybe @AndreaGuarracino knows more? So after the extraction and the FASTA conversion, you have to add your meta information in the BED manually to the FASTA identifier. I am afraid there is currently no other way. Else we would have to change the whole odgi data structure model.

subwaystation commented 1 year ago

The full stack trace.

odgi: /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201: odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const: Assertion `false' failed.
--Type <RET> for more, q to quit, c to continue without paging--

Thread 27 "odgi" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fe93b5e6700 (LWP 715612)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff774c859 in __GI_abort () at abort.c:79
#2  0x00007ffff774c729 in __assert_fail_base (fmt=0x7ffff78e2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=<optimized out>) at assert.c:92
#3  0x00007ffff775dfd6 in __GI___assert_fail (assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=0x555555bc2940 "odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const") at assert.c:101
#4  0x000055555559df02 in odgi::graph_t::get_path_metadata (this=0x7fffffffb540, path=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201
#5  0x00005555555a4c88 in odgi::graph_t::append_step (this=0x7fffffffb540, path=..., to_append=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:1326
#6  0x00005555558a387f in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)>::operator()(const handlegraph::handle_t &) const (__closure=0x7fe939600000, handle=...)
    at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:635
#7  0x00005555558ac575 in std::_Function_handler<void(const handlegraph::handle_t&), odgi::main_extract(int, char**)::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t>*, const std::vector<handlegraph::path_handle_t>&, odgi::graph_t&, std::vector<odgi::path_range_t>, std::vector<std::pair<long unsigned int, long unsigned int> >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)> >::_M_invoke(const std::_Any_data &, const handlegraph::handle_t &) (__functor=..., __args#0=...) at /usr/include/c++/9/bits/std_function.h:300
#8  0x00005555555eb183 in std::function<void (handlegraph::handle_t const&)>::operator()(handlegraph::handle_t const&) const (this=0x7fe93b5e4440, __args#0=...) at /usr/include/c++/9/bits/std_function.h:688
#9  0x00005555555e0379 in odgi::algorithms::for_handle_in_path_range(odgi::graph_t const&, handlegraph::path_handle_t, long, long, std::function<void (handlegraph::handle_t const&)> const&) (source=..., path_handle=..., start=77707167,
    end=77707582, lambda=...) at /home/ubuntu/software/odgi/git/master/src/algorithms/subgraph/extract.cpp:207
#10 0x00005555558aeb26 in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::_ZZN4odgi12main_extractEiPPcENKUlRNS_7graph_tEPSt6vectorIN11handlegraph13path_handle_tESaIS6_EERKS8_S3_S4_INS_12path_range_tESaISC_EES4_ISt4pairImmESaISG_EEmmbbmmmbbE1_clES3_S9_SB_S3_SE_SI_mmbbmmmbb._omp_fn.2(void) () at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:628
#11 0x00007ffff797686e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#12 0x00007ffff7924609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007ffff7849133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
AndreaGuarracino commented 1 year ago

Duplicated path ranges are not supported. Just pushed a fix to check them https://github.com/pangenome/odgi/pull/525

subwaystation commented 1 year ago

Indeed, after removing the duplicated BED entries it works.

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.uniq.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.22e+01 bp/s elapsed: 00:00:11:41 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] adding connecting edges 100.00% @ 1.31e+06 bp/s elapsed: 00:00:00:05 remain: 00:00:00:00
[odgi::extract] adding subpaths 100.00% @ 9.48e+02 bp/s elapsed: 00:00:01:25 remain: 00:00:00:00
[odgi::extract] checking missing edges and empty subpaths 100.00% @ 3.77e+03 bp/s elapsed: 00:00:00:09 remain: 00:00:00:00
GeorgeBGM commented 1 year ago

Dear @subwaystation @AndreaGuarracino,

Wow, that sounds good! I'll try it as soon as possible. I will also try the odgi explode method. Thanks!

Best ,Du

GeorgeBGM commented 12 months ago

Dear @subwaystation @AndreaGuarracino,

I would like to ask if extracting the sequence based on intervals (-b BED file) will extract the excess sequence?(https://github.com/pangenome/odgi/issues/526)

When there are more intervals, a large number of sequences will be extracted, how can I de-redundancy these sequences? Is it possible to divide the data into several parts and redundant them separately or is there any other suggestion? Or do you recommend using VCF files directly?

subwaystation commented 12 months ago

Dear @George-du,

as far as I understood only with -E you will extract excess sequence.

I think there are several options for you:

How would like the VCF file to interact with the graph? Or would you like to project the graph into a VCF and then work with it? If you want to take a look at the variants within your ranges, I would rather first project into a VCF using vg deconstruct and then take a look at the variants. So you don't miss something.

GeorgeBGM commented 12 months ago

Dear @subwaystation @AndreaGuarracino,

I will test it as you suggest. Best, Du

ekg commented 12 months ago

There is odgi procbed to project path annotations into the final path names after extract.

On Fri, Sep 1, 2023, 05:52 George-du @.***> wrote:

Dear @subwaystation https://github.com/subwaystation @AndreaGuarracino https://github.com/AndreaGuarracino,

I will test it as you suggest. Best, Du

— Reply to this email directly, view it on GitHub https://github.com/pangenome/odgi/issues/516#issuecomment-1702559346, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQELHRTSHLQ4YKYIE6VDXYG473ANCNFSM6AAAAAAZ4ZCSVU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

GeorgeBGM commented 11 months ago

Dear @subwaystation @AndreaGuarracino @ekg

I tried it as your suggested and found that this is much slower, which seems to extract one region at a time. Using the following command,1289 regions can be extracted in 20h. _odgi extract -i $GRAPH -b Tissue_All_Non-HERV_Coding-region_Merge_Anno_needshuffle.bed -c 0 -d 0 -t 30 -P -s -E Maybe splitting chromosomes regions (odgi explode) or even cutting the region BED file will speed up this operation, but It will generate a lot of intermediate processing files.

Best, Du

subwaystation commented 10 months ago

@AndreaGuarracino Any ideas how to speed this up?