isovic / raptor

Graph-based mapping of long sequences, noisy or HiFi.

BSD 3-Clause Clear License

currently being developed #12

Open jdmontenegro opened 3 years ago

jdmontenegro commented 3 years ago

Hi Ivan,

I am a big fan of your graphmap project and have used it myself to sort out different pathways/haplotypes in an organelle assembly graph. I am facing a new problem which I think raptor would be better suited to tackle. I have produced an assembly graph of a plant genome. I would like to use a platinum-standard linear reference of the genome to find chromosome paths in my assembly graph instead of using traditional reference-guided scaffolding.

So, in essence, I would like to align each platinum-standard chromosome to the assembly graph to find paths in it and produce chromosome-level contigs (or sub-chromosome contigs). That way, I would also be able to find large insertions that are absent from the reference genome but present in my target genome and connected to a chromosome path.

Do you think raptor would be a suitable tool for this? Or should I go and try to adapt graphmap for this purpose?

All the best!

Juan D. Montenegro

isovic commented 3 years ago

Hi Juan,

I'm happy to hear that! :-)

Based on your description, I would say that Raptor would be a good fit for your use case! In this case, the assembled contigs and the GFA graph should be provided as the "target" to the mapper, and the platinum chromosomes as the "query" sequences. (This is because the graph is constructed with respect to the target sequences.) Raptor will construct the graph from the GFA, map query sequences onto the targets linearly, then chain the mappings according to the provided graph and finally align (if the --align option is specified).

Let me know how it goes, I'm excited to hear about the results. Also, if you run into any problems let me know as well and I'll try to fix them (this repo has been stale for a few months due to other more pressing work unfortunately).

Best regards, Ivan.

jdmontenegro commented 3 years ago

Dear Ivan,

Thank you for your reply. I will give raptor a go. I currently have the GFA with the sequences embedded and also the contigs, which should be the longest unique paths in the assembly graph and are constructed from one or more nodes in the graph. That should be OK, right? Or should I dump the node sequences and use those as the linear target instead of the contigs?

Cheers,

Juan D.

jdmontenegro commented 3 years ago

Hi Ivan,

I am getting this error while trying to install raptor in my $HOME on my cluster:

PermissionError: [Errno 13] Permission denied: '/usr/include/ncbi-vdb'
make[1]: [Makefile:41: configure] Error 2
make[1]: Leaving directory '/home/jmontenegro/soft/raptor'
make: [Makefile:49: meson-release-pb] Error 2

It seems Meson is trying to install something into a location where I have no permission. Is there a way to tell the build to install everything locally instead of system-wide?

Cheers,

Juan D.

isovic commented 3 years ago

Hi Juan,

Can you copy/paste the exact command you used to compile raptor? It shouldn't be installing anything by default.

Here's how I would recommend compiling it:

git clone https://github.com/isovic/raptor.git
cd raptor
make

This should automatically build everything. Note that you'll need to have the Meson build system installed.

> Thank you for your reply. I will give raptor a go. I currently have the GFA with the sequences embedded and also the contigs, which should be the longest unique paths in the assembly graph and are constructed from one or more nodes in the graph. That should be OK, right? Or should I dump the node sequences and use those as the linear target instead of the contigs?

Raptor builds the graph from Segments (S lines) as nodes and Link (L) or Edge (E) lines as edges, depending on the GFA format. Segments are the sequences on which mapping is performed. For mapping, it's best if the segments are as long as possible and overlap as little as possible. From your description, I'm not sure about a few things:

Is your input graph in GFA1?
Is the graph composed of reads connected by overlaps as edges? Or is it composed of unitigs/contigs which are linked in the graph by edges?

For GFA-1, Raptor will not take the Path (P) lines into account when building the graph. I would personally recommend using the GFA-2 format instead of GFA-1 because it actually encodes the coordinates of edges on each segment in the graph, making the graph much more useful.
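To make the node/edge description above concrete, here is a minimal Python sketch of reading a GFA file that way: S lines become nodes, L lines (GFA-1) or E lines (GFA-2) become edges, and P lines are skipped. This is only an illustration of the idea, not Raptor's actual parser; the function name `build_graph` and the example segment names are made up.

```python
def build_graph(lines):
    """Collect nodes from S lines and edges from L/E lines; ignore everything else."""
    nodes = {}  # segment name -> sequence length (0 when the sequence is "*")
    edges = []  # (from_segment, from_orient, to_segment, to_orient)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        if record_type == "S" and len(fields) >= 3:
            # GFA-1: S <name> <sequence>; GFA-2: S <name> <length> <sequence>
            is_gfa2_segment = len(fields) > 3 and fields[2].isdigit()
            sequence = fields[3] if is_gfa2_segment else fields[2]
            nodes[fields[1]] = 0 if sequence == "*" else len(sequence)
        elif record_type == "L" and len(fields) >= 5:
            # GFA-1 link: L <from> <from_orient> <to> <to_orient> <overlap>
            edges.append((fields[1], fields[2], fields[3], fields[4]))
        elif record_type == "E" and len(fields) >= 4:
            # GFA-2 edge: E <edge_id> <sid1><+/-> <sid2><+/-> <beg1> <end1> <beg2> <end2> ...
            edges.append((fields[2][:-1], fields[2][-1], fields[3][:-1], fields[3][-1]))
        # P (path) lines and all other record types are deliberately skipped.
    return nodes, edges


# Made-up example graph: two segments, one link, and one path line.
EXAMPLE_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\tedge_1\tACGT",
    "S\tedge_2\tGGCC",
    "L\tedge_1\t+\tedge_2\t-\t0M",
    "P\tcontig_1\tedge_1+,edge_2-\t*",
])

if __name__ == "__main__":
    nodes, edges = build_graph(EXAMPLE_GFA.splitlines())
    print(nodes)
    print(edges)
```

Running something like this on the real graph gives a quick view of how many segments and links it has, which helps answer the two questions above.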

If the graph represents contigs/unitigs via Path lines but the actual graph is represented as reads and their overlaps, it would be best to create a new graph where contiguous sequences are represented in the "assembled" form, just to avoid potential sensitivity loss due to all the overlapping in the graph.

P.S. If your GFA file contains sequences, you can provide it as both the -g parameter and the -r parameter at the same time. The -g will read the graph but not the sequences, and the -r will read the reference sequences from the GFA (but not the graph).

Best regards, Ivan.

jdmontenegro commented 3 years ago

Hi Ivan, thank you for your reply,

I installed meson and ninja into my $HOME using pip: pip install meson ninja --user

Then I moved to the raptor root and compiled with: make release-pb

Should I try to use only "make"?

I am not sure about the version of the GFA, and I cannot find it in the documentation of the tool. I am using Flye 2.8.2 to perform de novo assembly of PacBio CLR reads. The assembler produces linear contigs/scaffolds as FASTA and an assembly graph. The contigs/scaffolds are paths inside the graph and are usually formed by one or more segments. The contigs are extended in the graph as far as possible until they reach a dead end or an unresolved repeat. It is probably GFA1, but I have already asked in the forum, so I should hear back from them shortly.

Thank you for the last tip. The GFA does contain embedded sequences, so I can provide it for both parameters.

I will try to recompile and I'll get back to you soon.

Cheers,

Juan D.

jdmontenegro commented 3 years ago

Well, I finally got it to compile correctly. Perhaps I had an unmet dependency when trying to compile with BAM support.

Thank you, I will give it a try now

Cheers,

Juan D.

jdmontenegro commented 3 years ago

I just tried it. The compilation worked, but the tests failed:

$ make unit
make install BDIR=meson-release-pb
make[1]: Entering directory '/home/jmontenegro/soft/raptor'
ninja -C meson-release-pb reconfigure
ninja: Entering directory `meson-release-pb'
ninja: error: loading 'build.ninja': No such file or directory
make[1]: *** [Makefile:34: install] Error 1
make[1]: Leaving directory '/home/jmontenegro/soft/raptor'
make: *** [Makefile:78: release-pb] Error 2

$ make cram
git submodule update --init third-party/cram
Submodule 'third-party/cram' (https://github.com/brodie/cram.git) registered for path 'third-party/cram'
Cloning into '/home/jmontenegro/soft/raptor/third-party/cram'...
Submodule path 'third-party/cram': checked out '59c164dfa6cbe4845aad2c958e77695073d5e802'
scripts/cram -E tests/cram/local/*.t tests/cram/local-graph/*.t
###################################
### This script is a workaround
### to run Cram tests, without actually
### installing the Python package.
### Based on: https://github.com/PacificBiosciences/unanimity/blob/develop/scripts/cram
###################################

CRAM_SCRIPT_PATH="$( cd "$(dirname "$0")" ; pwd -P )"
+++ dirname scripts/cram
++ cd scripts
++ pwd -P
+ CRAM_SCRIPT_PATH=/home/jmontenegro/soft/raptor/scripts
PROJECT_DIR=${CRAM_SCRIPT_PATH}/../
+ PROJECT_DIR=/home/jmontenegro/soft/raptor/scripts/../
BIN_DIR=${CRAM_SCRIPT_PATH}/../install/bin/
+ BIN_DIR=/home/jmontenegro/soft/raptor/scripts/../install/bin/
CRAM_DIR=${PROJECT_DIR}/third-party/cram/
+ CRAM_DIR=/home/jmontenegro/soft/raptor/scripts/..//third-party/cram/
LD_LIBRARY_PATH=${BIN_DIR}/../lib64:${BIN_DIR}/../lib:${LD_LIBRARY_PATH}
+ LD_LIBRARY_PATH=/home/jmontenegro/soft/raptor/scripts/../install/bin//../lib64:/home/jmontenegro/soft/raptor/scripts/../install/bin//../lib:/home/jmontenegro/soft/raptor/install/lib64:/home/jmontenegro/soft/raptor/install/lib:/soft/Python-3.9.4/lib:/soft/fuentes/magicblas/ncbi-magicblast-1.5.0-src/c++/local/ncbi-vdb-2.9.4-1/lib64:
export PROJECT_DIR
+ export PROJECT_DIR
export CRAM_DIR
+ export CRAM_DIR
export BIN_DIR
+ export BIN_DIR
export LD_LIBRARY_PATH
+ export LD_LIBRARY_PATH

# ls -l ${BIN_DIR}/../lib64
# ls -l ${BIN_DIR}/../lib

ldd ${BIN_DIR}/raptor
+ ldd /home/jmontenegro/soft/raptor/scripts/../install/bin//raptor
    linux-vdso.so.1 (0x00007fff737cc000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fca939f3000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fca9365e000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fca932dc000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fca930c4000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fca92ea4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fca92ae2000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fca93c0a000)
ldd -r ${BIN_DIR}/raptor
+ ldd -r /home/jmontenegro/soft/raptor/scripts/../install/bin//raptor
    linux-vdso.so.1 (0x00007ffeec5cc000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f737a3e9000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f737a054000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f7379cd2000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7379aba000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f737989a000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f73794d8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f737a600000)

# if we have cram already, use that
if [[ -x "$(which cram)" ]]; then
    exec cram "$@"
fi
++ which cram
which: no cram in (/home/jmontenegro/soft/raptor/install/bin:/soft/Python-3.9.4/bin/:/home/jmontenegro/.local/bin:/home/jmontenegro/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/c3/bin:/sbin:/bin)
+ [[ -x '' ]]

exec env PYTHONPATH=$CRAM_DIR $CRAM_DIR/scripts/cram "$@"
+ exec env PYTHONPATH=/home/jmontenegro/soft/raptor/scripts/..//third-party/cram/ /home/jmontenegro/soft/raptor/scripts/..//third-party/cram//scripts/cram -E tests/cram/local/test-10-bugfixes-various.t tests/cram/local/test-11-test_inputs.t tests/cram/local/test-12-suppl-secondary.t tests/cram/local/test-13-hifi-ovl.t tests/cram/local/test-1-single-real-read.t tests/cram/local/test-2-linear-mapping.t tests/cram/local/test-3-linear-aln.t tests/cram/local/test-4-no_duplicates.t tests/cram/local/test-5-raptor-reshape.t tests/cram/local/test-6-linear-simple-exact-match.t tests/cram/local/test-7-overlapping.t tests/cram/local/test-8-raptor-fetch.t tests/cram/local/test-9-graphsim.t tests/cram/local-graph/test-1-linear-aln.t tests/cram/local-graph/test-2-circular-aln.t tests/cram/local-graph/test-3-transcriptome-linear.t tests/cram/local-graph/test-4-transcriptome-circular.t tests/cram/local-graph/test-5-asm-graph-bubbles.t tests/cram/local-graph/test-6-graph-aln.t tests/cram/local-graph/test-7-unrolled.t
/usr/bin/env: ‘python’: No such file or directory
make: *** [Makefile:118: cram-local] Error 127

Any suggestion?

isovic commented 3 years ago

Hi Juan,

> I am not sure about the version of the GFA, and I cannot find it in the documentation of the tool. I am using Flye 2.8.2 to perform de novo assembly of PacBio CLR reads. The assembler produces linear contigs/scaffolds as FASTA and an assembly graph. The contigs/scaffolds are paths inside the graph and are usually formed by one or more segments. The contigs are extended in the graph as far as possible until they reach a dead end or an unresolved repeat. It is probably GFA1, but I have already asked in the forum, so I should hear back from them shortly.

This sounds good, I'd say.

The GFA version should (hopefully) also be written in the first line of the GFA file (the header), unless the tool did not write one. Another quick way to figure it out: does the graph contain L lines (GFA-1) or E lines (GFA-2) for edges?
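The two checks described above can be sketched in a few lines of Python. This is a rough helper under stated assumptions, not part of Raptor: the function name `guess_gfa_version` is hypothetical, the header is assumed to carry the version in a `VN:Z:` tag, and the fallback simply treats L lines as a sign of GFA-1 and E lines as a sign of GFA-2.

```python
def guess_gfa_version(lines):
    """Return the version from the H line if present, else guess from L/E lines."""
    header_version = None
    saw_link = False  # L lines are used for edges in GFA-1
    saw_edge = False  # E lines are used for edges in GFA-2
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "H":
            for tag in fields[1:]:
                if tag.startswith("VN:Z:") and header_version is None:
                    header_version = tag[len("VN:Z:"):]
        elif fields[0] == "L":
            saw_link = True
        elif fields[0] == "E":
            saw_edge = True
    if header_version is not None:
        return header_version
    if saw_link and not saw_edge:
        return "1.0"
    if saw_edge and not saw_link:
        return "2.0"
    return None  # no header and no unambiguous edge records


if __name__ == "__main__":
    # Small made-up inputs; with a real file, pass the open file handle instead.
    print(guess_gfa_version(["H\tVN:Z:1.0", "S\tedge_1\tACGT"]))
    print(guess_gfa_version(["S\ta\t4\tACGT", "E\t*\ta+\tb-\t0\t4$\t0\t3$\t*"]))
```

The header takes precedence when it exists; the L/E check is only a fallback for files written without a version tag.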

> Thank you for the last tip. The GFA does contain embedded sequences, so I can provide it for both parameters.

In this case, just provide the sequences as you normally would, e.g. -r contigs.fasta -g graph.gfa.

> make release-pb. Should I try to use only "make"?

make release-pb compiles the code with the BAM parsing feature, which depends on the PacBio PBBAM library. It's a fairly heavy dependency (many source files to compile), most of which come from the Htslib dependency. It also requires Boost to be installed on your system. (Meson has a nice way of resolving most of the dependencies, but Boost is still an exception, unfortunately.)

So, if you don't really require the BAM support, I'd recommend compiling without it. It will make the process quicker.

> I just tried it. The compilation worked, but the tests failed:

That's pretty strange. Can you try manually running the unit test binary without the Makefile like so:

meson-release/tests_raptor

As for the Cram tests, it appears it cannot find Python. Is it installed on your system?

Best regards, Ivan.

jdmontenegro commented 3 years ago

Hi Ivan,

Thank you for your reply. I think we finally got it installed in the cluster.

I found a header in the GFA file: H VN:Z:1.0

Is that a version 1 GFA? It also contains "L" and "S" lines, but no "E" lines.

Cheers,

Juan D

isovic commented 3 years ago

Hi Juan,

Happy to hear that!

> H VN:Z:1.0

Yes, that header line defines this file as GFA-1. (In case it's useful, the concrete spec can be found here: https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md .)

Best regards, Ivan.

jdmontenegro commented 3 years ago

Hi Ivan,

We have run the alignment and obtained the PAF file. How would you recommend producing paths in the graph from the PAF alignment?

Cheers,

Juan D

isovic commented 3 years ago

Hi Juan,

The primary+supplementary alignments should give you the path in the graph. The graph alignment is a linear walk through nodes (segments). If the alignment spans multiple segments in the graph, it will be split-aligned. Each of these alignment portions corresponds to a single chunk of query aligned to a segment in the graph. You can sort the alignments by query coordinates to see the order in which the graph is being traversed. Also, there are three tags in the output which mark the order of the alignments in the graph:

pi - Path ID. Primary path has the ID of 0. Secondary paths (if they are reported) will have pi > 0.
pj - ID of the segment in the aligned path. This provides the linear ordering of aligned chunks within one path.
pn - Number of alignment segments within the path (the number of pj portions).

Note that, if you do not specify the --align option, then only the mapping will be reported, without alignment. (It still works with the graph, but you won't get the CIGAR out or perfectly accurate split coordinates.)
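As a rough illustration of the procedure described above, the following Python sketch groups PAF records by query, keeps only the primary path (pi equal to 0), and orders the chunks by pj to list the segments in traversal order. It assumes the standard 12 mandatory PAF columns and assumes the tags are written as integer-typed optional fields such as `pi:i:0`; the function names and the example records are made up for illustration and are not taken from Raptor's output.

```python
from collections import defaultdict


def parse_paf_line(line):
    """Parse the PAF columns needed here, plus any optional TAG:TYPE:VALUE fields."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "query": fields[0],
        "query_start": int(fields[2]),
        "query_end": int(fields[3]),
        "strand": fields[4],
        "target": fields[5],
        "tags": {},
    }
    for tag in fields[12:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record


def walks_from_paf(lines, path_id=0):
    """Return {query: [(segment, strand), ...]} for the chosen path ID."""
    chunks = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = parse_paf_line(line)
        # Records without a pi tag, or belonging to another path, are skipped.
        if record["tags"].get("pi") == path_id:
            chunks[record["query"]].append(record)
    walks = {}
    for query, records in chunks.items():
        # Order by pj; query start is used as a tie-breaker.
        records.sort(key=lambda r: (r["tags"].get("pj", 0), r["query_start"]))
        walks[query] = [(r["target"], r["strand"]) for r in records]
    return walks


# Made-up records, deliberately out of order, with one secondary path (pi = 1).
EXAMPLE_PAF = [
    "chr1\t1000\t400\t1000\t-\tedge_2\t600\t0\t600\t600\t600\t60\tpi:i:0\tpj:i:1\tpn:i:2",
    "chr1\t1000\t0\t400\t+\tedge_1\t400\t0\t400\t400\t400\t60\tpi:i:0\tpj:i:0\tpn:i:2",
    "chr1\t1000\t0\t400\t+\tedge_3\t400\t0\t400\t380\t400\t0\tpi:i:1\tpj:i:0\tpn:i:1",
]

if __name__ == "__main__":
    print(walks_from_paf(EXAMPLE_PAF))
```

Comparing the number of chunks found for a query against its pn value is a cheap sanity check that no portion of the path was lost while filtering.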

Best regards, Ivan.