malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License

Ragtag.py merge RuntimeError: only complete components can be added to the graph error #156

Open benyoung93 opened 1 year ago

benyoung93 commented 1 year ago

Good morning

I have been troubleshooting this for a day to no avail, and as such am posting this question about my error message.

Command

ragtag.py merge ../../hifi_assemblay/all_contam_rem/Ofav_hifiasm_allcontrem.fa \
../../longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp \
../../ragtag/ofav_scaffold/ragtag.scaffold.agp \
-u

I have checked that my agp files are correct using the inbuilt tool, and they are.

ragtag.py agpcheck ../../longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp ../../ragtag/ofav_scaffold/ragtag.scaffold.agp

    DISCLAIMER:
    This utility performs most (but not all) checks necessary to validate an
    AGP v2.1 file: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/

    Please additionally use the NCBI AGP validator for robust
    validation: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Validation/

Fri Mar 10 10:23:37 2023 --- INFO: Checking /scratch/projects/omics/ofav_genome/longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp ...
Fri Mar 10 10:23:37 2023 --- INFO: Check for /scratch/projects/omics/ofav_genome/longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp is complete with no errors.

Fri Mar 10 10:23:37 2023 --- INFO: Checking /scratch/projects/omics/ofav_genome/ragtag/ofav_scaffold/ragtag.scaffold.agp ...
Fri Mar 10 10:23:37 2023 --- INFO: Check for /scratch/projects/omics/ofav_genome/ragtag/ofav_scaffold/ragtag.scaffold.agp is complete with no errors.

Interestingly, when using the NCBI validation tool I get warnings for the contigs that have not (I think) been scaffolded:

Line | AGP line / validator warning
-- | --
19: | ptg000001l    1   32039455    1   W   ptg000001l  1   32039455    +
  | object name (column 1) is the same as component_id (column 6)
20: | ptg000003l    1   35678035    1   W   ptg000003l  1   35678035    +
  | object name (column 1) is the same as component_id (column 6)
21: | ptg000004l    1   33295526    1   W   ptg000004l  1   33295526    +
  | object name (column 1) is the same as component_id (column 6)
22: | ptg000005l    1   36602845    1   W   ptg000005l  1   36602845    +
  | object name (column 1) is the same as component_id (column 6)
23: | ptg000006l    1   40246328    1   W   ptg000006l  1   40246328    +
  | object name (column 1) is the same as component_id (column 6)
24: | ptg000007l    1   24061036    1   W   ptg000007l  1   24061036    +
  | object name (column 1) is the same as component_id (column 6)
25: | ptg000008l    1   34276962    1   W   ptg000008l  1   34276962    +
  | object name (column 1) is the same as component_id (column 6)
26: | ptg000009l    1   24390148    1   W   ptg000009l  1   24390148    +

Statistics
Objects:    242
- with single component:    235

Scaffolds:  242
- with single component:    235
Object names:   242
  ptg[000001..000249]l: 229
  ntLink_[0..6]:    7
  ptg[000015..000097]c: 6
Components (W): 250
  orientation +:    244
  orientation -:    6
  orientation ? (formerly 0):   0
  orientation na:   0
Component names:    250
  ptg[000001..000250]l: 244
  ptg[000015..000097]c: 6
Gaps (N):   3
- do not break scaffold:    3
  scaffold, linkage yes:    3
Linkage evidence:    
  paired-ends:  3

Would I need to remove these lines with similar information, and only keep the relevant scaffolded lines (i.e. the top lines as shown below)? Or is that a bad idea/100% wrong?

Output from ragtag.py merge

Fri Mar 10 10:20:27 2023 --- VERSION: RagTag v2.1.0
Fri Mar 10 10:20:27 2023 --- WARNING: This is a beta version of `ragtag merge`
Fri Mar 10 10:20:27 2023 --- CMD: ragtag.py merge ../../hifi_assemblay/all_contam_rem/Ofav_hifiasm_allcontrem.fa ../../longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp ../../ragtag/ofav_scaffold/ragtag.scaffold.agp -u
Fri Mar 10 10:20:27 2023 --- INFO: Building the scaffold graph from the AGP files
Traceback (most recent call last):
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/bin/ragtag_merge.py", line 430, in <module>
    main()
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/bin/ragtag_merge.py", line 362, in main
    agp_multi_sg.add_agps(agp_list, in_weights=weight_list, exclusion_set=comp_exclusion_set)
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/lib/python3.7/site-packages/ragtag_utilities/ScaffoldGraph.py", line 606, in add_agps
    for ap in self._get_assembly_points(agp, weight):
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/lib/python3.7/site-packages/ragtag_utilities/ScaffoldGraph.py", line 518, in _get_assembly_points
    raise RuntimeError("only complete components can be added to the graph.")
RuntimeError: only complete components can be added to the graph.

Please also find some additional helpful information from my input files (fasta and the two agps).

head -20 /scratch/projects/omics/ofav_genome/longstitch_new/Ofav_hifiasm_allcontrem.fa.k32.w100.z1000.trimmed_scafs.agp
ntLink_0    1   3452964 1   W   ptg000012l  1   3452964 -
ntLink_0    3452965 3452984 2   N   20  scaffold    yes paired-ends
ntLink_0    3452985 3491959 3   W   ptg000127l  1   38975   +
ntLink_0    3491960 3491979 4   N   20  scaffold    yes paired-ends
ntLink_0    3491980 3516434 5   W   ptg000250l  1   24455   +
ntLink_1    1   19528   1   W   ptg000176l  1   19528   +
ntLink_1    19529   39704   2   W   ptg000221l  1   20176   -
ntLink_2    1   195567  1   W   ptg000067l  1   195567  +
ntLink_2    195568  227286  2   W   ptg000166l  223 31941   +
ntLink_3    1   39034   1   W   ptg000066l  1068    40101   -
ntLink_3    39035   74300   2   W   ptg000107l  979 36244   +
ntLink_4    1   538548  1   W   ptg000025l  1504    540051  -
ntLink_4    538549  607680  2   W   ptg000105l  3854    72985   +
ntLink_5    1   21959414    1   W   ptg000022l  761 21960174    -
ntLink_5    21959415    22011122    2   W   ptg000138l  1   51708   -
ntLink_6    1   11766301    1   W   ptg000002l  1   11766301    +
ntLink_6    11766302    11766321    2   N   20  scaffold    yes paired-ends
ntLink_6    11766322    11770210    3   W   ptg000125l  1   3889    +
ptg000001l  1   32039455    1   W   ptg000001l  1   32039455    +
ptg000003l  1   35678035    1   W   ptg000003l  1   35678035    +

head -20 /scratch/projects/omics/ofav_genome/ragtag/ofav_scaffold/ragtag.scaffold.agp
## agp-version 2.1
# AGP created by RagTag v2.1.0
NW_018148507.1_RagTag   1   46879   1   W   ptg000080l  1   46879   -
NW_018148518.1_RagTag   1   42037   1   W   ptg000142l  1   42037   -
NW_018148539.1_RagTag   1   31891   1   W   ptg000101l  1   31891   -
NW_018148547.1_RagTag   1   21649   1   W   ptg000143l  1   21649   -
NW_018148557.1_RagTag   1   20381   1   W   ptg000046l  1   20381   +
NW_018148565.1_RagTag   1   19361   1   W   ptg000194l  1   19361   -
NW_018148577.1_RagTag   1   32486   1   W   ptg000182l  1   32486   +
NW_018148578.1_RagTag   1   19816   1   W   ptg000110l  1   19816   +
NW_018148594.1_RagTag   1   20901   1   W   ptg000144l  1   20901   +
NW_018148600.1_RagTag   1   36461   1   W   ptg000224l  1   36461   +
NW_018148600.1_RagTag   36462   36561   2   U   100 scaffold    yes align_genus
NW_018148600.1_RagTag   36562   60206   3   W   ptg000219l  1   23645   +
NW_018148606.1_RagTag   1   26009   1   W   ptg000161l  1   26009   +
NW_018148606.1_RagTag   26010   26109   2   U   100 scaffold    yes align_genus
NW_018148606.1_RagTag   26110   69146   3   W   ptg000130l  1   43037   -
NW_018148608.1_RagTag   1   13225   1   W   ptg000245l  1   13225   +
NW_018148618.1_RagTag   1   35247   1   W   ptg000056l  1   35247   +
NW_018148627.1_RagTag   1   21186   1   W   ptg000168l  1   21186   -

Please also find the top 20 contig names from the assembly

grep ">" ../../hifi_assemblay/all_contam_rem/Ofav_hifiasm_allcontrem.fa | head -20
>ptg000001l
>ptg000002l
>ptg000003l
>ptg000004l
>ptg000005l
>ptg000006l
>ptg000007l
>ptg000008l
>ptg000009l
>ptg000010l
>ptg000011l
>ptg000012l
>ptg000013l
>ptg000014l
>ptg000015c
>ptg000016l
>ptg000017l
>ptg000018l
>ptg000019c
>ptg000020l

Finally, I list my conda environment with installed packages and versions :).

conda list
# packages in environment at /nethome/bdy8/miniconda3/envs/ragtag_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
blas                      1.0                         mkl  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.18.1               h7f8727e_0  
ca-certificates           2023.01.10           h06a4308_0  
certifi                   2022.12.7        py37h06a4308_0  
curl                      7.87.0               h5eee18b_0  
gdbm                      1.18                 hd4cb3f1_4  
intel-openmp              2021.4.0          h06a4308_3561  
intervaltree              3.1.0              pyhd3eb1b0_0  
k8                        0.2.5                h9a82719_1    bioconda
krb5                      1.19.4               h568e23c_0  
ld_impl_linux-64          2.38                 h1181459_1  
libcurl                   7.87.0               h91b91d3_0  
libdeflate                1.0                  h14c3975_1    bioconda
libedit                   3.1.20221030         h5eee18b_0  
libev                     4.33                 h7f8727e_1  
libffi                    3.4.2                h6a678d5_6  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libnghttp2                1.46.0               hce63b2e_0  
libssh2                   1.10.0               h8f2d780_0  
libstdcxx-ng              11.2.0               h1234567_1  
minimap2                  2.22                 h5bf99c6_0    bioconda
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py37h7f8727e_0  
mkl_fft                   1.3.1            py37hd3c417c_0  
mkl_random                1.2.2            py37h51133e4_0  
mummer                    3.23                          4    bioconda
ncurses                   6.4                  h6a678d5_0  
networkx                  2.6.3              pyhd3eb1b0_0  
numpy                     1.21.5           py37h6c91a56_3  
numpy-base                1.21.5           py37ha15fc14_3  
openssl                   1.1.1t               h7f8727e_0  
perl                      5.34.0               h5eee18b_2  
perl-threaded             5.32.1               hdfd78af_1    bioconda
pip                       22.3.1           py37h06a4308_0  
pysam                     0.15.3           py37hda2845c_1    bioconda
python                    3.7.16               h7a1cb2a_0  
ragtag                    2.1.0              pyhb7b1952_0    bioconda
readline                  8.2                  h5eee18b_0  
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sortedcontainers          2.4.0              pyhd3eb1b0_0  
sqlite                    3.40.1               h5082296_0  
tk                        8.6.12               h1ccaba5_0  
wheel                     0.38.4           py37h06a4308_0  
xz                        5.2.10               h5eee18b_1  
zlib                      1.2.13               h5eee18b_0  

Thank you in advance for any and all help, I am a little stumped with what is going on here.

Ben

benyoung93 commented 1 year ago

Okay, so I have moved on to another error after more troubleshooting. I renamed column 1 so that the object names in the AGPs no longer match the component identifiers in column 6. This makes the NCBI validator throw 0 errors. Yay.
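For anyone wanting to script that renaming step, here is a minimal sketch, not the exact procedure used in this thread. It assumes a tab-delimited AGP and appends a made-up suffix (`_obj`, an arbitrary choice) to column 1 on any component line whose object name equals its component ID. The function name `rename_clashing_objects` is hypothetical.

```python
def rename_clashing_objects(agp_lines, suffix="_obj"):
    """Append `suffix` to the object name (column 1) on component lines
    where it is identical to the component ID (column 6)."""
    renamed = []
    for line in agp_lines:
        line = line.rstrip("\n")
        # Pass comment/header lines and blank lines through unchanged.
        if not line or line.startswith("#"):
            renamed.append(line)
            continue
        fields = line.split("\t")
        is_gap = fields[4] in ("N", "U")  # column 5: N and U mark gap lines
        if not is_gap and fields[0] == fields[5]:
            fields[0] += suffix
        renamed.append("\t".join(fields))
    return renamed


# Toy input modelled on AGP lines quoted earlier in this issue.
example = [
    "ptg000001l\t1\t32039455\t1\tW\tptg000001l\t1\t32039455\t+",
    "ntLink_1\t1\t19528\t1\tW\tptg000176l\t1\t19528\t+",
]
for row in rename_clashing_objects(example):
    print(row)
```

Only the first toy line is changed, because the `ntLink_1` line already has distinct object and component names.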

I am now having a similar issue to some other people:

ragtag.py merge ../../hifi_assemblay/all_contam_rem/Ofav_hifiasm_allcontrem.fa ../../longstitch_new/longstitch.agp ../../ragtag/ofav_scaffold/ragtag.scaffold.agp
Sat Mar 11 14:59:18 2023 --- VERSION: RagTag v2.1.0
Sat Mar 11 14:59:18 2023 --- WARNING: This is a beta version of `ragtag merge`
Sat Mar 11 14:59:18 2023 --- CMD: ragtag.py merge ../../hifi_assemblay/all_contam_rem/Ofav_hifiasm_allcontrem.fa ../../longstitch_new/longstitch.agp ../../ragtag/ofav_scaffold/ragtag.scaffold.agp
Sat Mar 11 14:59:18 2023 --- WARNING: Without '-u' invoked, some component/object AGP pairs might share the same ID. Some external programs/databases don't like this. To ensure valid AGP format, use '-u'.
Sat Mar 11 14:59:18 2023 --- INFO: Building the scaffold graph from the AGP files
Traceback (most recent call last):
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/bin/ragtag_merge.py", line 430, in <module>
    main()
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/bin/ragtag_merge.py", line 362, in main
    agp_multi_sg.add_agps(agp_list, in_weights=weight_list, exclusion_set=comp_exclusion_set)
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/lib/python3.7/site-packages/ragtag_utilities/ScaffoldGraph.py", line 606, in add_agps
    for ap in self._get_assembly_points(agp, weight):
  File "/nethome/bdy8/miniconda3/envs/ragtag_env/lib/python3.7/site-packages/ragtag_utilities/ScaffoldGraph.py", line 578, in _get_assembly_points
    raise ValueError("Input AGPs do not have the same set of components.")
ValueError: Input AGPs do not have the same set of components.

I could not identify, from the other issues, a fix for this or why this is occurring. It is the same input assembly that was used for the two different scaffold attempts.

Any and all help would be amazing and I am happy to provide more info if you need it :).

Ben

malonge commented 1 year ago

Hi there,

Sorry about the delays. I would use standard command line tools to check if both AGP files contain identical sets of AGP components. I understand that it's the same input assembly, but perhaps some of the contigs are left out depending on the scaffolding solution.
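As a rough sketch of that check (written in Python rather than shell, and not part of RagTag), the following collects column 6 from the non-gap lines of each AGP and reports IDs found in only one of them. It assumes tab-delimited AGP lines; the function names are hypothetical.

```python
def agp_components(agp_lines):
    """Return the set of component IDs (column 6) found on non-gap lines."""
    components = set()
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip headers, comments and blank lines
        fields = line.split("\t")
        if fields[4] not in ("N", "U"):  # N and U are gap lines
            components.add(fields[5])
    return components


def compare_components(agp_a, agp_b):
    """Return (only_in_a, only_in_b) as sorted lists of component IDs."""
    a, b = agp_components(agp_a), agp_components(agp_b)
    return sorted(a - b), sorted(b - a)


# Toy data; with real files, pass open file handles instead of lists.
first = [
    "s1\t1\t10\t1\tW\tc1\t1\t10\t+",
    "s1\t11\t30\t2\tN\t20\tscaffold\tyes\tpaired-ends",
    "s1\t31\t40\t3\tW\tc2\t1\t10\t+",
]
second = ["t1\t1\t10\t1\tW\tc1\t1\t10\t-"]
print(compare_components(first, second))
```

Gap lines are skipped on purpose, since column 6 holds a gap length there rather than a component ID.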

benyoung93 commented 1 year ago

Hi @malonge

First of all, apologies, I was on holiday for the past two weeks.

I am going to be jumping back into this so will post updates and fixes here if I find them :).

Ben

benyoung93 commented 1 year ago

Good morning @malonge et al 🙂

Apologies for the delay, but I finally got back to this.

So I have successfully troubleshot this as per @malonge's suggestion. The difference between the AGP files is in column 6 (which holds the component ID on component lines and the gap length on gap lines). All contigs from the primary assembly are present and correct; the actual difference was the gap sizes inserted by ntLink (default 20) compared to RagTag (default 100).

I am going to run ntlink with the gapsize set to 100, and then see if the merge will successfully work. This should fix the discrepancy and merge should hopefully work :).
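To see the gap-size difference at a glance, a small sketch like the one below can tally gap types and lengths per AGP. It is only an illustration under the assumption of tab-delimited AGP input; `gap_lengths` is a hypothetical helper name, and the toy lines are copied from the two `head` outputs earlier in this issue.

```python
from collections import Counter


def gap_lengths(agp_lines):
    """Count gap lines by (gap type, gap length), e.g. ("N", 20)."""
    counts = Counter()
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[4] in ("N", "U"):  # column 5 marks gap lines
            counts[(fields[4], int(fields[5]))] += 1  # column 6 is gap length
    return counts


longstitch = [
    "ntLink_0\t1\t3452964\t1\tW\tptg000012l\t1\t3452964\t-",
    "ntLink_0\t3452965\t3452984\t2\tN\t20\tscaffold\tyes\tpaired-ends",
]
ragtag = [
    "NW_018148600.1_RagTag\t1\t36461\t1\tW\tptg000224l\t1\t36461\t+",
    "NW_018148600.1_RagTag\t36462\t36561\t2\tU\t100\tscaffold\tyes\talign_genus",
]
print(gap_lengths(longstitch))
print(gap_lengths(ragtag))
```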

I will post whether this works and then close the issue after that.

Thank you for the help

Ben

jolbi commented 1 year ago

Hi,

thank you for this easy to use and well documented tool!

I am struggling with a similar issue and I cannot find a way around it. I want to use ragtag merge to merge a few reference-based AGPs produced by ragtag scaffold and one HiC-scaffolded AGP. When I include the HiC AGP in ragtag merge I get the error:

Wed Aug  2 16:04:32 2023 --- VERSION: RagTag v2.1.0
Wed Aug  2 16:04:32 2023 --- WARNING: This is a beta version of `ragtag merge`
Wed Aug  2 16:04:32 2023 --- INFO: Building the scaffold graph from the AGP files
Traceback (most recent call last):
  File "/users/timg/.conda/envs/ragtag/bin/ragtag_merge.py", line 430, in <module>
    main()
  File "/users/timg/.conda/envs/ragtag/bin/ragtag_merge.py", line 362, in main
    agp_multi_sg.add_agps(agp_list, in_weights=weight_list, exclusion_set=comp_exclusion_set)
  File "/users/timg/.conda/envs/ragtag/lib/python3.9/site-packages/ragtag_utilities/ScaffoldGraph.py", line 606, in add_agps
    for ap in self._get_assembly_points(agp, weight):
  File "/users/timg/.conda/envs/ragtag/lib/python3.9/site-packages/ragtag_utilities/ScaffoldGraph.py", line 518, in _get_assembly_points
    raise RuntimeError("only complete components can be added to the graph.")
RuntimeError: only complete components can be added to the graph.

HiC scaffolding was done using YaHS and then I did some manual curation in Juicebox. Both the YaHS scaffolding agp and agp converted from Juicebox .assembly file give the same ragtag error.

I validated the AGPs (from YaHS and Juicebox) with your tool and it says there are no errors. When I use the NCBI validator, one type of error and several warnings appear.

Error: invalid value for linkage_evidence (column 9): proximity_ligation

I think the validator has not been updated for AGP version 2.1, where proximity_ligation was added as a valid value, so I think this is ok.

Warnings (one example per type):

same component_id found on different scaffolds; previous occurance at line 815, in another object

If I understand correctly, this is because of the assembly error correction step in YaHS and because of some splitting in Juicebox. Some original contigs were split and allocated to different scaffolds. Is this ok for ragtag?

component span appears out of order; preceding span: 1..100000 at line 265

I don't understand this one. These are the lines with the above warning:

line_256    scaffold_1  13902924    14002923    265 W   contig_2705 1   100000  +
line_1243   scaffold_1  81152522    81177646    1243    W   contig_2705 100001  125125  -

duplicate component with non-draft type; preceding span: 438001..544668 at line 897

I also don't understand this one. These are the corresponding lines:

line_897    scaffold_1  55840816    55947483    897 W   contig_1439 438001  544668  -
line_1503   scaffold_1  103099110   103348109   1503    W   contig_1439 1   249000  -
line_5954   scaffold_8  31810598    31999597    399 W   contig_1439 249001  438000  +
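One way to list every contig that is used in pieces, as a minimal sketch and not anything RagTag or the NCBI validator provides: group the component lines by component ID and keep those that occur more than once. It assumes a standard 9-column tab-delimited AGP (without the `line_N` labels added above), and `split_components` is a hypothetical name.

```python
from collections import defaultdict


def split_components(agp_lines):
    """Map each component used on more than one line to its sorted
    (comp_beg, comp_end, object) spans."""
    spans = defaultdict(list)
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[4] in ("N", "U"):  # skip gap lines
            continue
        spans[fields[5]].append((int(fields[6]), int(fields[7]), fields[0]))
    return {comp: sorted(s) for comp, s in spans.items() if len(s) > 1}


# The three contig_1439 lines quoted above, without the line labels.
agp = [
    "scaffold_1\t55840816\t55947483\t897\tW\tcontig_1439\t438001\t544668\t-",
    "scaffold_1\t103099110\t103348109\t1503\tW\tcontig_1439\t1\t249000\t-",
    "scaffold_8\t31810598\t31999597\t399\tW\tcontig_1439\t249001\t438000\t+",
]
for comp, pieces in split_components(agp).items():
    print(comp, pieces)
```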

Do you have any ideas what could be wrong? I don't know where else to look and what to try.

I read in your ragtag paper that you used an AGP converted from .assembly with some custom script. Did the manual curation involve breaking some contigs? Can you maybe provide the script you mention in the paper? I am using the juicer post command (packed with the yahs distribution) to make the AGP from the assembly as described in the yahs tutorial. I also tried juicebox_assembly_converter.py from https://github.com/phasegenomics/juicebox_scripts but it does not work with my .assembly and .fasta, so it looks like something may be wrong there in the first place...

Let me know if I should post some more info.

Thank you in advance for any help. Tim

mscharmann commented 3 months ago

Hi, this error message is not very clear, but what it apparently means is (line 517 of ScaffoldGraph.py): "if comp_len < self.get_component_len(agp_line.comp):"

-> So ragtag.py merge cannot handle broken components, i.e. contig breaks! This seems like something that many people will run into. If any scaffolder breaks contigs (e.g. because it identified a mis-assembly), the resulting AGP cannot be used with ragtag merge. Quite disappointing, I think.
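To find which AGP lines would trip a check of that kind before running the merge, something like the sketch below could help. It only mirrors the quoted comparison as I read it (component span shorter than the full contig) and is not RagTag's code. It assumes an uncompressed FASTA and a tab-delimited AGP; `fasta_lengths` and `incomplete_components` are hypothetical names, and the toy data is invented.

```python
def fasta_lengths(fasta_lines):
    """Return {sequence name: length} from FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the header up to first space
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def incomplete_components(agp_lines, lengths):
    """List (component, beg, end, full_length) for component lines whose
    span is shorter than the whole contig."""
    hits = []
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[4] in ("N", "U"):  # gap lines have no component span
            continue
        comp, beg, end = fields[5], int(fields[6]), int(fields[7])
        full = lengths.get(comp)
        if full is not None and end - beg + 1 < full:
            hits.append((comp, beg, end, full))
    return hits


# Invented toy data: ctgA is 15 bp but the AGP uses only bases 3..15.
fasta = [">ctgA", "ACGTACGTAC", "ACGTA", ">ctgB", "ACGT"]
agp = [
    "s1\t1\t13\t1\tW\tctgA\t3\t15\t+",
    "s1\t14\t17\t2\tW\tctgB\t1\t4\t-",
]
print(incomplete_components(agp, fasta_lengths(fasta)))
```

Components missing from the FASTA are silently skipped here; that is a simplification of this sketch.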

I see a workaround: first break both AGPs at these breakpoints, and give new component names to the resulting pieces. Perhaps another ragtag module can do that?
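A first step towards that workaround could look like the sketch below, which only computes the shared piece intervals; rewriting the AGPs and splitting the FASTA with new names is left out. It is an untested idea rather than a verified fix: it assumes tab-delimited AGPs and known contig lengths, `piece_intervals` is a hypothetical name, and the contig length used in the toy data is an assumption.

```python
from collections import defaultdict


def piece_intervals(agp_line_sets, lengths):
    """For each component broken in at least one AGP, return the 1-based
    (beg, end) pieces implied by the union of all breakpoints."""
    cuts = defaultdict(set)
    for agp_lines in agp_line_sets:
        for line in agp_lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if fields[4] in ("N", "U"):
                continue
            comp, beg, end = fields[5], int(fields[6]), int(fields[7])
            # Store cut positions as boundaries: just before beg, just after end.
            cuts[comp].update((beg - 1, end))
    pieces = {}
    for comp, positions in cuts.items():
        bounds = sorted(positions | {0, lengths[comp]})
        if len(bounds) > 2:  # more than the two contig ends means a break
            pieces[comp] = [(lo + 1, hi) for lo, hi in zip(bounds, bounds[1:])]
    return pieces


# Toy data: one AGP splits contig_1439, the other uses it whole.
# The length 544668 is assumed for illustration only.
hic = [
    "scaffold_1\t1\t249000\t1\tW\tcontig_1439\t1\t249000\t-",
    "scaffold_8\t1\t189000\t1\tW\tcontig_1439\t249001\t438000\t+",
    "scaffold_1\t249001\t355668\t2\tW\tcontig_1439\t438001\t544668\t-",
]
ref = ["chr1_RagTag\t1\t544668\t1\tW\tcontig_1439\t1\t544668\t+"]
print(piece_intervals([hic, ref], {"contig_1439": 544668}))
```

Each interval would then become its own renamed component in every AGP and in the assembly FASTA, so that all inputs again refer to complete components.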