tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License

KeyError: ('SCF_15', 6340, 3871002) during agp compress #45

Closed fbemm closed 4 years ago

fbemm commented 7 years ago

I am trying to compress two AGP files. They look like this:

    SCF_1   1         14498021  1  W  CTG_001  1         14498021  +
    SCF_1   14498022  14525698  2  N  27677    scaffold  yes       map
    SCF_1   14525699  16119896  3  W  CTG_001  14498022  16092219  +
    SCF_15  1         135269    1  W  CTG_015  1         135269    +
    SCF_15  135270    328631    2  N  193362   scaffold  yes       map
    SCF_15  328632    582743    3  W  CTG_058  1         254112    +
    SCF_15  582744    582843    4  N  100      scaffold  yes       map
    SCF_15  582844    3871002   5  W  CTG_059  1         3288159   +

    chr_2  1        3871002   1  W  SCF_15  1         3871002   -
    chr_2  3871003  3871102   2  N  100     scaffold  yes       map
    chr_2  3871103  19990998  3  W  SCF_1   1         16119896  -

The following command failed

python -m jcvi.formats.agp compress test_1.agp test_2.agp

With

    Traceback (most recent call last):
      File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/ebio/abt6_projects9/abt6_software/bin/jcvi/lib/python2.7/site-packages/jcvi-0.7.3-py2.7.egg/jcvi/formats/agp.py", line 1824, in <module>
        main()
      File "/ebio/abt6_projects9/abt6_software/bin/jcvi/lib/python2.7/site-packages/jcvi-0.7.3-py2.7.egg/jcvi/formats/agp.py", line 741, in main
        p.dispatch(globals())
      File "/ebio/abt6_projects9/abt6_software/bin/jcvi/local/lib/python2.7/site-packages/jcvi-0.7.3-py2.7.egg/jcvi/apps/base.py", line 87, in dispatch
        globalsaction
      File "/ebio/abt6_projects9/abt6_software/bin/jcvi/lib/python2.7/site-packages/jcvi-0.7.3-py2.7.egg/jcvi/formats/agp.py", line 791, in compress
        store[(a.component_id, a.component_beg, a.component_end)]
    KeyError: ('SCF_15', 1, 3871002)

Any idea what goes wrong here?

Thanks a lot, F

tanghaibao commented 7 years ago

@fbemm thanks. Unfortunately this tool is not as general as it sounds. It has mostly been a tool to deal with chimeric scaffolds (an intermediate step during genome assembly). What it does is look up where the split scaffolds sit in the final chromosome, then figure out their coordinates in the scaffolds before splitting. An example is shown below:

    Example:
    `a.agp` could contain split scaffolds:
    scaffold_0.1    1       600309  1       W       scaffold_0      1 600309  +

    `b.agp` could contain mapping to chromosomes:
    LG05    6435690 7035998 53      W       scaffold_0.1    1       600309  +

    The final AGP we want is:
    LG05    6435690 7035998 53      W       scaffold_0      1       600309  +

My bad. A more general tool for compressing multiple AGP files would be way more useful though, which I have not yet implemented.
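The lookup described above can be approximated as an exact-match dictionary lookup. What follows is a minimal Python sketch under stated assumptions, not the jcvi source: `parse_agp` and `compress_exact` are hypothetical names, and the way the two orientations are combined is an assumption. It reproduces the `scaffold_0.1` example and shows why a scaffold described by several lines has no single matching key, which is consistent with the `KeyError` reported above.

```python
# Sketch only: an exact-match "compress" of two AGP files.
# Function names and orientation handling are illustrative assumptions.

def parse_agp(text):
    """Yield AGP rows as lists of fields, skipping blank and comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line.split()

def compress_exact(a_text, b_text):
    # Index a.agp by (object, object_beg, object_end) -> component columns.
    store = {}
    for row in parse_agp(a_text):
        if row[4] in ("N", "U"):          # gap lines carry no component
            continue
        store[(row[0], int(row[1]), int(row[2]))] = row[5:9]

    out = []
    for row in parse_agp(b_text):         # b.agp forms the backbone
        if row[4] in ("N", "U"):
            out.append(row)
            continue
        # Raises KeyError unless ONE line of a.agp spans exactly this range.
        comp_id, comp_beg, comp_end, comp_orient = store[
            (row[5], int(row[6]), int(row[7]))
        ]
        orient = "+" if comp_orient == row[8] else "-"
        out.append(row[:5] + [comp_id, comp_beg, comp_end, orient])
    return out

A_SPLIT = "scaffold_0.1 1 600309 1 W scaffold_0 1 600309 +"
B_CHROM = "LG05 6435690 7035998 53 W scaffold_0.1 1 600309 +"
print("\t".join(compress_exact(A_SPLIT, B_CHROM)[0]))

# The input from this issue: SCF_15 is described by five lines, so no
# single key equals ('SCF_15', 1, 3871002).
A_MULTI = """
SCF_15 1 135269 1 W CTG_015 1 135269 +
SCF_15 135270 328631 2 N 193362 scaffold yes map
SCF_15 328632 582743 3 W CTG_058 1 254112 +
SCF_15 582744 582843 4 N 100 scaffold yes map
SCF_15 582844 3871002 5 W CTG_059 1 3288159 +
"""
B_MULTI = "chr_2 1 3871002 1 W SCF_15 1 3871002 -"
try:
    compress_exact(A_MULTI, B_MULTI)
except KeyError as err:
    print("KeyError:", err)
```

In this sketch the lookup only succeeds when a component range in the second file matches one object line in the first file exactly, which is the split-scaffold situation the tool was written for.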

Haibao

lassancejm commented 5 years ago

My guess is that the answer is no, but I am wondering whether you have tried to implement a tool to merge two AGP files (contig-to-scaffold and scaffold-to-chromosome). Annoyingly enough, this is what NCBI wants as part of genome submissions, i.e. a single file describing contig-to-chromosome relationships.

tanghaibao commented 5 years ago

@lassancejm

"agp compress" was a tool I wrote a while back for the submission of the Medicago genome assembly to NCBI (similar to your case). It was meant to compress a contig-to-scaffold AGP and a scaffold-to-chromosome AGP into one, exactly as you imagined. However, my particular use case back then was different from yours. As a result, it probably doesn't handle all cases. I apologize for the false impression.
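For the more general case discussed in this thread, one possible approach is to project every contig-to-scaffold line onto the chromosome, reversing the order and flipping the orientation when the scaffold is placed on the minus strand. The sketch below is not part of jcvi; `parse_agp` and `flatten` are hypothetical names, and it assumes that each scaffold is used whole and that the lines of each scaffold are sorted by position. The sample data is the input posted at the top of this issue.

```python
# Sketch only: flatten contig-to-scaffold + scaffold-to-chromosome AGP
# into contig-to-chromosome. Assumes whole-scaffold placements.
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}
GAPS = ("N", "U")

def parse_agp(text):
    """Yield AGP rows as lists of fields, skipping blank and comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line.split()

def flatten(lower_text, upper_text):
    lower = defaultdict(list)             # scaffold -> its rows, in file order
    for row in parse_agp(lower_text):
        lower[row[0]].append(row)

    out = []
    for row in parse_agp(upper_text):
        obj, obeg = row[0], int(row[1])
        if row[4] in GAPS or row[5] not in lower:
            out.append([obj, obeg, int(row[2])] + row[4:])   # keep as-is
            continue
        sbeg, send, orient = int(row[6]), int(row[7]), row[8]
        parts = lower[row[5]]
        if (int(parts[0][1]), int(parts[-1][2])) != (sbeg, send):
            raise ValueError("partial scaffold use is not handled here")
        if orient == "-":
            parts = parts[::-1]           # walk the scaffold back to front
        for p in parts:
            pbeg, pend = int(p[1]), int(p[2])
            if orient == "-":
                nbeg, nend = obeg + (send - pend), obeg + (send - pbeg)
            else:
                nbeg, nend = obeg + (pbeg - sbeg), obeg + (pend - sbeg)
            tail = list(p[4:])
            if tail[0] not in GAPS and orient == "-":
                tail[4] = FLIP.get(tail[4], tail[4])
            out.append([obj, nbeg, nend] + tail)

    counts = defaultdict(int)             # renumber parts per chromosome
    rows = []
    for obj, beg, end, *tail in out:
        counts[obj] += 1
        rows.append([obj, str(beg), str(end), str(counts[obj])] + tail)
    return rows

LOWER = """
SCF_1 1 14498021 1 W CTG_001 1 14498021 +
SCF_1 14498022 14525698 2 N 27677 scaffold yes map
SCF_1 14525699 16119896 3 W CTG_001 14498022 16092219 +
SCF_15 1 135269 1 W CTG_015 1 135269 +
SCF_15 135270 328631 2 N 193362 scaffold yes map
SCF_15 328632 582743 3 W CTG_058 1 254112 +
SCF_15 582744 582843 4 N 100 scaffold yes map
SCF_15 582844 3871002 5 W CTG_059 1 3288159 +
"""
UPPER = """
chr_2 1 3871002 1 W SCF_15 1 3871002 -
chr_2 3871003 3871102 2 N 100 scaffold yes map
chr_2 3871103 19990998 3 W SCF_1 1 16119896 -
"""
for r in flatten(LOWER, UPPER):
    print("\t".join(r))
```

Placements that use only part of a scaffold would need the component coordinates clipped as well; the sketch raises an error in that case rather than guessing.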

Haibao

lassancejm commented 5 years ago

No problem; I was looking for a ready-made solution, as I know your tools generally work great!