AssertionError: You have a collision in map names

tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics

BSD 2-Clause "Simplified" License

738 stars 187 forks source link

AssertionError: You have a collision in map names #98

Closed lplough closed 4 years ago

lplough commented 6 years ago

Running python -m jcvi.assembly.allmaps merge on my own maps (a male and female linkage map, formated identically to the CSV files in the example) resulted in a collision of map names...See output below.

I have a number of markers with runs of identical cM positions (not that many individuals int he mapping family) - is that the problem? Im not sure what the 'map names' are referring to?

10:23:31 [__init__] CACHEDIR=/data/users/root/.cache/matplotlib
10:23:31 [font_manager] Using fontManager instance from /data/users/root/.cache/matplotlib/fontList.json
10:23:31 [__init__] backend agg version v2.2
10:23:31 [allmaps] A total of 0 markers written to `./one_SNP_m5/m5_MF.bed`.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/data/users/root/Downloads/code/jcvi/assembly/allmaps.py", line 1871, in <module>
    main()
  File "/data/users/root/Downloads/code/jcvi/assembly/allmaps.py", line 827, in main
    p.dispatch(globals())
  File "/data/users/root/Downloads/code/jcvi/apps/base.py", line 96, in dispatch
    globals[action](sys.argv[2:])
  File "/data/users/root/Downloads/code/jcvi/assembly/allmaps.py", line 1234, in merge
    assert len(maps) == len(mapnames), "You have a collision in map names"
AssertionError: You have a collision in map names

lplough commented 6 years ago

Or when I try to create the .bed file myself in bash, and run the path command (comparing the maps) I get this error:

Malformed marker name: lg1:54.77

11:58:08 [__init__] CACHEDIR=/data/users/root/.cache/matplotlib
11:58:08 [font_manager] Using fontManager instance from /data/users/root/.cache/matplotlib/fontList.json
11:58:09 [__init__] backend agg version v2.2
11:58:09 [base] Load file `m5_Fem.tab.lg1.N.bed`
11:58:09 [allmaps] Malformed marker name: lg1:54.77

Nothing interesting about that part of the file, which looks like this:

scf7180011599932    99121   99122   lg1:0.00
scf7180011603150    423933  423934  lg1:0.00
scf7180011607090    81009   81010   lg1:0.00
scf7180011615841    11567   11568   lg1:0.00
scf7180011602748    104580  104581  lg1:2.11
scf7180011607090    80044   80045   lg1:2.11
scf7180011599260    37483   37484   lg1:5.27
scf7180011599433    553925  553926  lg1:5.27
scf7180011600010    113199  113200  lg1:5.27
scf7180011600010    42869   42870   lg1:5.27

I can remove lines and the error seems to just jump to random markers/positions....

tanghaibao commented 6 years ago

@lplough The marker names in the .bed file that you created does not contain map name. The marker should be of the form: JMMale-23:64.691000, or mapname-chr:position. If you are about to write your own .bed file, you'll need to prepend the map name.

Understanding this might also help with the initial error that you had in the first place, you'll need to name the two .csv files to differently so that the two maps do not share the same name. The converter script truncated map names to the first dot, so seeing the error means that your two maps happend to have the same first part in their file names.

Haibao

lplough commented 6 years ago

Thanks. Hmmm. Well, changing the start of the map file names to be unique for the male and female maps did get rid of the collision error when merging the two. However, the output .bed file from the merge command was blank. no errors reported.

python -m jcvi.assembly.allmaps merge Male.M5.csv Fem.M5.csv -o MF2.bed
...
...
10:34:21 [__init__] CACHEDIR=/data/users/root/.cache/matplotlib
10:34:21 [font_manager] Using fontManager instance from /data/users/root/.cache/matplotlib/fontList.json
10:34:21 [__init__] backend agg version v2.2
10:34:21 [allmaps] A total of 0 markers written to `MF2.bed`.
10:34:21 [allmaps] Weights file written to `weights.txt`.
(my_project) root@genome-hunter:~/Bluecrab_linkagemap/one_SNP_m5/ALLMAPS# ls -lrth
total 404K
-rw-r--r-- 1 root root 198K May 22 10:31 Fem.M5.csv
-rw-r--r-- 1 root root 199K May 22 10:31 Male.M5.csv
-rw-r--r-- 1 root root    0 May 22 10:34 MF2.bed
-rw-r--r-- 1 root root   13 May 22 10:34 weights.txt

Regarding the making of my own .bed file, I was simply following the example of the formatting of another genetic map on the ALLmaps tutorial on using alternative maps:

Our command for conversion is:

python -m jcvi.assembly.geneticmap bed GeneticMap.out --switch

The converted GeneticMap.bed looks like:

Scaffold0006    3409259 3409260 lg0:0.0
Scaffold0006    3409280 3409281 lg0:0.0
Scaffold0006    2298176 2298177 lg0:2.289
Scaffold0006    3210402 3210403 lg0:2.289
Scaffold0006    3210441 3210442 lg0:2.289
Scaffold0006    4907582 4907583 lg0:2.289
Scaffold0006    6286872 6286873 lg0:5.376
Scaffold0006    8883770 8883771 lg0:5.376
Scaffold0006    8883794 8883795 lg0:5.376
Scaffold0006    8883810 8883811 lg0:5.376

I just used the male map and converted it to this format in the hopes that I could run the path module on it...Linkage groups are not prepended with a map name here...

UPDATE: I successfully ran the user created .bed file, I had to (as you suggested for the merged maps ;) prepend the linkage group with a map name, which could then be linked to the weights.txt file, which had only one map name (one line).

Sorry for all the questions. Hopefully others find this useful, but perhaps I wasnt reading into the details enough. Still, I can't explain the issue with the merging for male and female maps...no error thrown.

tanghaibao commented 6 years ago

@lplough We are aiming to create a file that mixes different maps to feed into path. Whether we prepare our input from .csv (use merge) or .bed (use mergebed). The goal is always to get a combined .bed with markers properly formatted. For example (this is the last step in the tutorial on using alternative maps):

python -m jcvi.assembly.allmaps mergebed OpticalMap.bed GeneticMap.bed Chickpea.bed -o Mt.bed

The merged Mt.bed file will have three tiers of evidence:

Scaffold0001    8954    15489   Chickpea-4:237.28811    Scaffold0001:8955
Scaffold0001    27702   28374   Chickpea-4:236.88576    Scaffold0001:27703
Scaffold0001    63097   63098   GeneticMap-lg0:135.93   Scaffold0001:63098
Scaffold0001    64515   67197   Chickpea-4:236.69478    Scaffold0001:64516
Scaffold0001    64596   64597   GeneticMap-lg0:135.93   Scaffold0001:64597
Scaffold0001    67504   69163   OpticalMap-OM_Chr1:294.56259    765.7   +       Scaffold0001:67505
Scaffold0001    69163   83408   OpticalMap-OM_Chr1:294.58039    765.7   +       Scaffold0001:69164

Note how in the fourth column every marker is prepended with a map name by mergebed. merge will do the same thing, but just on .csv. Not sure why merge gave you an empty file in your case. But if you just use mergebed and then path you should be fine.

Haibao

francicco commented 4 years ago

Hi @tanghaibao

I'm trying to use a "Genomic Map" a map that contains alignment coordinates between the query and the reference:

Scaffold    start   end genomic_pos_in_ref  orientation Pidentity
Scf0000001  15000   44999   GM_206001:6838877   +   97.1172
Scf0000001  0   14999   GM_206001:6688148   +   94.3433
Scf0000001  45000   183600  GM_206001:6950546   +   94.5592
Scf0000002  0   39999   GM_213001:16550175  +   95.2271
Scf0000002  100000  180191  GM_213001:16749269  +   94.6773
Scf0000002  85000   104999  GM_213001:16685005  +   94.5948
Scf0000002  40000   59999   GM_213001:16595589  +   93.4059
Scf0000002  55000   89999   GM_213001:16641753  +   92.0195
Scf0000003  150000  173909  GM_205001:450079    -   96.7465
Scf0000003  40000   144999  GM_203003:9144953   +   96.3359

with the synteny map.

First of all, I don't know if I'm introducing any bias or error in the procedure, apparently no (If I use this map only and then I check the synteny it seems alright), but when I merge the two:

python -m jcvi.assembly.allmaps mergebed $SPECIES.synteny.bed $SPECIES.on.$REFSP.ROmashmap.bed -o $SPECIES.scaff.bed

I got the same collision in map names error:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 1872, in <module>
    main()
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 828, in main
    p.dispatch(globals())
  File "/mnt/storage/home/tk19812/scratch/software/jcvi/jcvi/apps/base.py", line 100, in dispatch
    globals[action](sys.argv[2:])
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 1274, in mergebed
    assert len(maps) == len(mapnames), "You have a collision in map names"
AssertionError: You have a collision in map names

How can I fix it? Thanks a lot F

tanghaibao commented 4 years ago

@francicco

Can you try to rename the second bed file to $SPECIES_on_$REFSP_ROmashmap.bed instead? The collision is due to ALLMAPS taking the prefix of the file (up to the first dot in the path) as the name of the map, which is in conflict with $SPECIES.synteny.bed. Let me know if it still fails.

Haibao

francicco commented 4 years ago

Works!!!! Thanks a lot F

P.S. is conceptually anything wrong in making this genomic map?

francicco commented 4 years ago

@tanghaibao

Now the problem is coming from jcvi.assembly.allmaps path, it throws this error

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 1872, in <module>
    main()
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 828, in main
    p.dispatch(globals())
  File "/mnt/storage/home/tk19812/scratch/software/jcvi/jcvi/apps/base.py", line 100, in dispatch
    globals[action](sys.argv[2:])
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 1402, in path
    weights = Weights(weightsfile, mapnames)
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 626, in __init__
    pivot_weight, o, pivot = self.get_pivot(mapnames)
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 647, in get_pivot
    for m, w in self.items() if m in mapnames)
  File "/mnt/storage/scratch/tk19812/software/jcvi/jcvi/assembly/allmaps.py", line 647, in <genexpr>
    for m, w in self.items() if m in mapnames)
ValueError: 'Hpac_on_Hmel_ROmashmap' is not in list

tanghaibao commented 4 years ago

@francicco

Make sure that this map name is the same in weights.txt and in the bedfile generated from merge..

Also per your previous question, I don't see anything wrong with your experiment if you only wish to look at the plot. You need to be cautious to actually release a genome based on two different synteny maps and think about what that means.

Haibao

francicco commented 4 years ago

@tanghaibao

Thanks, it seems like it's working

19:18:54 [base] Load file Hpac.scaff.bed 19:18:55 [allmaps] Map contains 21033 markers in 108 linkage groups. 19:18:55 [allmaps] Retained 21,013 of 21,033 (99.9%) clean markers.

19:18:55 [base] Load file `weights.txt`
19:18:55 [base] Imported 2 records from `weights.txt`.
19:18:55 [allmaps] Map weights: [('Hpac_on_Hmel_ROmashmap', 1), ('Hpac', 1)]
19:18:55 [allmaps] Linkage function: double-linkage
19:18:55 [allmaps] Partition LGs based on Hpac

Thanks a lot F