schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

KeyError thrown when running example with v1.6.3 #165

Closed xtan1221 closed 1 year ago

xtan1221 commented 1 year ago

I successfully installed the latest version from bioconda on macOS Monterey (M1 Pro chip). Then I tried to run pipeline.sh in the example/ folder with MUMmer:

  1. The MUMmer steps ran successfully and generated the out.filtered.coords and out.filtered.delta files.
  2. Then I ran syri -c out.filtered.coords -d out.filtered.delta -r refgenome -q qrygenome, which produced the following error message:
    
    multiprocessing.pool.RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
        return list(map(*args))
      File "syri/pyxFiles/findshv.pyx", line 108, in syri.findshv.getsnps
    KeyError: 'CP006105.2'
    """
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "/Users/tan/opt/anaconda3/envs/syri_env/bin/syri", line 6, in <module>
        main(sys.argv[1:])
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/site-packages/syri/scripts/syri.py", line 319, in main
        syri(args)
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/site-packages/syri/scripts/syri.py", line 246, in syri
        getshv(args, coords, chrlink)
      File "syri/pyxFiles/findshv.pyx", line 203, in syri.findshv.getshv
      File "syri/pyxFiles/findshv.pyx", line 204, in syri.findshv.getshv
      File "syri/pyxFiles/findshv.pyx", line 205, in syri.findshv.getshv
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/multiprocessing/pool.py", line 364, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/Users/tan/opt/anaconda3/envs/syri_env/lib/python3.9/multiprocessing/pool.py", line 771, in get
        raise self._value
    KeyError: 'CP006105.2'

I also tried the manual installation:

conda install cython=0.29.23 numpy=1.21.2 scipy=1.6.2 pandas=1.2.4 python-igraph=0.9.1 psutil=5.8.0 pysam=0.16.0.1 matplotlib=3.3.4
pip install .

It succeeded. Then I ran the example again and hit the same error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "syri/pyxFiles/findshv.pyx", line 108, in syri.findshv.getsnps
KeyError: 'CP006105.2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tan/opt/anaconda3/envs/syri_master/bin/syri", line 6, in <module>
    main(sys.argv[1:])
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/site-packages/syri/scripts/syri.py", line 326, in main
    syri(args)
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/site-packages/syri/scripts/syri.py", line 252, in syri
    getshv(args, coords, chrlink)
  File "syri/pyxFiles/findshv.pyx", line 203, in syri.findshv.getshv
  File "syri/pyxFiles/findshv.pyx", line 204, in syri.findshv.getshv
  File "syri/pyxFiles/findshv.pyx", line 205, in syri.findshv.getshv
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'CP006105.2'



Any idea how to fix this?

mnshgl0110 commented 1 year ago

Hi @xtan1221, I reran the pipeline and it finished without errors. I got the following files:

-rwxrwx---  1 goel grp_schneeberger  12M Nov  2 15:55 GCA_000146045.2_R64_genomic.fna
-rwxrwx---  1 goel grp_schneeberger  12M Nov  2 15:55 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna
-rwxrwx---  1 goel grp_schneeberger  12M Nov  2 15:55 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
lrwxrwxrwx  1 goel grp_schneeberger   31 Nov  2 15:55 refgenome -> GCA_000146045.2_R64_genomic.fna
lrwxrwxrwx  1 goel grp_schneeberger   50 Nov  2 15:55 qrygenome -> GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
-rwxrwx---  1 goel grp_schneeberger 486K Nov  2 15:57 out.delta
-rwxrwx---  1 goel grp_schneeberger 159K Nov  2 15:57 out.filtered.delta
-rwxrwx---  1 goel grp_schneeberger  67K Nov  2 15:57 out.filtered.coords
-rwxrwx---  1 goel grp_schneeberger  352 Nov  2 16:51 mapids.txt
-rwxrwx---  1 goel grp_schneeberger 7.1M Nov  2 16:52 syri.out
-rwxrwx---  1 goel grp_schneeberger  12M Nov  2 16:52 syri.vcf
-rwxrwx---  1 goel grp_schneeberger  541 Nov  2 16:52 syri.summary
-rwxrwx---  1 goel grp_schneeberger  11K Nov  2 16:52 syri.log

Can you please check whether you have all the files listed above mapids.txt, and whether their sizes match?

xtan1221 commented 1 year ago

Hi @mnshgl0110, thank you for your response.

I do have all the files listed above mapids.txt, with the same sizes, before running syri:

-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000146045.2_R64_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
-rw-r--r--  1 tan  staff   487K Nov  2 17:29 out.delta
-rw-r--r--  1 tan  staff    67K Nov  2 17:29 out.filtered.coords
-rw-r--r--  1 tan  staff   159K Nov  2 17:29 out.filtered.delta
lrwxr-xr-x  1 tan  staff    50B Nov  2 17:28 qrygenome@ -> GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
lrwxr-xr-x  1 tan  staff    31B Nov  2 17:27 refgenome@ -> GCA_000146045.2_R64_genomic.fna

Then I ran syri, and the same error occurred:

Reading Coords - WARNING - Chromosomes IDs do not match.
Reading Coords - WARNING - Matching them automatically. For each reference genome, most similar query genome will be selected. Check mapids.txt for mapping used.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "syri/pyxFiles/findshv.pyx", line 108, in syri.findshv.getsnps
KeyError: 'CP006105.2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tan/opt/anaconda3/envs/syri_master/bin/syri", line 6, in <module>
    main(sys.argv[1:])
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/site-packages/syri/scripts/syri.py", line 326, in main
    syri(args)
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/site-packages/syri/scripts/syri.py", line 252, in syri
    getshv(args, coords, chrlink)
  File "syri/pyxFiles/findshv.pyx", line 203, in syri.findshv.getshv
  File "syri/pyxFiles/findshv.pyx", line 204, in syri.findshv.getshv
  File "syri/pyxFiles/findshv.pyx", line 205, in syri.findshv.getshv
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/tan/opt/anaconda3/envs/syri_master/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'CP006105.2'

Below are the generated files:

-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000146045.2_R64_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
lrwxr-xr-x  1 tan  staff    31B Nov  2 17:27 refgenome@ -> GCA_000146045.2_R64_genomic.fna
lrwxr-xr-x  1 tan  staff    50B Nov  2 17:28 qrygenome@ -> GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
-rw-r--r--  1 tan  staff   487K Nov  2 17:29 out.delta
-rw-r--r--  1 tan  staff   159K Nov  2 17:29 out.filtered.delta
-rw-r--r--  1 tan  staff    67K Nov  2 17:29 out.filtered.coords
-rw-r--r--  1 tan  staff   352B Nov  2 17:36 mapids.txt
-rw-r--r--  1 tan  staff   274B Nov  2 17:36 invOut.txt
-rw-r--r--  1 tan  staff   684B Nov  2 17:36 TLOut.txt
-rw-r--r--  1 tan  staff   838B Nov  2 17:36 invTLOut.txt
-rw-r--r--  1 tan  staff   6.7K Nov  2 17:36 dupOut.txt
-rw-r--r--  1 tan  staff   728B Nov  2 17:36 invDupOut.txt
-rw-r--r--  1 tan  staff    31K Nov  2 17:36 ctxOut.txt
-rw-r--r--  1 tan  staff    19K Nov  2 17:36 synOut.txt
-rw-r--r--  1 tan  staff    95K Nov  2 17:36 sv.txt
-rw-r--r--  1 tan  staff    10K Nov  2 17:36 notAligned.txt
-rw-r--r--  1 tan  staff   2.6K Nov  2 17:36 syri.log
-rw-r--r--  1 tan  staff     0B Nov  2 17:36 snps_init.txt

It looks like a number of intermediate files were generated (I assume), but the run stopped while finding SNPs and small indels (snps_init.txt is empty, as shown above). Below is the content of syri.log:

2022-11-02 17:41:14,037 - Reading Coords - INFO - syri:135 - Reading input from .tsv file
2022-11-02 17:41:14,047 - Reading Coords - WARNING - syri:135 - Chromosomes IDs do not match.
2022-11-02 17:41:14,048 - Reading Coords - WARNING - syri:135 - Matching them automatically. For each reference genome, most similar query genome will be selected. Check mapids.txt for mapping used.
2022-11-02 17:41:14,211 - Reading Coords - INFO - syri:135 - setting CP006105.2 as BK006934.2
2022-11-02 17:41:14,211 - Reading Coords - INFO - syri:135 - setting CP004488.2 as BK006935.2
2022-11-02 17:41:14,212 - Reading Coords - INFO - syri:135 - setting CP004578.2 as BK006936.2
2022-11-02 17:41:14,212 - Reading Coords - INFO - syri:135 - setting CP006317.1 as BK006937.2
2022-11-02 17:41:14,212 - Reading Coords - INFO - syri:135 - setting CP004738.2 as BK006938.2
2022-11-02 17:41:14,213 - Reading Coords - INFO - syri:135 - setting CP004833.2 as BK006939.2
2022-11-02 17:41:14,213 - Reading Coords - INFO - syri:135 - setting CP004968.2 as BK006940.2
2022-11-02 17:41:14,213 - Reading Coords - INFO - syri:135 - setting CP005272.2 as BK006941.2
2022-11-02 17:41:14,213 - Reading Coords - INFO - syri:135 - setting CP005061.2 as BK006942.2
2022-11-02 17:41:14,214 - Reading Coords - INFO - syri:135 - setting CP005174.1 as BK006943.2
2022-11-02 17:41:14,214 - Reading Coords - INFO - syri:135 - setting CP005369.2 as BK006944.2
2022-11-02 17:41:14,214 - Reading Coords - INFO - syri:135 - setting CP006421.1 as BK006945.2
2022-11-02 17:41:14,215 - Reading Coords - INFO - syri:135 - setting CP005470.2 as BK006946.2
2022-11-02 17:41:14,215 - Reading Coords - INFO - syri:135 - setting CP005572.1 as BK006947.3
2022-11-02 17:41:14,215 - Reading Coords - INFO - syri:135 - setting CP005666.2 as BK006948.2
2022-11-02 17:41:14,215 - Reading Coords - INFO - syri:135 - setting CP006197.2 as BK006949.2
2022-11-02 17:41:14,345 - syri - INFO - syri:214 - starting
2022-11-02 17:41:14,346 - syri - INFO - syri:214 - Analysing chromosomes: ['BK006934.2', 'BK006935.2', 'BK006936.2', 'BK006937.2', 'BK006938.2', 'BK006939.2', 'BK006940.2', 'BK006941.2', 'BK006942.2', 'BK006943.2', 'BK006944.2', 'BK006945.2', 'BK006946.2', 'BK006947.3', 'BK006948.2', 'BK006949.2']
2022-11-02 17:41:15,965 - getCTX - INFO - syri:214 - Identifying cross-chromosomal translocation and duplication for chromosome2022-11-02 17:41:15.965046
2022-11-02 17:41:19,753 - local_variation - INFO - syri:225 - Finding SVs in synOut.txt, invOut.txt, TLOut.txt, invTLOut.txt, ctxOut.txt
2022-11-02 17:41:20,542 - local_variation - INFO - syri:245 - Finding SNPs and small indels

Any idea how this occurred?

Thanks!

mnshgl0110 commented 1 year ago

I think this happens because syri cannot run show-snps from MUMmer. Could you please check that show-snps is in your PATH? You can also try running syri with the -s parameter.
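
One quick way to check the PATH question from Python is sketched below. It only reports where (or whether) an executable named show-snps resolves in the current environment; the helper name `find_tool` is made up for this example and is not part of syri.

```python
import shutil


def find_tool(name="show-snps"):
    """Return the absolute path of `name` if it resolves on PATH, else None.

    `find_tool` is a hypothetical helper for this thread, not a syri function.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("show-snps was not found on PATH")
    else:
        print(f"show-snps resolves to {path}")
```

Running this inside the same conda environment that syri runs in matters, because PATH can differ between environments.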

xtan1221 commented 1 year ago

@mnshgl0110 show-snps from MUMmer is installed and can be run from anywhere. I also ran it directly as a test:

show-snps out.filtered.delta >test-show-snps.txt

The output file was generated without any error:

-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000146045.2_R64_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna
-rw-r--r--  1 tan  staff    12M Nov  2 17:27 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
-rw-r--r--  1 tan  staff   684B Nov  2 17:41 TLOut.txt
-rw-r--r--  1 tan  staff    31K Nov  2 17:41 ctxOut.txt
-rw-r--r--  1 tan  staff   6.7K Nov  2 17:41 dupOut.txt
-rw-r--r--  1 tan  staff   728B Nov  2 17:41 invDupOut.txt
-rw-r--r--  1 tan  staff   274B Nov  2 17:41 invOut.txt
-rw-r--r--  1 tan  staff   838B Nov  2 17:41 invTLOut.txt
-rw-r--r--  1 tan  staff   352B Nov  2 17:41 mapids.txt
-rw-r--r--  1 tan  staff    10K Nov  2 17:41 notAligned.txt
-rw-r--r--  1 tan  staff   487K Nov  2 17:29 out.delta
-rw-r--r--  1 tan  staff    67K Nov  2 17:29 out.filtered.coords
-rw-r--r--  1 tan  staff   159K Nov  2 17:29 out.filtered.delta
lrwxr-xr-x  1 tan  staff    50B Nov  2 17:28 qrygenome@ -> GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
lrwxr-xr-x  1 tan  staff    31B Nov  2 17:27 refgenome@ -> GCA_000146045.2_R64_genomic.fna
-rw-r--r--  1 tan  staff     0B Nov  2 17:41 snps_init.txt
-rw-r--r--  1 tan  staff    95K Nov  2 17:41 sv.txt
-rw-r--r--  1 tan  staff    19K Nov  2 17:41 synOut.txt
-rw-r--r--  1 tan  staff     0B Nov  3 17:04 syri.log
-rw-r--r--  1 tan  staff    13M Nov  3 17:07 test-show-snps.txt

So I guess show-snps is not what is causing the problem?

mnshgl0110 commented 1 year ago

SyRI calls show-snps and saves the output in the snps_init.txt file. Later, it reads that file and selects the variants for each chromosome. Currently, snps_init.txt is empty (it should not be), which suggests that the reported error occurs when syri tries to get the variants for a chromosome from it.
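
To illustrate why an empty file could surface as a KeyError, here is a minimal sketch. It is not the code in findshv.pyx; the function names and the tab-separated layout with the chromosome ID in the first column are assumptions made only for this example.

```python
from collections import defaultdict


def group_by_chromosome(lines, chrom_col=0):
    """Group tab-separated records by the chromosome ID in column `chrom_col`.

    The column layout is an assumption for illustration, not syri's format.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        groups[fields[chrom_col]].append(fields)
    # Return a plain dict so that a missing chromosome raises KeyError.
    return dict(groups)


def variants_for(groups, chrom):
    """Look up the records for one chromosome; raises KeyError if absent."""
    return groups[chrom]


if __name__ == "__main__":
    empty = group_by_chromosome([])  # stands in for an empty snps_init.txt
    try:
        variants_for(empty, "CP006105.2")
    except KeyError as err:
        print(f"KeyError: {err}")
```

With no records at all, the very first chromosome that is looked up is missing, which matches the shape of the reported traceback.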

Did you also try with the -s parameter?

This could also be a macOS issue. Currently, syri starts a subprocess to run show-snps, and I wonder whether macOS has a problem with that. If possible, could you please try running syri on Linux?
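
A rough way to test the subprocess idea outside syri is sketched below. It runs a command, writes its standard output to a file, and reports the exit code, the captured stderr, and the output size, so a silently failing child process becomes visible. The helper `run_to_file` is hypothetical, and because the exact show-snps arguments that syri uses are not shown in this thread, the command is a parameter and the demo uses a stand-in.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run `cmd` (a list of arguments) and write its stdout to `out_path`.

    Returns (returncode, stderr_text, output_size_in_bytes).
    Hypothetical helper for this thread, not a syri function.
    """
    with open(out_path, "wb") as out:
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    stderr_text = proc.stderr.decode(errors="replace")
    return proc.returncode, stderr_text, os.path.getsize(out_path)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # show-snps call you want to test.
    cmd = [sys.executable, "-c", "print('ok')"]
    out_path = os.path.join(tempfile.mkdtemp(), "demo_out.txt")
    code, err, size = run_to_file(cmd, out_path)
    print(f"exit code {code}, {size} bytes written")
    if err:
        print(f"stderr: {err}")
```

A non-zero exit code or a zero-byte output file here would point at the child process rather than at the later parsing step.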

Alternatively, you can use BAM/PAF files as input; syri would then not use show-snps, and you would probably not get the error.

xtan1221 commented 1 year ago

I ran SyRI on Linux and no error occurred. So I think this is a macOS issue. Thanks for the help!