schneebergerlab / msyd

MIT License

Issues while analysing highly similar genomes #6

Open mnshgl0110 opened 1 year ago

mnshgl0110 commented 1 year ago

I think pansyri.pansyn.find_overlaps is incomplete: it raises an error when I try to get pansyntenic regions for two highly similar (actually simulated) query genomes. The files are here: /srv/netscratch/dep_mercier/grp_schneeberger/projects/syri2/results/human/simulatedgenomes/chr22

syns, alns = util.parse_input_tsv('genomes.tsv')
df = util.coresyn_from_lists(syns, alns, SYNAL=False)
Traceback (most recent call last):
  File "/srv/netscratch/dep_mercier/grp_schneeberger/software/anaconda3_2021/envs/mgpy3.8/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-59-90a0de3ea250>", line 1, in <cell line: 1>
    df = util.coresyn_from_lists(syns, alns, SYNAL=False)
  File "pansyri/pyxfiles/util.pyx", line 75, in pansyri.util.coresyn_from_lists
  File "pansyri/pyxfiles/pansyn.pyx", line 262, in pansyri.pansyn.find_multisyn
  File "pansyri/pyxfiles/pansyn.pyx", line 132, in pansyri.pansyn.find_overlaps
  File "pansyri/pyxfiles/util.pyx", line 85, in pansyri.util.get_orgs_from_df
TypeError: reduce() of empty sequence with no initial value

We can discuss it when you have some time.
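
For reference, the message in the traceback is the one `functools.reduce` raises when it is given an empty sequence and no initial value. The sketch below only illustrates that mechanism. The function names and the idea of reducing over sets of organism names are assumptions made for illustration, not the actual code of `get_orgs_from_df`.

```python
from functools import reduce


def union_of_orgs_unsafe(org_sets):
    """Hypothetical helper: union a list of organism-name sets.

    With no initial value, reduce() raises TypeError on an empty list.
    """
    return reduce(lambda acc, s: acc | s, org_sets)


def union_of_orgs(org_sets):
    """Same union, but with an explicit initial value (the empty set).

    An empty input now yields an empty set instead of raising.
    """
    return reduce(lambda acc, s: acc | s, org_sets, set())


if __name__ == "__main__":
    try:
        union_of_orgs_unsafe([])
    except TypeError as err:
        print("unsafe version failed:", err)
    print("safe version returned:", union_of_orgs([]))
```

If the reduction really runs over an empty collection here, an initial value would hide the crash, but the more useful question is probably why no regions reach that point for these inputs.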

lrauschning commented 1 year ago

(reposting here, as replying to the email seems not to have worked)

Hi Manish, I got the same error message yesterday when I was fixing an issue related to the parallelization, which I discovered while benchmarking. The error is fixed now and when I tried with the latest commit in the repo (branch leon, now also merged to master), pansyri -i genomes.tsv --sp --syn did not throw an error. Let me know if there are still issues when running the current version! Cheers, Leon

mnshgl0110 commented 1 year ago

Hi Leon, so this is the current status:

import pansyri.util as util
from pansyri.pansyn import find_multisyn
syns, alns = util.parse_input_tsv('genomes.tsv')
df = util.coresyn_from_lists(syns, alns, SYNAL=False) # Does not work
df = find_multisyn(syns, alns, SYNAL=False) # Works, but gives crosssyn as well
df = find_multisyn(syns, alns, SYNAL=False, only_core=True) # Does not work

We need to ensure that this is working for all use cases.

mnshgl0110 commented 1 year ago

It seems that this issue arises when pansyri cannot handle the input file names in genomes.tsv, specifically how the bam/syri.out files are named.

lrauschning commented 1 year ago

I can reproduce the error. It's weird that this only arises when calling core synteny. On the ampril dataset, all combinations work. I'll look more into this later.

lrauschning commented 1 year ago

Okay, I think I might have fixed this in c565b85. There was still some code specific to testing on the ampril dataset in there, which was also causing some other issues.

mnshgl0110 commented 1 year ago

Earlier, it seemed to be working when the filenames were `ref_qry1.bam` and `ref_qry2.bam`, but not when they were something else. Were you able to reproduce and possibly fix that?

lrauschning commented 1 year ago

I think this commit should fix the need for this filename format (it was hardcoded to match the names in the Ampril dataset). I'll try to reproduce it and see if normal naming works in the next few days.

lrauschning commented 1 year ago

Ah, sorry, I forgot to test it again after the commit. My account for the HPC at Cologne has expired, so I'll test it again once the account is renewed. Testing locally, everything works on the ampril dataset, but that's not really a surprise.