ryanmelnyk / PyParanoid

Rapid and scalable homolog identification for bacterial genomes
MIT License
32 stars 7 forks source link

IdentifyOrthologs.py fails... strains from PropagateGroups.py not in matrix #7

Open Bjartur108 opened 6 years ago

Bjartur108 commented 6 years ago

When running IdentifyOrthologs.py with the default settings, I get the following message in the output:

Parsing matrix to identify orthologs... A_aeriphila_YT10 not found in matrix. Check strainlist.

Taking a look at homolog_matrix.txt, it appears that none of the strains I ran PropagateGroups.py on are in the matrix (or in locustag_matrix.txt), only those from the initial BuildGroups.py command.

The error file includes the following (including the versions of HMMER and MCL in case they are relevant):

Module MCL 12.068 Loaded. Module HMMER 3.1b2 Loaded. Traceback (most recent call last): File "/home/okra108/.conda/envs/pyparanoid/bin/IdentifyOrthologs.py", line 252, in main() File "/home/okra108/.conda/envs/pyparanoid/bin/IdentifyOrthologs.py", line 225, in main orthos = parse_threshold_matrix(args.threshold,strains) File "/home/okra108/.conda/envs/pyparanoid/bin/IdentifyOrthologs.py", line 59, in parse_threshold_matrix strain_vals = [vals[i] for i in indices UnboundLocalError: local variable 'indices' referenced before assignment

Do you know what the issue could be here? I didn't get any errors when I ran PropagateGroups.py, but I see that the prop_homologfaa outdir is empty (the other prop* outdirs all have files for each strain).

Thanks, Nick

ryanmelnyk commented 6 years ago

Do you have error/output messages for PropagateGroups.py? Also, does all_groups.faa exist in your output folder and contain protein sequences?

Bjartur108 commented 6 years ago

Yes, all_groups.faa exists and contains protein sequences.

However, it looks like I was wrong- PropagateGroups.py did not complete successfully when I ran it again.

Output:

Making diamond databases for 29 strains... 20 remaining... 10 remaining... Done! Running diamond on all 29 strains... 20 remaining... 10 remaining... Done! Getting gene lengths... Parsing diamond results... 80 remaining... 70 remaining... 60 remaining... 50 remaining... 40 remaining... 30 remaining... 20 remaining... 10 remaining... Done! Running inparanoid on 29 strains...

The error message:

[mclIO] read native interchange 47308x4122 matrix with 47308 entries Error: Incomplete database file. Database building did not complete successfully. Error: Error detecting input file format. First line seems to be blank. Error: Incomplete database file. Database building did not complete successfully. Blast output file A->B is missing Traceback (most recent call last): File "/home/okra108/.conda/envs/pyparanoid/bin/PropagateGroups.py", line 224, in main() File "/home/okra108/.conda/envs/pyparanoid/bin/PropagateGroups.py", line 216, in main group_members = parse_inparanoid(new_strains) File "/home/okra108/.conda/envs/pyparanoid/bin/PropagateGroups.py", line 162, in parse_inparanoid for line in open(os.path.join(outdir,"prop_paranoid_output","{}.CONSENSUS.txt".format(s))): IOError: [Errno 2] No such file or directory: '/home/okra108/Comp_Bif_genomes/Pangenomics/bif_pyp/prop_paranoid_output/B_pseudocatenulatum_X.CONSENSUS.txt'

The underlying issue is that one of the fasta files in my database was blank. I had inadvertently included a placeholder file for a strain I needed to annotate. After removing that strain, PropagateGroups.py ran without error, and I could easily add the strain subsequently by running the command again after my annotation finished and I created another strainlist file.

Afterwards, I ran IdentifyOrthologs.py without issue.

I think the reason I didn't catch this error may be that when I deleted the prop_ outdirs to fix minor errors and rerun PropagateGroups.py the first time, I missed some of the output of PropagateGroups.py. Instead of the error I quoted above, I got a simple "check your strain list" message in the output file that didn't make it clear where the issue was. I saw the files in the prop outdirs and assumed the script had completed since I didn't get an overt error message (the error file just printed a list of the modules loaded, as it does when the job completes successfully).

Module slurm/17.11.5 loaded Module openmpi/3.0.1 loaded Module gcc/7.3.0 loaded Module bio/1.0 loaded Module bio/1.0 loaded Module bio3/1.0 loaded Module MCL 12.068 Loaded. Module HMMER 3.1b2 Loaded.

I think a possible area of improvement here would be preventing strainlist messages (which do not cause the script to halt even with set -e specified) from blocking/subsuming other output/error messages that are more clearly linked to the underlying issue.

Best, Nick