rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

FileNotFoundError during `trycycler cluster` #11

Closed. aruginkgo closed this issue 3 years ago.

aruginkgo commented 3 years ago

For some reason I get `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpz1qeqdno/A_assemblies/canu_0_pos.fasta'` when running `trycycler cluster`, during the distance matrix step. It looks like the temp directory is being created, but not the A_assemblies directory inside it.

I think I am using the latest version of Trycycler. That is to say, I ran `python3 setup.py install` in a directory called Trycycler-0.4.2, but the version.py in it still says 0.4.1.

I was able to run Trycycler on a different set of assemblies, so it might be something on my end. Unfortunately I can't share the sequences, but I can try to put together a reproducible example.

Building distance matrix (2021-01-20 19:08:34)
    Mash is used to build a distance matrix of all contigs in the assemblies.

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/trycycler", line 11, in <module>
    load_entry_point('Trycycler==0.4.1', 'console_scripts', 'trycycler')()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/__main__.py", line 40, in main
    cluster(args)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/cluster.py", line 41, in cluster
    matrix = distance_matrix(seqs, seq_names, args.distance)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/cluster.py", line 232, in distance_matrix
    mash_matrix = get_mash_dist_matrix(seq_names, seqs, distance, indent=False)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/mash.py", line 28, in get_mash_dist_matrix
    pos_sketches, neg_sketches = make_mash_sketches(seq_names, seqs, temp_dir)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/mash.py", line 63, in make_mash_sketches
    write_seq_to_fasta(seq_pos, seq_name, fasta_pos)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/misc.py", line 155, in write_seq_to_fasta
    with open(filename, 'wt') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp6f7zakqj/A_assemblies/canu_0_pos.fasta'

aruginkgo commented 3 years ago

for what it's worth, I threw in

from pathlib import Path
os.makedirs(temp_dir / Path(seq_name).parent, exist_ok=True)

in make_mash_sketches, just after fasta_pos and fasta_neg are defined. That got it through the distance matrix step, but it then crashed at the clustering step with a similar error:

cluster/cluster_001/1_contigs:
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/trycycler", line 11, in <module>
    load_entry_point('Trycycler==0.4.1', 'console_scripts', 'trycycler')()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/__main__.py", line 40, in main
    cluster(args)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/cluster.py", line 42, in cluster
    cluster_numbers = complete_linkage(seqs, seq_names, depths, matrix, args.distance, args.out_dir)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/Trycycler-0.4.1-py3.7.egg/trycycler/cluster.py", line 325, in complete_linkage
    with open(seq_fasta, 'wt') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'cluster/cluster_001/1_contigs/A_assemblies/canu_0.fasta'

The 1_contigs directory is created but left empty.

edit: I added this hacky "fix" (big air quotes)

os.makedirs(cluster_dir / pathlib.Path(name).parent, exist_ok=True)
seq_fasta = cluster_dir / f'{pathlib.Path(name).stem}.fasta'

to the loop in cluster.py (around line 324). With it, the run no longer crashes and creates the final cluster_001/1_contigs/*_0.fasta files, but I'm not sure why it's looking for the A_assemblies directory to begin with.

trycycler reconcile worked after that as well.
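For anyone reading along, here is a small sketch of how pathlib splits a name that contains a slash, which is what the workaround above relies on. The name `A_assemblies/canu_0` is taken from the traceback; `A_assemblies/tig.1` is a made-up name added only to show a side effect of `.stem`. `PurePosixPath` is used so the result is the same on any platform.

```python
from pathlib import PurePosixPath

# Name taken from the traceback above: the slash makes pathlib (and the OS)
# read it as "directory A_assemblies, file canu_0".
name = PurePosixPath('A_assemblies/canu_0')
parent = str(name.parent)   # the part the OS treats as a directory
stem = name.stem            # the final component, minus any suffix

# Hypothetical name with a dot: .stem also strips everything after the
# last dot, so the workaround would shorten this filename.
dotted_stem = PurePosixPath('A_assemblies/tig.1').stem

print(parent, stem, dotted_stem)
```

So the workaround writes `canu_0.fasta` directly into the cluster directory. Two things to keep in mind with this approach: a name containing a dot loses its last dot-separated part, and two names that differ only in the part before the slash would map to the same file.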

rrwick commented 3 years ago

Thanks for spotting this bug! If I understand correctly, one of your input assemblies has a contig named assemblies/canu_0. The slash is causing the problem, because the `trycycler cluster` command saves contigs to temporary files using their contig name as a filename. So it was trying to save /tmp/tmp6f7zakqj/A_assemblies/canu_0_pos.fasta. The filename was meant to be A_assemblies/canu_0_pos.fasta, but the slash made the operating system treat A_assemblies/ as a subdirectory of the temp directory, and that subdirectory didn't exist.
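The failure mode can be reproduced outside Trycycler with a few lines of Python. This is a minimal sketch, not Trycycler's code: `try_write_contig` is a hypothetical helper that only mimics the "use the contig name as a filename" step, and the sequence `ACGT` is a placeholder.

```python
import tempfile
from pathlib import Path


def try_write_contig(temp_dir, seq_name, seq):
    """Write seq to <temp_dir>/<seq_name>_pos.fasta.

    Returns None on success, or the FileNotFoundError if the write fails.
    (Hypothetical helper for illustration only.)
    """
    fasta_pos = Path(temp_dir) / (seq_name + '_pos.fasta')
    try:
        with open(fasta_pos, 'wt') as f:
            f.write(f'>{seq_name}\n{seq}\n')
    except FileNotFoundError as e:
        return e
    return None


with tempfile.TemporaryDirectory() as temp_dir:
    # A plain name becomes a file directly inside the temp directory.
    ok = try_write_contig(temp_dir, 'A_contig_1', 'ACGT')
    # A name with a slash points into a subdirectory that was never created.
    bad = try_write_contig(temp_dir, 'A_assemblies/canu_0', 'ACGT')
    created = sorted(p.name for p in Path(temp_dir).iterdir())

print(ok, type(bad).__name__, created)
```

The first write succeeds and the second fails with the same `FileNotFoundError` as in the traceback, because `open()` creates files but never creates missing parent directories.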

I've taken the easy way out of this one and just made Trycycler check for slashes in contig names and quit with an error if they are there. That was easier than ensuring slash-containing contig names don't cause a crash :smile:
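A check of that kind could look roughly like the sketch below. This is an illustration of the idea only and not the code that went into v0.4.3: the function names and the error message are assumptions.

```python
import sys


def find_names_with_slashes(seq_names):
    """Return the contig names that contain a forward slash."""
    return [name for name in seq_names if '/' in name]


def check_contig_names(seq_names):
    """Quit with an error message if any contig name contains a slash.

    (Hypothetical function; the message text is made up for this sketch.)
    """
    bad_names = find_names_with_slashes(seq_names)
    if bad_names:
        sys.exit('Error: contig names cannot contain slashes: '
                 + ', '.join(bad_names))
```

Calling `check_contig_names(['A_contig_1', 'A_assemblies/canu_0'])` before any files are written would stop the run with a clear message instead of a traceback partway through. Rejecting such names up front is simpler than sanitising them, since sanitising would also need to guard against two different names mapping to the same filename.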

Also, thanks for pointing out the version number discrepancy! I've made a new version with the fix (v0.4.3), and now both GitHub and the code agree.