mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

Ragout unable to read target fasta file #67

Closed by rstewa03 3 years ago

rstewa03 commented 3 years ago

I ran Ragout with the following command:

/data/programs/Ragout_v2.3/bin/ragout Pieris_napi_fullAsm_Pmac000.1_polished.rcp --outdir out/ -t 30

and got the error:

[14:00:57] INFO: Starting Ragout v2.3
[14:00:57] INFO: Running withs synteny block sizes '[5000, 500, 100]'
[14:00:57] WARNING: Using existing Sibelia results from previous run
[14:00:57] WARNING: Use --overwrite to force alignment
[14:00:57] INFO: Inferring phylogeny from synteny blocks data
[14:00:57] INFO: Reading /mnt/griffin/racste/P_macdunnoughii/Pmac_genome_versions/pseudochromosomal/out/sibelia-workdir/100/blocks_coords.txt
[14:01:00] ERROR: An error occured while running Ragout:
[14:01:00] ERROR: No sequences read for genome Pmac000.1_polished. Check recipe for correctness.

My recipe file is:

.references = Pnapi
.target = Pmac000.1_polished

Pnapi.fasta = /mnt/griffin/racste/P_macdunnoughii/Pmac_genome_versions/faDnaPolishing/masking/Pieris_napi_fullAsm.cleaned.hardmasked.fa
Pmac000.1_polished.fasta = ./Pmac000.1_polished.fasta

My working directory contains the file Pmac000.1_polished.fasta.

Can you advise me on how I might proceed?

mikolmogorov commented 3 years ago

Hi,

It is very likely that your target genome is too divergent from the reference, and no synteny was found between the genomes. In some cases, lowering the synteny block size might help, as described in the manual. If you could post the ragout.log file, I should be able to give more advice.

rstewa03 commented 3 years ago

Thanks for the quick response. The two species are sister taxa and can hybridize in the wild. They are eukaryotes, so the genomes are large (320-350M bases), which I know Sibelia has difficulty processing (leading to my other SibeliaZ question). Sibelia took a while, but was able to run using the reference genome. The recipe was:

.references = Pnapi
.target = Pmac000.1_polished

Pnapi.fasta = /mnt/griffin/racste/P_macdunnoughii/Pmac_genome_versions/faDnaPolishing/masking/Pieris_napi_fullAsm.cleaned.hardmasked.fa
Pmac000.1_polished.fasta = ./Pmac000.1_polished.fasta

and the contents of ragout.log were:

[14:13:59] root: INFO: Starting Ragout v2.3
[14:13:59] root: INFO: Running withs synteny block sizes '[5000, 500, 100]'
[14:13:59] root: WARNING: Using existing Sibelia results from previous run
[14:13:59] root: WARNING: Use --overwrite to force alignment
[14:13:59] root: INFO: Inferring phylogeny from synteny blocks data
[14:13:59] root: INFO: Reading /mnt/griffin/racste/P_macdunnoughii/Pmac_genome_versions/pseudochromosomal/out/sibelia-workdir/100/blocks_coords.txt
[14:14:01] root: ERROR: An error occured while running Ragout:
[14:14:01] root: ERROR: No sequences read for genome Pmac000.1_polished. Check recipe for correctness.

Unfortunately, I don't have the log from the first time I ran it, which included the sibelia portion, but all I did in the meantime was to copy the target fasta into the working directory, rather than using the path to the draft genome directory.

Do you think it could be a simple issue with the reference naming convention, which includes a '.'? Would this conflict with the genome.scaffold format that Ragout depends on to process blocks_coords.txt? Incompatible naming was an issue I ran into when trying to use the .maf from SibeliaZ, which I solved by renaming the fasta headers (they are now, e.g., Pmacpol.Sc0000001) and simplifying the name of the target genome in the recipe to Pmacpol (from Pmac000.1_polished).
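
For anyone hitting the same error, the renaming step described above could be sketched roughly as follows. This is only an illustration, not Ragout's own code: the function names `split_genome_prefix` and `rename_headers` are hypothetical, and the sketch assumes that sequence IDs are interpreted as genome.scaffold by splitting at the first dot, which is the behaviour suspected in this thread rather than something confirmed here.

```python
def split_genome_prefix(seq_id):
    """Split a sequence ID at its FIRST dot (assumed genome.scaffold convention)."""
    genome, _, scaffold = seq_id.partition(".")
    return genome, scaffold


def rename_headers(lines, genome):
    """Yield FASTA lines with each header rewritten as '>genome.scaffold'.

    `genome` is the dot-free name used in the recipe (e.g. 'Pmacpol').
    Only the first whitespace-delimited token of the old header is kept.
    """
    if "." in genome:
        raise ValueError(f"genome name {genome!r} must not contain '.'")
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            yield f">{genome}.{fields[0]}\n"
        else:
            yield line


if __name__ == "__main__":
    # Under the first-dot assumption, a dotted genome name is cut short,
    # so it would no longer match the name given in the recipe.
    print(split_genome_prefix("Pmac000.1_polished.Sc0000001"))

    # Toy in-memory example; for a real file, pass an open file handle instead.
    toy_fasta = [">Sc0000001 len=4\n", "ACGT\n"]
    print("".join(rename_headers(toy_fasta, "Pmacpol")), end="")
```

After renaming, the recipe entries (`.target` and the matching `.fasta` line) would need to use the same dot-free genome name, and the synteny blocks would need to be recomputed from the renamed file.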

mikolmogorov commented 3 years ago

Thanks!

As I understand it, removing the dots from the genome names fixed the issue for the SibeliaZ run? If so, it is likely that the same issue affected the Sibelia run originally reported in this thread.

If SibeliaZ has worked for you, I don't think there is any need to try to make Sibelia work. In general, Sibelia was designed for very close bacterial genomes (e.g., a few % divergence). Eukaryotes, with more non-coding sequence, higher repeat content, and potentially lower similarity, are a challenge. My first recommendation for larger genomes is Cactus, but it is somewhat more challenging to run than SibeliaZ.

Mikhail