hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License

Error: detected space-or-tab-containing sequence #63

Closed ecbaker7-tamu closed 2 months ago

ecbaker7-tamu commented 2 months ago

Hi, I am trying to run this pipeline through Anaconda and I am getting the following error:

Error! File: data/Nitens-and-Greg/GCF_021461395.2_iqSchAmer2.1_genomic.fna - detected space-or-tab-containing sequence: NC_060119.1 Schistocerca americana isolate TAMUIC-IGC-003095 chromosome 1, iqSchAmer2.1, whole genome shotgun sequence
Please exclude or fix sequences with spaces and tabs.

From what I can tell, it is a problem with the formatting of the headers in the FASTA file, as noted on the GitHub page. I have seen the script to rename sequences back after running, but there is no script to rename them beforehand. Do you have a script I can use, or a suggestion on how to fix this error? I have included a screenshot of the full run below.

I am trying to run this on very large RefSeq genomes (~8Gb) and I would appreciate any help you can offer!

Screenshot 2024-06-18 163252

MaevaTecher commented 2 months ago

Hi, @MichaelHiller and @kirilenkobm. I just wanted to add details to this request. We understand this is probably an issue because we use chromosome names of the form "NC_XXX.Y", as discussed in issues #8 and #3 and noted in the README. Still, we would appreciate knowing the best way to rename the sequences in our input files so that we can afterward rename them back with your custom script, standalone_scripts/rename_chromosomes_back.py.

We want to use make_lastz_chains and TOGA to check whether ortholog detection using RefSeq on some of our genomes could be improved compared to OrthoFinder, so we want to keep the GCF accession. I am sure the solution is very straightforward, and I am sorry for the silly question. Bioinformatics is hard, even after many years!! o(>.<)o

Thanks for your time!

ecbaker7-tamu commented 2 months ago

I believe I've corrected it myself. The code I used is below; I saved it into a .sh file.

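#!/bin/bash

# Input FASTA and the cleaned copy to write.
input_file="input-name.fna"
output_file="input-name_only_contig_name.fna"

# On header lines, keep ">" plus the first whitespace-delimited token and
# drop the rest of the line. Sequence lines do not start with ">" and are
# left unchanged.
sed -e 's/^\(>[^[:space:]]*\).*/\1/' "$input_file" > "$output_file"

echo "Processing complete. Output written to $output_file."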
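
A quick sanity check on the cleaned file may be worth running before starting the pipeline. The following is only a minimal sketch: the two function names are hypothetical helpers, not part of make_lastz_chains. The first lists any header lines that still contain a space or tab. The second lists headers that occur more than once, which could in principle happen when two descriptions shared the same first token.

```shell
#!/bin/bash

# Hypothetical helper: print FASTA header lines that still contain a space or tab.
# Prints nothing when every header is clean.
headers_with_whitespace() {
    grep '^>' "$1" | grep '[[:space:]]'
}

# Hypothetical helper: print header lines that appear more than once.
# Prints nothing when all sequence names are unique.
duplicate_headers() {
    grep '^>' "$1" | sort | uniq -d
}
```

For example, `headers_with_whitespace input-name_only_contig_name.fna` and `duplicate_headers input-name_only_contig_name.fna` should both print nothing if the renaming worked as intended.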

MichaelHiller commented 2 months ago

Thx for sharing this. This is the easiest solution.