Closed by mptrsen 2 years ago
Hello @mptrsen,
The best practice is to reformat sequence IDs before running ANY analyses, as suggested in the README. The reformatting module in EDTA is only a safeguard in case users don't reformat beforehand. In your case, you may replace "scaffold" with "scf" to shorten the IDs.
The sequence ID length is restricted by RepeatMasker; you can read more in #239. Creating a dictionary will not help because complicated sequence IDs (e.g., with spaces or special characters) will break the GFF format.
Best, Shujun
OK, thank you. Since it was not phrased as a requirement in the README and EDTA does try to shorten the headers, I gave the header length no further thought until I ran into this issue. With pre-shortened headers, my dataset runs through fine. It's unfortunate that RepeatMasker cannot deal with long headers, but that's the way it is.
Use seqtk to replace headers:
seqtk rename genome_raw.fa seq | sed -r 's/^(>\S+)\s.*/\1/' >genome.fa
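After renaming, it may be worth confirming that every header is short, unique, and free of awkward characters before starting a long run. The following is a minimal Python sketch, not part of EDTA. The helper `check_headers` is hypothetical, the 13-character limit is simply the figure quoted in this thread, and the "safe" character set is a conservative assumption rather than a documented rule of RepeatMasker or the GFF format.

```python
import re

MAX_ID_LEN = 13  # assumption: the limit quoted in this thread
# Assumption: a conservative set of characters unlikely to cause trouble downstream.
SAFE_ID = re.compile(r"^[A-Za-z0-9_.-]+$")


def check_headers(lines, max_len=MAX_ID_LEN):
    """Return a list of (header, problem) tuples for the FASTA headers in `lines`."""
    problems = []
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].rstrip("\n")
        if len(header) > max_len:
            problems.append((header, "too long"))
        if not SAFE_ID.match(header):
            problems.append((header, "unsafe characters"))
        if header in seen:
            problems.append((header, "duplicate"))
        seen.add(header)
    return problems
```

Passing an open file handle for the renamed FASTA (for example `check_headers(open("genome.fa"))`) returns an empty list when no header trips any of the three checks.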
Header reformatting fails on a genome where the sequences are named sequentially, with sub-scaffolds such as:
The second round of reformatting only considers numbers in
\$_=">\$1" if /([0-9]+)/
(line 329 of EDTA.pl) and produces duplicates for these sub-scaffolds. I suggest renaming the sequences with ascending numbers and creating a dictionary (a hash in Perl) that maps these numbers to the original names.
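To illustrate the collision and the proposed alternative, here is a small Python sketch. It only mimics the quoted substitution (keep the first run of digits) and is not a reimplementation of EDTA's reformatting step. The header names are hypothetical stand-ins, and `digits_only` and `rename_sequentially` are illustrative helpers that do not exist in EDTA.

```python
import re


def digits_only(header):
    """Mimic a rule that keeps only the first run of digits in a header."""
    match = re.search(r"[0-9]+", header)
    return match.group(0) if match else header


def rename_sequentially(headers, prefix="seq"):
    """Assign prefix1, prefix2, ... and return (new_ids, mapping of new ID -> original)."""
    mapping = {}
    new_ids = []
    for index, original in enumerate(headers, start=1):
        new_id = f"{prefix}{index}"
        mapping[new_id] = original
        new_ids.append(new_id)
    return new_ids, mapping


# Hypothetical names chosen for illustration only.
headers = ["scaffold12_sub1", "scaffold12_sub2", "scaffold13"]

collapsed = [digits_only(h) for h in headers]
print(collapsed)  # ['12', '12', '13'] -> the two sub-scaffolds collide

new_ids, mapping = rename_sequentially(headers)
print(new_ids)          # ['seq1', 'seq2', 'seq3'] -> unique by construction
print(mapping["seq2"])  # scaffold12_sub2
```

Keeping the mapping alongside the renamed FASTA would allow results to be translated back to the original names afterwards.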
Out of interest: Why is it necessary to reformat the headers in the first place? Which program cannot deal with headers longer than 13 characters? Would it be an option to increase this maximum?