Header reformatting fails

mptrsen commented 2 years ago

Header reformatting fails on a genome where the sequences are named sequentially, with sub-scaffolds such as:

scaffold_1
scaffold_2
scaffold_3
scaffold_4_3
scaffold_4_2
scaffold_4_1
scaffold_5
scaffold_6
scaffold_7_2
scaffold_7_1
[...]
scaffold_150_2
scaffold_150_1
scaffold_151
scaffold_152_3
scaffold_152_2
scaffold_152_1
[...]

The second round of reformatting keeps only the first run of digits in each header ($_=">$1" if /([0-9]+)/, line 329 of EDTA.pl), so scaffold_4_3, scaffold_4_2 and scaffold_4_1 all collapse to the same ID and the subscaffolds end up as duplicates.
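To illustrate the collision, here is a minimal Python sketch that mimics the quoted Perl substitution. It is not EDTA code: the function name second_round_id and the sample headers are made up for illustration, and it assumes the headers reach this step still carrying their sub-scaffold suffixes.

```python
import re
from collections import Counter


def second_round_id(header: str) -> str:
    """Mimic `$_=">$1" if /([0-9]+)/`: keep only the first run of digits.

    Hypothetical helper for illustration; headers without digits are left as-is.
    """
    match = re.search(r"[0-9]+", header)
    return f">{match.group(0)}" if match else header


# Sample headers taken from the listing above.
headers = [">scaffold_4_3", ">scaffold_4_2", ">scaffold_4_1", ">scaffold_5"]
renamed = [second_round_id(h) for h in headers]

# Any ID that occurs more than once is a collision.
duplicates = sorted(name for name, count in Counter(renamed).items() if count > 1)

print(renamed)     # ['>4', '>4', '>4', '>5']
print(duplicates)  # ['>4']
```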

I suggest renaming the sequences with ascending numbers and creating a dictionary (a hash in Perl) that maps these numbers back to the original names.
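A rough sketch of what I have in mind, written in Python rather than Perl for brevity. The function name rename_sequentially and the "seq" prefix are placeholders of mine, not anything that exists in EDTA.

```python
def rename_sequentially(fasta_lines, prefix="seq"):
    """Give every FASTA record an ascending ID and remember the original name.

    Hypothetical helper: returns (renamed_lines, id_map), where id_map maps
    each new ID to the original header text (without the leading '>').
    """
    id_map = {}
    renamed_lines = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_id = f"{prefix}{len(id_map) + 1}"
            id_map[new_id] = line[1:]
            renamed_lines.append(f">{new_id}")
        else:
            # Sequence lines pass through unchanged.
            renamed_lines.append(line)
    return renamed_lines, id_map


example = [">scaffold_4_3", "ACGT", ">scaffold_4_2", "GGCC", ">scaffold_5", "TTAA"]
renamed, id_map = rename_sequentially(example)
print(renamed)
print(id_map)
```

Because the new IDs are generated from a counter, they cannot collide, and the map could be written to a file so that results can be translated back to the original names afterwards.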

Out of interest: why is it necessary to reformat the headers in the first place? Which program cannot deal with headers longer than 13 characters? Would it be an option to increase this maximum?

oushujun commented 2 years ago

Hello @mptrsen,

The best practice is to reformat sequence IDs before running ANY analyses, as suggested in the Readme. The reformatting module in EDTA is only a safeguard in case users don't reformat beforehand. In your case, you may replace "scaffold" with "scf" to achieve the goal.

The sequence ID length is restricted by RepeatMasker. You may read more in #239. Creating a dictionary will not help because complicated sequence IDs (e.g., with spaces or special characters) will break the GFF format.

Best, Shujun

mptrsen commented 2 years ago

Ok, thank you. Since it was not phrased as a requirement in the Readme and EDTA does try to shorten the headers, I gave the header length no further thought until I ran into this issue. With pre-shortened headers, my dataset runs through fine. It's unfortunate that RepeatMasker cannot deal with long headers, but that's the way it is.
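For anyone who wants to check their headers before starting a run, something along these lines might help. It is only a sketch: the 13-character limit is the figure quoted in this thread, and check_headers is a name I made up, not part of EDTA or RepeatMasker.

```python
MAX_ID_LENGTH = 13  # limit quoted in this thread; treat it as an assumption


def check_headers(fasta_lines, max_len=MAX_ID_LENGTH):
    """Report sequence IDs that are too long, contain whitespace, or repeat.

    Hypothetical helper: returns a list of (sequence_id, problem) tuples.
    """
    problems = []
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        seq_id = line[1:].rstrip("\n")
        if len(seq_id) > max_len:
            problems.append((seq_id, "too long"))
        if any(ch.isspace() for ch in seq_id):
            problems.append((seq_id, "contains whitespace"))
        if seq_id in seen:
            problems.append((seq_id, "duplicate"))
        seen.add(seq_id)
    return problems


sample = [">scaffold_150_2", "ACGT", ">scf_4_1", "GGCC", ">scf_4_1", "TTAA"]
problems = check_headers(sample)
print(problems)
```

An empty list means none of these three problems were found; it does not guarantee that every downstream tool will accept the IDs.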

arslan9732 commented 1 year ago

Use seqtk to replace headers:

seqtk rename genome_raw.fa seq | sed -r 's/^(>\S+)\s.*/\1/' >genome.fa

oushujun / EDTA

Header reformatting fails #283