oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

ERROR in TE annotation stats #439

Open shiyi-pan opened 3 months ago

shiyi-pan commented 3 months ago

Hi oushujun, thank you for developing this great tool for genome repeat annotation. I want to use EDTA to annotate my genome, but I ran into an error.

I installed EDTA v2.1.3 with mamba using the following command (I can't install the latest version because of the server configuration):

mamba env create -f EDTA.yml -p /gss1/home/ruanjian/EDTA

Here is the command I used to annotate my genome:

perl /gss1/home//c.annotation/a.TEs_annotation/EDTA/EDTA.pl --genome long.fa --species others --step all --overwrite 1 --threads 16 --sensitive 1 --anno 1 --evaluate 1

Here is the error I met:

Massive resequencing efforts have been undertaken to catalog allelic variants in major crop species including soybean, but the scope of the information for genetic variation often depends on short sequence reads mapped to the extant reference genome. Additional de novo assembled genome sequences provide a unique opportunity to explore a dispensable genome fraction in the pan-genome of a species. Here, we report the de novo assembly and annotation of Hwangkeum, a popular soybean cultivar in Korea. The assembly was constructed using PromethION nanopore sequencing data and two genetic maps and was then error-corrected using Illumina short-reads and PacBio SMRT reads. The 933.12Mb assembly was annotated as containing 79,870 transcripts for 58,550 genes using RNA-Seq data and the public soybean annotation set. Comparison of the Hwangkeum assembly with the Williams 82 soybean reference genome sequence (Wm82.a2.v1) revealed 1.8 million single-nucleotide polymorphisms, 0.5 million indels, and 25 thousand putative structural variants. However, there was no natural megabase-scale chromosomal rearrangement. Incidentally, by adding two novel subfamilies, we found that soybean contains four clearly separated subfamilies of centromeric satellite repeats. Analyses of satellite repeats and gene content suggested that the Hwangkeum assembly is a high-quality assembly. This was further supported by comparison of the marker arrangement of anthocyanin biosynthesis genes and of gene arrangement at the Rsv3 locus. Therefore, the results indicate that the de novo assembly of Hwangkeum is a valuable additional reference genome resource for characterizing traits for the

GFF> line 7.
Use of uninitialized value $extra in substitution (s///) at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 101, line 7.
Use of uninitialized value $extra in pattern match (m//) at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 102, line 7.
Use of uninitialized value $element_end in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $TE_class in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $method in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $score in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $strand in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $phase in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Use of uninitialized value $type in concatenation (.) or string at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/gff2bed.pl line 110, line 7.
Argument "Binary:matches.." isn't numeric in numeric gt (>) at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/split_overlap.pl line 26, line 1.
Argument "matches" isn't numeric in numeric gt (>) at /gss1/home//c.annotation/a.TEs_annotation/EDTA/util/split_overlap.pl line 26, line 1.
Warning: LOC list - is empty.

Count all-versus-all misclassifications using the cleanup_nested.pl .stat file
perl count_nested.pl -in sequence.fa.stat -cat [redun|nested|all] > sequence.fa.stat.sum

Count all-versus-all misclassifications using the cleanup_nested.pl .stat file
perl count_nested.pl -in sequence.fa.stat -cat [redun|nested|all] > sequence.fa.stat.sum

Count all-versus-all misclassifications using the cleanup_nested.pl .stat file
perl count_nested.pl -in sequence.fa.stat -cat [redun|nested|all] > sequence.fa.stat.sum

ERROR: TE annotation stats results not found in long.fa.mod.EDTA.TE.fa.stat!

Could you help me fix this problem? Thank you very much.

oushujun commented 3 months ago

Hi,

Sorry for the delay. Your error message seems truncated. Please double check your genome file or provide a more complete program output.

Thanks! Shujun

shiyi-pan commented 3 months ago

Thank you for your reply, oushujun. Can you tell me which aspects of the genome file I should examine? My genome contains some short contigs; does that affect how EDTA works?

oushujun commented 3 months ago

Your error message seems to contain an abstract, which should not happen if your genome file is what it is meant to be. You may want to check if it's the correct file or if the sequence names are simple.

Shujun

shiyi-pan commented 3 months ago

I'm sorry to bother you again, Shujun. I'm not sure what "abstract" means here. The sequence names in my genome file look like this: RagTag_0001, RagTag_0002, ..., RagTag_1695. The file looks like a normal FASTA file: the sequences consist of the four bases A, T, C, and G plus the ambiguous base N.

Thank you again, Shujun.
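For readers who want to check a genome file along the lines discussed above (simple sequence names, only A/T/C/G/N in the sequences, no stray text), here is a minimal Python sketch. It is not part of EDTA. The function name `check_fasta` and the definition of a "simple" name (letters, digits, `_`, `.`, `-`) are assumptions made for illustration; EDTA's own requirements may differ.

```python
import re
import sys

# Assumption: a "simple" sequence name uses only letters, digits, '_', '.', or '-'.
SIMPLE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")
# Bases mentioned in the thread: A, T, C, G, plus the ambiguous base N (either case).
ALLOWED_BASES = set("ACGTNacgtn")


def check_fasta(path):
    """Return a list of (line_number, message) problems found in a FASTA file."""
    problems = []
    seen_header = False
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if not line:
                continue  # blank lines are tolerated in this sketch
            if line.startswith(">"):
                seen_header = True
                name = line[1:]
                if not SIMPLE_NAME.match(name):
                    problems.append((number, f"non-simple sequence name: {name!r}"))
            elif not seen_header:
                problems.append((number, "text before the first '>' header"))
            else:
                bad = set(line) - ALLOWED_BASES
                if bad:
                    problems.append(
                        (number, f"unexpected characters: {''.join(sorted(bad))!r}")
                    )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one line per problem; no output means no problems were found.
    for number, message in check_fasta(sys.argv[1]):
        print(f"line {number}: {message}")
```

Saved under a name of your choice (for example `check_fasta.py`, a hypothetical file name), it can be run as `python check_fasta.py long.fa`. An empty result only means the file passes these particular checks.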

oushujun commented 3 months ago

This is the error message in your initial post:

image

I don't understand why EDTA would spit out an abstract-like paragraph in its error message...

From your last reply, it seems that your genome file is ok. Please update your EDTA to 2.2.1 and try again.

Shujun

shiyi-pan commented 3 months ago

Thank you for your reply, oushujun. I'm sorry for my carelessness. I tried to copy some of the normal log content that preceded the error, but somehow copied text from the paper I was reading instead.

I updated EDTA and ran into another problem. Here are the error messages:

Species: others
find: ‘./TIR-Learner-+-TIRvish.gff3’: No such file or directory

unknown/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
unknown/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
unknown/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/U not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
Sun Mar 17 11:43:02 CST 2024 Homology-based annotation of TEs using formated.ragtag.scaffold.fasta.mod.EDTA.TElib.fa from scratch.

Warning: SINE/U not found in the TE_SO database, will use the general term 'repeat_region SO:0000657' to replace it.

Warning: SINE/U not found in the TE_SO database, will use the general term 'repeat_region SO:0000657' to replace it.

The final TEanno.sum file doesn't have the SINE class.

image

Thank you again.

oushujun commented 3 months ago

What command did you use? Thanks!

shiyi-pan commented 3 months ago

Thank you, Shujun. Here is my command:

mamba activate edta

perl EDTA.pl --genome formated.ragtag.scaffold.fasta --species others --step all --overwrite 1 --threads 8 --sensitive 1 --anno 1 --evaluate 1

By the way, I found two TE_Sequence_Ontology.txt files in EDTA with different file sizes:

image

Do I need to unify the contents of the two files? If so, which one should I use? Thank you again.
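One way to see how two same-named files differ before deciding whether to unify them is to compare them directly. The sketch below is a generic Python comparison, not an EDTA utility. The function names are hypothetical, and the locations of the two TE_Sequence_Ontology.txt copies are not given in the thread, so they must be supplied by the reader.

```python
import difflib
import hashlib
import sys


def sha256_of(path):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_text_files(path_a, path_b):
    """Return (identical, unified_diff_lines) for two text files."""
    if sha256_of(path_a) == sha256_of(path_b):
        return True, []
    with open(path_a, encoding="utf-8", errors="replace") as a, \
         open(path_b, encoding="utf-8", errors="replace") as b:
        diff = difflib.unified_diff(
            a.readlines(), b.readlines(), fromfile=path_a, tofile=path_b
        )
        return False, list(diff)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: pass the two file paths as arguments.
    same, diff_lines = compare_text_files(sys.argv[1], sys.argv[2])
    if same:
        print("The two files are byte-for-byte identical.")
    else:
        sys.stdout.writelines(diff_lines)
```

Lines starting with `+` in the output exist only in the second file and lines starting with `-` only in the first, which shows which copy contains entries the other lacks.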

oushujun commented 3 months ago

The EDTA version should be fine. You need to update EDTA to the latest version.

Shujun