Do I need to “train” one by one?

nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline

http://funannotate.readthedocs.io

BSD 2-Clause "Simplified" License


Do I need to “train” one by one? #901

Open maruiqi0710 opened 1 year ago

maruiqi0710 commented 1 year ago

If I need to annotate many assemblies of the same species, do I need to train on them one by one, or can I reuse previous training results? In other words, if I run "funannotate species -s new_species_name -a new_species_name.parameters.json" after the predict step and then want to annotate another strain of the same species, do I need to run "funannotate train" with RNA data again? P.S. The RNA data all come from NCBI. They are from the same species as the strain I am annotating, but from a different strain.

hyphaltip commented 1 year ago

Yes, you can reuse it. We did this for a pangenome project. I found it helpful to still include the RNA alignments in the predict step to further support exons, but you don't need to do the full training step. After training a good model on a single strain, I ended up providing the Trinity transcripts as input to the rest of the training steps.
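A minimal sketch of what that reuse could look like for one additional strain. It only prints the command (a dry run) so it can be checked before anything is launched. The file names, the `annot_` output prefix, the helper name `predict_cmd`, and the CPU count are hypothetical placeholders; `XXX-first_strain` is the model name used later in this thread, and it is assumed that the model has already been registered with `funannotate species`.

```shell
#!/bin/sh
# Dry-run sketch: build a `funannotate predict` call for a second strain that
# reuses the model trained on the first strain and passes Trinity transcripts
# as extra evidence. Nothing is executed; the command is only printed.

predict_cmd() {
    # $1 genome FASTA, $2 strain label, $3 trained model name, $4 transcripts FASTA
    echo "funannotate predict -i $1 -o annot_$2 --species XXX --strain $2" \
         "--augustus_species $3 --transcript_evidence $4 --cpus 12"
}

# Hypothetical inputs for the second strain.
predict_cmd second_strain.fa second_strain XXX-first_strain trinity.fasta
```

Once the printed command looks right, it can be run by piping the output to `sh`. RNA-seq alignments, if available, can be supplied to predict as additional evidence in the same way.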

maruiqi0710 commented 1 year ago


Thanks for your reply. To annotate the second strain, I run clean -> sort -> mask and then the following commands:

funannotate train \
  -i second_strain.fa -o fun \
  --species "XXX" \
  --strain second_strain \
  --left NCBI_RNA1.fq.gz \
  --right NCBI_RNA2.fq.gz \
  --stranded RF --jaccard_clip \
  --cpus 12

funannotate predict \
  -i second_strain.fa -o output_folder \
  --species "XXX" \
  --strain second_strain \
  --augustus_species XXX-first_strain

Will the time to run prediction be shortened?

hyphaltip commented 1 year ago

You can just try it and see. You don't need to run train on the second strain if you don't want to: you would only run predict, giving it the model from the first trained strain as the training input. If you want to run CodingQuarry, though, you would still provide RNA-seq as input to predict. I am not totally sure where or how much time you save in the end, so you'll need to test it empirically.

nextgenusfs commented 1 year ago

It will run faster if pre-existing training sets exist, but really only because BUSCO isn't run to generate a training set. If they are isolates of the same species, then for sure I would train on the best genome and then use that training set to predict the others. If they are not the same species, then the guidance on what to do is less obvious.
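As a rough illustration of the "train on the best genome, then predict the others" approach, the sketch below prints one predict command per remaining isolate, all pointing at the same trained model. The strain list, the `annot_` output prefix, and the helper name `plan_predictions` are hypothetical; the model name `XXX-first_strain` is taken from the commands earlier in the thread and is assumed to be registered already.

```shell
#!/bin/sh
# Dry-run sketch: plan `funannotate predict` runs for several isolates of the
# same species, all reusing the model trained on the best genome.

# Hypothetical strain list; each entry is "<genome FASTA>:<strain label>".
STRAINS="strainB.fa:strainB strainC.fa:strainC"
MODEL="XXX-first_strain"   # model trained once on the best genome

plan_predictions() {
    for entry in $STRAINS; do
        genome=${entry%%:*}   # text before the colon
        strain=${entry##*:}   # text after the colon
        echo "funannotate predict -i $genome -o annot_$strain --species XXX --strain $strain --augustus_species $MODEL --cpus 12"
    done
}

# Print the planned commands; review them before running any.
plan_predictions
```

Keeping a separate output folder per strain keeps the runs from overwriting each other.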