nigyta / dfast_core

DDBJ Fast Annotation and Submission Tool

Pseudogenes #34

Open vappiah opened 3 years ago

vappiah commented 3 years ago

Does DFAST identify pseudogenes?

nigyta commented 3 years ago

Yes. It detects internal stop codons and frameshifts. Please see the note tags in the GFF or GenBank output.

/note="Partial hit; WP_003643223.1 gluconate permease
      (Lactobacillus plantarum WCFS1) [pid:71.3%, q_cov:100.0%,
      s_cov:44.8%, Eval:6.5e-78]"
/note="frameshifted; deletion at around 14464"
vappiah commented 3 years ago

Thanks. I have a bacterial sequence (M. ulcerans) which I am annotating. A pseudogene was detected:

_/note="PseudoGeneDetection:WP013225067.1 formate

Is there any way for DFAST to report how many pseudogenes are present?

PS. Sequencing was done with Oxford Nanopore

nigyta commented 3 years ago

Please check the log file (application.log) for the number of pseudogenes. It contains lines like the following:

2020/11/06 21:04:09 6 CDS features were marked as possible pseudo due to internal stop codons.
2020/11/06 21:04:09 25 CDS features were marked as possible pseudo due to frameshift.
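
The two counts in these log lines can be tallied with a short script. The following is a minimal sketch, not part of DFAST: the regular expression is modeled only on the two lines quoted above, so other DFAST versions may word these messages differently, and the function name `count_pseudo` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the two log lines quoted in this thread (assumption:
# other DFAST versions may phrase these messages differently).
PSEUDO_LINE = re.compile(
    r"(\d+) CDS features were marked as possible pseudo due to (.+?)\.?\s*$"
)


def count_pseudo(log_lines):
    """Return a Counter mapping each reported cause to its CDS count."""
    counts = Counter()
    for line in log_lines:
        match = PSEUDO_LINE.search(line)
        if match:
            counts[match.group(2)] += int(match.group(1))
    return counts


# The two lines quoted above, used as sample input.
example = [
    "2020/11/06 21:04:09 6 CDS features were marked as possible pseudo "
    "due to internal stop codons.",
    "2020/11/06 21:04:09 25 CDS features were marked as possible pseudo "
    "due to frameshift.",
]
counts = count_pseudo(example)
for cause, number in counts.items():
    print(f"{cause}\t{number}")
```

To run it on a real log, pass an open file instead of the list, for example `count_pseudo(open("application.log"))`. The two counts should not simply be added to get a total, since one CDS might be reported under both causes.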

I think it is helpful for finding indel errors in Nanopore assemblies. If an annotated genome from the same species is available in gbk format, it is recommended to provide it as a reference using the --reference option. For example:

dfast -g query.genome.fna --reference ref.genome.gbk
ireneortega commented 3 years ago

Is there a way to extract pseudogenes from the DFAST annotation into a single file containing all of them? I'm interested in identifying pseudogene locations and their putative products/functions.

nigyta commented 3 years ago

@ireneortega Well, currently there's no straightforward way to do that. I would write a script to parse the .gbk file and extract those locations.
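
A script of the kind suggested here could look like the sketch below. It is not DFAST code and makes several assumptions: it uses only the standard library and a simplified reading of the GenBank feature-table layout (feature key in columns 6-21, qualifiers from column 22), and it treats a CDS as a pseudogene candidate when one of its /note values contains a keyword. The keyword "frameshifted" comes from the note shown earlier in this thread; "internal stop codon" is a guess based on the log wording and may need adjusting. The function names are hypothetical.

```python
import sys

# "frameshifted" appears in the /note quoted in this thread; the second
# keyword is an assumption and may not match DFAST's actual wording.
KEYWORDS = ("frameshifted", "internal stop codon")


def iter_features(gbk_text):
    """Yield one dict per feature found in GenBank flat-file text."""
    locus, feature, last = None, None, None
    in_features = False
    for line in gbk_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("LOCUS"):
            locus = line.split()[1]
        elif line.startswith("FEATURES"):
            in_features = True
        elif in_features and not line.startswith(" "):
            # A non-indented line (ORIGIN, CONTIG, //) ends the table.
            if feature:
                yield feature
            feature, last, in_features = None, None, False
        elif in_features:
            key, body = line[5:21].strip(), line[21:].strip()
            if key:  # a new feature starts on this line
                if feature:
                    yield feature
                feature = {"locus": locus, "type": key,
                           "location": body, "qualifiers": {}}
                last = None
            elif feature is None:
                continue
            elif body.startswith("/"):  # a new qualifier
                name, _, value = body[1:].partition("=")
                feature["qualifiers"].setdefault(name, []).append(
                    value.strip('"'))
                last = name
            elif last is None:  # location wrapped onto another line
                feature["location"] += body
            else:  # qualifier value wrapped onto another line
                values = feature["qualifiers"][last]
                values[-1] = (values[-1] + " " + body).strip('"')
    if feature:
        yield feature


def pseudo_candidates(gbk_text, keywords=KEYWORDS):
    """Return (locus, location, locus_tag, product, notes) for flagged CDSs."""
    rows = []
    for feat in iter_features(gbk_text):
        if feat["type"] != "CDS":
            continue
        quals = feat["qualifiers"]
        hits = [n for n in quals.get("note", [])
                if any(k in n for k in keywords)]
        if hits:
            rows.append((feat["locus"], feat["location"],
                         quals.get("locus_tag", [""])[0],
                         quals.get("product", [""])[0],
                         "; ".join(hits)))
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for row in pseudo_candidates(handle.read()):
            print("\t".join(row))
```

Saved under a name of your choice, it would be run with the path to the .gbk file as its only argument and prints one tab-separated line per flagged CDS. A dedicated parser such as Biopython's would be more robust to unusual records than this hand-rolled reader.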

nick-youngblut commented 2 years ago

It would definitely be helpful to have a standard table output instead of:

>Feature sequence001
3285    4730    gene
                        locus_tag       LOCUS_00010
3285    4730    CDS
                        product membrane protein
                        inference       COORDINATES:ab initio prediction:MetaGeneAnnotator
                        inference       similar to AA sequence:RefSeq:WP_008760507.1
                        protein_id      gnl|my_center|LOCUS_00010
4743    5585    gene
                        locus_tag       LOCUS_00020
4743    5585    CDS
                        product hypothetical protein
                        inference       COORDINATES:ab initio prediction:MetaGeneAnnotator
                        protein_id      gnl|my_center|LOCUS_00020
5516    5579    assembly_gap
                        estimated_length        known
                        gap_type        within scaffold
                        linkage_evidence        paired-ends
5674    5922    gene
                        locus_tag       LOCUS_00030
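
Until a table output exists, a feature table like the one above can be flattened into one row per feature with a few lines of Python. This is only a sketch under assumptions: it splits on any whitespace (so it accepts both the tab-delimited file and the space-rendered text above), joins repeated qualifiers with "; ", and the function name `flatten_feature_table` is invented for illustration. A qualifier named "seq_id", "location" or "type" would clash with the fixed columns.

```python
import csv
import sys


def flatten_feature_table(lines):
    """Return one dict per feature from a 5-column feature table."""
    rows, seq_id, row = [], None, None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith(">Feature"):
            seq_id = line.split()[1]
            row = None
        elif line[0].isspace():
            # Qualifier line: a name followed by an optional value.
            if row is None:
                continue
            parts = line.split(None, 1)
            name = parts[0]
            value = parts[1] if len(parts) > 1 else "true"
            row[name] = f"{row[name]}; {value}" if name in row else value
        else:
            parts = line.split()
            if len(parts) >= 3:  # start, end, feature key
                row = {"seq_id": seq_id,
                       "location": f"{parts[0]}..{parts[1]}",
                       "type": parts[2]}
                rows.append(row)
            elif len(parts) == 2 and row is not None:
                # Extra interval of a multi-interval feature.
                row["location"] += f",{parts[0]}..{parts[1]}"
    return rows


# First two features of the table quoted above, tab-delimited.
example = (
    ">Feature sequence001\n"
    "3285\t4730\tgene\n"
    "\t\t\tlocus_tag\tLOCUS_00010\n"
    "3285\t4730\tCDS\n"
    "\t\t\tproduct\tmembrane protein\n"
    "\t\t\tinference\tCOORDINATES:ab initio prediction:MetaGeneAnnotator\n"
    "\t\t\tinference\tsimilar to AA sequence:RefSeq:WP_008760507.1\n"
    "\t\t\tprotein_id\tgnl|my_center|LOCUS_00010\n"
).splitlines()

rows = flatten_feature_table(example)
fields = ["seq_id", "location", "type"] + sorted(
    {key for r in rows for key in r} - {"seq_id", "location", "type"})
writer = csv.DictWriter(sys.stdout, fieldnames=fields, delimiter="\t",
                        restval="")
writer.writeheader()
writer.writerows(rows)
```

Because the gene and CDS lines are separate features, locus_tag and product end up on different rows; grouping rows by location would be one way to merge them.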
nigyta commented 2 years ago

@nick-youngblut @ireneortega @vappiah As of ver. 1.2.15, DFAST generates a summary file for pseudogenes. Here is an example: https://dfast.ddbj.nig.ac.jp/analysis/download/54a8ce11-3b1c-4ccb-8644-b316c7c60bf9/pseudogene_summary.tsv (not a permanent link; it will be deleted in a month)

Please have a look.

ireneortega commented 2 years ago

@nigyta OK, the output is good, as the headings are perfect, but I am confused by the first result. I understand there was a deletion before the CDS (17027..17800 (-)) that caused the protein to be translated further and a new stop codon to appear, which made the protein shorter. Please correct me if I am wrong.

I would suggest a few additions, if possible:

- The number of nucleotides inserted or deleted, in brackets, in the "insertion" and "deletion" columns.
- The length of the pseudogene and the length of the protein given in "ref_id", to compare the wild type sequence with the truncated one.

Otherwise, the output is enough for me. Good job and many thanks!

nigyta commented 2 years ago

Your understanding is correct. In this case, the frameshift mutation and the stop codon mutation seem to have occurred independently. DFAST reports all possible mutations that may cause a frameshift or an in-frame stop codon, so both results appear in the file.

As for the suggestions, I agree that they would be very informative, but it is difficult to count the precise number of inserted or deleted nucleotides in the current implementation. This is because DFAST finds insertions and deletions by aligning the translated nucleotide sequence to reference amino acid sequences. For example, a 2-nucleotide insertion can be called as a 1-nucleotide deletion by inserting 1 amino acid residue into the reference sequence, and the call may also depend on how close the reference sequence is to the query. Adding the lengths of the query and reference sequences is easy. However, DFAST does not use a 'wild type' in the strict sense; it uses a reference sequence from a close relative. If that is okay, I will add the lengths to the output. Incidentally, the alignment coverage against the query and reference lengths is already included in the file.
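
One way to see the ambiguity in the 2-nucleotide example: what a protein-level alignment can observe is the reading-frame offset, and that offset depends only on the net length change modulo 3. The sketch below is plain arithmetic, not DFAST code, and the function name `frame_shift` is made up for illustration.

```python
from collections import defaultdict


def frame_shift(net_inserted_nt):
    """Reading-frame offset (0, 1 or 2) for a net insertion (+) or deletion (-)."""
    return net_inserted_nt % 3


# Group a few net length changes by the frame offset they produce.
by_shift = defaultdict(list)
for change in (-2, -1, 1, 2, 3):
    by_shift[frame_shift(change)].append(change)

for shift, changes in sorted(by_shift.items()):
    print(f"frame offset {shift}: net changes {changes}")
```

A 2-nucleotide insertion and a 1-nucleotide deletion fall into the same group (offset 2), as do a 1-nucleotide insertion and a 2-nucleotide deletion (offset 1), so the frame alone cannot tell them apart. Which one gets reported then depends on where the aligner places amino acid gaps.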

Also, in a future version, I should perhaps merge the insertion and deletion columns of the summary file into a single column, since it is difficult to distinguish insertions from deletions for the reason described above. I noticed this while considering your suggestion. Thank you for your feedback.

ireneortega commented 2 years ago

By "wild type" I meant the "reference" sequence; sorry for the misunderstanding. As you said, coverage is already included, so there is no need to add the lengths. It's up to you.

I understand your point of view. Knowing whether a deletion or an insertion occurred is difficult, so merging the insertion and deletion columns into a single one called indels may be a better idea.

But I still believe that if you state that reference amino acid sequences are what is used to annotate genomes, that is enough to call indels, regardless of the evolutionary distance to the query. People can then decide whether that information is useful, but the result will always be correct. You would encounter the same problem when annotating queries that are not close to the reference sequence. Anyway, I don't believe the aim of DFAST is to identify point mutations. You could just give the cause of the pseudogene in a single column (frameshift, indel, stop codon) with the bases affected, as you already did.

Just suggestions. Come on!