Open junchen-deng opened 1 year ago
Hi Junchen,
Yes, it's also my understanding that Exonerate can't detect incomplete introns at the beginning or end of target contigs, as it's searching for 5' and 3' splice sites (I'm not sure if the intron model is more sophisticated than that, and searches for other intron motifs as well).
Can you upload the target file query sequence for Hemiptera3-395at7524
, along with the 395at7524_contigs.fasta
file from the gene directory? I'd like to do some testing to check for possible causes of your issue. At this stage it seems possible to me that it could (at least in some cases) be caused by incomplete introns, perhaps in combination with incorrect extension of the Exonerate alignment due to the use of the Exonerate option -refine
(as used by HybPiper). I might need to add a --no_refine
flag to hybpiper assemble
, and/or an option to ignore Exonerate target hits that contain stop codons (it'll have to be optional, I guess, as HybPiper will need to accommodate organisms that use alternative genetic codes where a canonical stop is translated as a standard amino acid).
Cheers,
Chris
Hi Chris,
Thanks for the reply! Here are the files you asked for.
Cheers, Junchen query.txt 395at7524_contigs.txt
Hi Junchen,
Thanks for those files. The 395at7524_contigs.txt
file doesn't contain the sequence NODE_1_length_2012_cov_8.473173
, though. Is that from a different sample? If so, can you send me the *_contigs.txt
file for that sample too?
With the first example you provided above (NODE_15_length_371_cov_2.302721
), it does indeed appear that the stop codons at the beginning of the alignment originate from incomplete intron sequence in the SPAdes assembly. Using the Hemiptera3-395at7524
protein as a query, I ran a tBLASTn search against the NCBI nucleotide database - the top hit was XM_039430175
, PREDICTED: Nilaparvata lugens mediator of RNA polymerase II transcription subunit 23 (LOC111051234), mRNA
.
I downloaded the corresponding region of the annotated genome (https://www.ncbi.nlm.nih.gov/nuccore/NC_052509.1?report=genbank&from=14296457&to=14358256&strand=true), and aligned it with the 'good' region of the Exonerate hit for NODE_15_length_371_cov_2.302721
, (i.e., the region immediately after the second stop codon, onward). Looking at the annotated genomic sequence, the CDS of this particular gene appears to be 4,170 bases, and the gene comprises 23 exons across 61.8 kb. As you can see in the alignments below (exons are annotated in yellow, Nilaparvata lugens genomic sequence top, NODE_15
bottom), the 'good' region of the NODE_15_length_371_cov_2.302721
Exonerate alignment corresponds to a single exon.
Wide view:
Zoomed view:
As mentioned, HybPiper 2 runs Exonerate with the flag --refine
by default, as this can produce much longer, correct alignments in some cases. However, in this case removing the --refine
flag did prevent Exonerate from including the incomplete intron sequence in the alignment, i.e., the NODE_15_length_371_cov_2.302721
hit now looks like this:
exonerate -m protein2dna --showalignment yes --showvulgar no -V 0 query.txt NODE_15.fasta --refine n
one
C4 Alignment:
------------
Query: Hemiptera3-395at7524
Target: NODE_15_length_371_cov_2.302721
Model: protein2dna:local
Raw score: 178
Query range: 953 -> 988
Target range: 174 -> 279
954 : ValPheAspIleValIleHisArgTyrLeuGluIleProProValThrLysSerLeuGluThrL : 975
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ValPheAspIleValIleHisArgTyrLeuGluIleProProValThrLysSerLeuGluThrL
175 : GTGTTCGACATAGTTATACACAGGTACTTGGAGATACCACCAGTCACAAAATCACTGGAGACAT : 238
976 : euLeuGluHisLeuGlyCysLeuTyrLysPheHisAspArg : 988
|||||||||||||||||||||||||||||||||||! !!:!
euLeuGluHisLeuGlyCysLeuTyrLysPheHisGlyLys
239 : TACTGGAACATTTAGGCTGTCTCTATAAATTTCATGGTAAG : 279
I'll have to have a think about the best way to deal with this issue. An uncomplicated approach might be to add a few flags to hybpiper assemble
- something like --no_exonerate_refine
and --no_exonerate_hit_stop_codons
. Ideally, though, I think it'd be good to parse the Exonerate alignment and apply a few thresholds and simple heuristics to pull out the 'good' regions from an alignment that's been produced using the --refine
flag.
I'll wait to receive your other *_contigs.txt
file, so I can investigate your second example properly. In the meantime, I aligned the Hemiptera3-395at7524
protein sequence against a translation of the CDS from the annotated Nilaparvata lugens genome mentioned above. As you can see in the alignment below, the two sequences align well overall, but there's a dubious region around the location of the putative intron in your Exonerate alignment. I'm wondering if there's something amiss with the Hemiptera3-395at7524
sequence, and if this has resulted in Exonerate failing to correctly identify the intron boundaries in SPAdes contig NODE_1_length_2012_cov_8.473173
. Perhaps the intron actually starts a few codons upstream, after the ...AlaLeuHisAspLys
region, which is the exon/intron boundary in the Nilaparvata lugens gene as shown below? Can you tell me how your target file was produced?
Wide view:
Zoomed view:
Cheers,
Chris
Hi Chris,
Thanks for all the investigations!
Sorry for the confusion from NODE_1_length_2012_cov_8.473173
. Yes, it's from a different sample. Here is the contig file:
395at7524_contigs.sample2.txt
As you can see, we are working with planthoppers (Hemiptera: Fulgoromorpha). Nilaparvata lugens is one of the references we used. In the beginning, we tried to extract single-copy BUSCO genes directly from the metagenomic assembly by running BUSCO with the database hemiptera_odb10
. However, this method only produced ~400 complete single-copy genes out of 2500. We assumed that it was because of the low sequencing depth (<10) for the host genome. The host contigs are too short that the complete BUSCO markers cannot be recovered.
Therefore, we turned to Hybpiper. This is how we built our target file: 1) we downloaded the complete genomes of a few close refs of planthoppers and run BUSCO on them with the database '''hemiptera_odb10''' to extract single-copy genes. Here are some refs we used:
Acyrthosiphon pisum ---- Hemiptera1 (https://www.ncbi.nlm.nih.gov/genome/448)
Homalodisca vitripennis ---- Hemiptera2 (https://www.ncbi.nlm.nih.gov/genome/13454)
Nilaparvata lugens ---- Hemiptera3 (https://www.ncbi.nlm.nih.gov/genome/2941; this is exactly the genome you found! )
Zanna intricata ---- Hemiptera4 (https://www.ncbi.nlm.nih.gov/genome/87039)
Then, 2) we combined them with busco_ancestral
(Hemiptera0) to produce our target file. And we asked hybpiper to assemble the markers in this target file with metagenomic reads. In this way, we are able to assemble most markers with a low spades coverage cutoff.
It's very interesting to see that Hemiptera3-395at7524
protein, which was extracted from the same Nilaparvata lugens genome, is different in some regions. My intepretation is that BUSCO decided the exon/intron boundary differently, which causes some intron regions to be retained in the final busco single-copy genes. When we use these genes as targets for hybpiper, they cause more troubles during exonerate step.
It's also interesting to see that the putative intron region was trimmed after removing --refine
flag. But I am not sure how this action will affect the overal results as you said that --refine by default can produce much longer, correct alignments in some cases
.
I really want to hear more thoughts from you. We are working with phylogenomics. Although a few untrimmed introns in our final alignment may not affect the final phylogeny, they still create erroneous alignment for a few samples in these regions of many genes. It would be nice if we can find a solution to this so that we can have a clean alignment to work with at the end.
Sincerely, Junchen
Hi Junchen,
I realised I made a mistake when testing the Exonerate --refine
flag with your data, and that I was using the protein2dna
model rather than the protein2genome
model. When using the protein2dna
model, Exonerate behaves as described above (i.e., the stop-codon-containing intron sequence isn't recovered in the Exonerate hit when --refine
isn't used), but the protein2genome
model recovers the intron sequence in the alignment regardless of whether --refine
is used or not.
So, I'm in the process of adding an additional filter to HybPiper that processes the Exonerate alignments and trims hit boundaries as necessary. Hopefully it'll deal with your issue. It should be ready later this week, but I'll let you know when I've pushed the update.
Cheers,
Chris
Hi Chris,
Thanks very much!
My understanding is that hybpiper uses protein2genome
model by default in Exonerate for the detection of introns. So basically the flag --refine
cannot help with this issue. I am looking forward to the additional filter you are working on!
Cheers, Junchen
Hi Junchen,
I've just pushed HybPiper version 2.1.2 and updated the conda packages - see the changelog here.
It includes a new filter for Exonerate alignments. The filter doesn't aim to remove stop codons explicitly, but rather it should trim any poorly aligned 5' and 3' ends from an Exonerate alignment before the corresponding SPAdes contig hit sequence is potentially incorporated into an output *.FNA
sequence. In my testing, the filter correctly removed the dodgy end of your alignment for NODE_15_length_371_cov_2.302721
.
In addition, HybPiper version 2.1.2 now runs a check for any 'internal' (i.e., non-terminal) stop codons in all final output *.FNA
sequences - it provides a warning if any are found, and writes the corresponding gene names to a text file in the sample directory called <prefix>_genes_with_non_terminal_stop_codons.txt
. I've added a section to the wiki that discusses this in more detail, see here.
I suspect this is something that'll need to be further developed for future versions of HybPiper, but hopefully the current fix deals with some of the issues you saw. Let me know how you go, and if you have any ideas about how this might be done better, please don't hesitate to let me know!
Cheers,
Chris
Hi Chris,
Thanks for the updates! I re-run my samples, and the final alignment looks better!
I did some investigations on one sample and here are a few things that I am curious about: 1) In some cases, the bad-quality regions at the beginning/end were successfully trimmed, but new 'correct' sequences were extracted to fill the gap. This is nice because the tool not only trims the wrong region but also tries to fill the gap by looking for the potentially correct region from other contigs. But if the correct region has already been assembled, why the old version would take the bad-quality region as the final output?
Here is an example from the sample OECSP2: The AA alignment looks like this ('new' is the output from v2.1.2):
OECSP2-old PTTPLSVDVSTIFQVHSKM
|||||| |||||
OECSP2-new PTTPLSMSILDSLTVHSKM
The problematic contig is the following and was taken by the previous version as the final output:
But the correct contig also exists and was taken by the new version after trimming the potential intron region:
2) In some cases, the incomplete intron regions are located in the middle of the contig, which failed to be detected by Exonerate
and trimmed by the new filter.
Here is an example from the same sample:
The poorly aligned regions in the last three lines are potential intron regions. In this case, the query Hemiptera3-395at7524
was perfectly extracted by BUSCO and contains only exons (I compared it with the NCBI record). And the poorly aligned region starts right at the border of the exon/intron, indicating that it's part of the intron. But somehow, the intron is incomplete, and the assembler captures part of the next exon, making the incomplete intron region in the middle.
I am not sure how these internal incomplete introns could be detected. One way could be that the script will detect regions with the reduced alignment score and then decide if this region is an incomplete intron by its size or its sequences at two ends (to my knowledge, introns are often characterized by 'AG---------GT').
Thanks again for the quick updates. I am looking forward to hearing more thoughts from you!
If you want to look at this sample (OECSP2), here is the contig file.
395at7524_contigs_OECSP2.txt
The query file Hemiptera3-395at7524
is the same as the previous one.
Cheers, Junchen
Hello,
Recently we have been using hybpiper to assemble thousands of BUSCO single-copy genes from metagenomic reads (2x150bp). The target file is amino acids, and we set the coverage cutoff for spades to 4 since the sequencing depth is low.
We notice that there are stop codons in many assembled markers, which is not ideal considering that we used amino acid targets. Then, we looked for some stop codons in the output file exonerate_results.fasta. Here are some examples:
It seems that many of them appear at the beginning or the end of an alignment, although some of them can follow right after the annotated intron. In the wiki (https://github.com/mossmatters/HybPiper/wiki/Introns), it said:
My understanding is that exonerate cannot detect incomplete introns at the beginning or the end of a contig. Am I correct? Do you think this could be the reason why we get stop codons in the final assembly?
If this is the issue, do you have any suggestions on how to remove these intron leftovers? Thanks! We are going to use these markers for phylogenomics. But the region around the stop codon often does not align well with the rest of the sequences.
Sincerely, Junchen