Genome guided assembly, too many transcripts, unaffected by genome_guided_min_coverage

EarlyEvol commented 5 years ago

Brian,

I have assembled three wasp genomes and would like to update an old de novo Trinity assembly using the genome-guided method. With default settings, two of the GG assemblies produce >211K transcripts while the third produces just 37K. Looking at RNA-seq coverage across the genome assemblies, there appears to be about 1x DNA contamination. To try to mitigate this problem I used the --genome_guided_min_coverage option and raised it to 3. Since I am using these transcripts to create an Augustus training set with PASA, I figured missing lowly expressed transcripts wasn't nearly as bad as incorporating a bunch of random genome chunks that get assembled. Surprisingly, 211,648 of the 211,649 original transcripts were still assembled.

The wasp genomes are pretty repetitive, so perhaps low-coverage DNA contamination in the RNA-seq could lead to tons of multimappers, making local coverage appear higher than true coverage, so that these read clusters still get assembled? If Trinity does use multimappers, should I pre-filter those from the BAM by taking just one alignment for each read?
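For readers wondering what "taking just one alignment for each read" could look like, here is a minimal sketch in plain Python that works on SAM text lines. It is not something Trinity provides; the function name `keep_one_alignment_per_read` and the `unique_only` switch are made up for illustration. It assumes the aligner marks extra hits with the standard secondary (0x100) and supplementary (0x800) flag bits and, for the stricter mode, writes an `NH:i:` tag (STAR does this with its standard attributes).

```python
SECONDARY = 0x100      # SAM flag: not the primary alignment
SUPPLEMENTARY = 0x800  # SAM flag: supplementary (chimeric) alignment
UNMAPPED = 0x4         # SAM flag: read is unmapped


def _nh_value(fields):
    """Return the NH (number of reported hits) tag value, or None if absent."""
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None


def keep_one_alignment_per_read(sam_lines, unique_only=False):
    """Yield header lines plus primary alignments only.

    With unique_only=True (hypothetical option), reads whose NH tag is
    greater than 1 are dropped entirely instead of keeping their primary hit.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # pass header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY | UNMAPPED):
            continue
        if unique_only:
            nh = _nh_value(fields)
            if nh is not None and nh > 1:
                continue
        yield line


# Tiny made-up example: r1 maps once, r2 maps twice (one primary, one secondary).
example = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\n",
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2\n",
    "r2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2\n",
]
primary_only = list(keep_one_alignment_per_read(example))
unique = list(keep_one_alignment_per_read(example, unique_only=True))
```

On a real BAM the default mode should be roughly what `samtools view -F 0x904` does; the Python version is only meant to make the logic explicit.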

Any other strategies for dealing with DNA contamination in RNA-seq data would be appreciated.

Thanks, Earl

EarlyEvol commented 5 years ago

Oops, forgot to mention I'm running Trinity-v2.6.6 built with conda

brianjohnhaas commented 5 years ago

Hi,

You might get away with cranking --min_kmer_cov up from the default (1). There's a danger that this will fragment more of the lowly expressed transcripts, but it should rid you of the low coverage from DNA contamination. It's worth a try. You might run it at --min_kmer_cov 3 and see how it goes.
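A minimal sketch of what such a run could look like, written as a small Python helper that only assembles the argument list (it does not launch Trinity). The helper name `trinity_gg_command`, the BAM file name, the intron size, and the CPU, memory, and output values are all placeholders, not values from this thread; only `--min_kmer_cov 3` comes from the suggestion above.

```python
import shlex


def trinity_gg_command(bam, max_intron, min_kmer_cov=3, cpu=8,
                       max_memory="50G", output="trinity_gg_out"):
    """Build a genome-guided Trinity command line as a list of arguments.

    All defaults here are illustrative placeholders; adjust to your data.
    """
    return [
        "Trinity",
        "--genome_guided_bam", bam,                    # coordinate-sorted BAM
        "--genome_guided_max_intron", str(max_intron),
        "--min_kmer_cov", str(min_kmer_cov),           # raised from the default of 1
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", output,                            # Trinity wants "trinity" in this name
    ]


# Hypothetical inputs, for illustration only.
cmd = trinity_gg_command("rnaseq.coordSorted.bam", max_intron=10000)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the paths are real.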

best,

~b


brianjohnhaas commented 5 years ago

Also, I wouldn't worry about the multimappers unless the aligner you're using is outputting massive numbers of hits (not typical).

best,

~b


EarlyEvol commented 5 years ago

Thanks for the speedy reply, that makes sense. I think I want to minimize false positives, but a fragmented transcriptome also sounds bad for ab initio prediction training. I'm still not entirely sure whether false positives or fragmented genes are worse for creating training sets with PASA.

It does look like STAR is outputting lots of alignments in some repetitive regions, but filtering these only reduced the transcript number by about 15%, so they were not the main culprit.

--min_kmer_cov 3 output a more reasonable number of transcripts: 211k down to 50k. I think I will use this set of transcripts for the first pass at annotation, but will ask the PASA developers what their take is.
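For comparing runs like this, a small sketch of counting transcripts in a Trinity FASTA and expressing the drop as a percentage. The function names are made up for illustration, and the 50,000 figure is just the rounded "50k" quoted above, not an exact count.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def percent_reduction(before, after):
    """Percentage by which the transcript count dropped between two runs."""
    return 100.0 * (before - after) / before


# Counting on an in-memory example; for a real file, pass an open file handle.
toy_fasta = [">t1\n", "ACGT\n", ">t2\n", "GGCC\n", "TTAA\n"]
n_records = count_fasta_records(toy_fasta)

# Counts taken from this thread: 211,649 transcripts by default, roughly 50k at k-mer cov 3.
drop = percent_reduction(211_649, 50_000)
print(f"{n_records} records in toy example; reduction of about {drop:.0f}%")
```

With the numbers in this thread the reduction works out to roughly three quarters of the original transcripts.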

Thanks a bunch! Earl

EarlyEvol commented 5 years ago

Just saw you are a developer of PASA, HA. Should have looked first.
Do you have a strong feeling one way or another on whether breaking up transcripts with min_kmer_cov is worse than having a bunch of erroneous transcripts from DNA contamination when generating an Augustus training set? It seems that seqclean would get rid of anything that doesn't look coding, is that right? In that case maybe I want to capture all the transcripts I can (and keep from breaking them up), then rely on seqclean to remove all the background stuff? I'll see real quick if seqclean removes a bunch of those extra transcripts right now.

brianjohnhaas commented 5 years ago

writing code is my favorite :-)


EarlyEvol commented 5 years ago

Well, clearly I didn't understand what seqclean was doing, since it didn't remove any of that stuff. Will PASA deal with this genomic junk internally? I'm guessing yes because it uses TransDecoder to find ORFs.
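Whether PASA screens such transcripts internally is not settled in this thread. As a rough illustration of the general idea of an ORF-length screen, here is a minimal sketch that finds the longest complete ORF (ATG to stop) over six frames. This is not TransDecoder's algorithm, which also scores coding potential; the function names and the 100-codon cutoff are arbitrary choices for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only; other letters pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_codons(seq):
    """Length in codons (stop excluded) of the longest ATG..stop ORF in six frames."""
    best = 0
    for strand in (seq.upper(), reverse_complement(seq).upper()):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def passes_orf_screen(seq, min_codons=100):
    """Hypothetical screen: keep a transcript only if it has a long enough ORF."""
    return longest_orf_codons(seq) >= min_codons


# Toy sequence: ATG AAA TTT TAG is one complete ORF of 3 codons.
toy = "ATGAAATTTTAG"
toy_len = longest_orf_codons(toy)
```

A screen like this would miss partial ORFs at transcript ends and would also let through any random genomic chunk that happens to contain a long open frame, so it is only a first-pass illustration.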

brianjohnhaas commented 5 years ago

seqclean is just for removing polyA and annotating polyA sites. Did cranking up --min_kmer_cov help with the low amount of genomic contamination?


EarlyEvol commented 5 years ago

Yeah, min_kmer_cov got the assembly down to a reasonable number.

trinityrnaseq / trinityrnaseq

Genome guided assembly, too many transcripts, unaffected by genome_guided_min_coverage #606
