trinityrnaseq / trinityrnaseq

Trinity RNA-Seq de novo transcriptome assembly
BSD 3-Clause "New" or "Revised" License

Repeated error: failed_butterfly_commands #1092

Open AlexGaithuma opened 2 years ago

AlexGaithuma commented 2 years ago

I have rerun Trinity several times after it failed to finish, but it seems to be stuck.

I am running it with singularity exec -e /software/containers/trinityrnaseq.v2.13.2.simg Trinity, and this is the error I get on stdout every time.

I can't complete the assembly. This happens with many files as well. Please help!

I even tried running the command with this CBin.fasta file on an HPRC cluster, with no success:

singularity exec -e /software/containers/trinityrnaseq.v2.13.2.simg Trinity  \
--single "/data1/IxoSca/Tickrnaseq/trinity_Bb-MG/read_partitions/Fb_0/CBin_0/c18.trinity.reads.fa"  \
--output "/data1/IxoSca/Tickrnaseq/trinity_Bb-MG/read_partitions/Fb_0/CBin_0/c18.trinity.reads.fa.out"  \
--CPU 1 --max_memory 2G --run_as_paired --seqType fa --trinity_complete --min_contig_length 12 --bflyCPU 4 --bflyGCThreads 4 --no_salmon \
1>/data1/IxoSca/Tickrnaseq/trinity_Bb-MG/read_partitions/Fb_0/CBin_0/c18.trinity.reads.fa.out.log

The error message is:

  We are sorry, commands in file: [failed_butterfly_commands.44832.txt] failed.  :-( 
  Error encountered::  <!----
  CMD: /usr/local/bin/trinity-plugins/BIN/ParaFly -c /scratch/user/akiarieg/Tickrnaseq/c4902.trinity.reads.fa.out/chrysalis/butterfly_commands -shuffle -CPU 28 -failed_cmds failed_butterfly_commands.44832.txt  2>tmp.44832.1636660306.stderr

  Errmsg:
  seq vertex T:W-1(V26087_159_D21005) not selected yet and has pred count: 2
  seq vertex AGGT:W636(V26088_7596_D21231) not selected yet and has pred count: 2
  seq vertex C:W-1(V26089_164_D21458) not selected yet and has pred count: 1
  seq vertex TA:W640(V26090_165_D21671) not selected yet and has pred count: 1
  seq vertex T:W-1(V26091_167_D21879) not selected yet and has pred count: 1
  seq vertex GA:W667(V26092_7604_D22077) not selected yet and has pred count: 1
  seq vertex C:W-1(V26093_170_D22261) not selected yet and has pred count: 1
  seq vertex G:W-1(V26094_7607_D22431) not selected yet and has pred count: 1
  seq vertex C:W-1(V26095_7608_D22596) not selected yet and has pred count: 1
  seq vertex GTATGCCCG:W282(V26096_7609_D22753) not selected yet and has pred count: 1
  seq vertex CT:W700(V32401_26_D13628) not selected yet and has pred count: 2

. . . . continues....

ERROR: after topo sort, still have edge unaccounted for: Edge(240995->48380,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(45410->45411,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(45457->45458,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(45740->45741,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(32407->32408,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(41914->41915,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(45418->45419,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(41956->41957,w:1.0)
ERROR: after topo sort, still have edge unaccounted for: Edge(32408->32409,w:1.0) 

Exception in thread "main" java.lang.RuntimeException: Error, graph contains at least one cycle and is not a DAG!
 at TopologicalSort.topoSortSeqVerticesDAG(TopologicalSort.java:112)
 at TransAssembly_allProbPaths.ZipMergeRounds(TransAssembly_allProbPaths.java:2612)
 at TransAssembly_allProbPaths.link_residual_INTER_component_unique_nodes(TransAssembly_allProbPaths.java:2868)
 at TransAssembly_allProbPaths.convert_path_DAG_to_SeqVertex_DAG(TransAssembly_allProbPaths.java:2515)
 at TransAssembly_allProbPaths.create_DAG_from_OverlapLayout(TransAssembly_allProbPaths.java:1797)
 at TransAssembly_allProbPaths.main(TransAssembly_allProbPaths.java:967)
warning, cmd: java -Xmx10G -Xms1G -Xss1G   -XX:ParallelGCThreads=28  -jar /usr/local/bin/Butterfly/Butterfly.jar -N 100000 -L 12 -F 500 -C /scratch/user/akiarieg/Tickrnaseq/c4902.trinity.reads.fa.out/chrysalis/Component_bins/Cbin2/c0.graph  --path_reinforcement_distance=25  --NO_EM_REDUCE  failed with ret: 256, going to retry.

--->

Trinity run failed. Must investigate error above.

brianjohnhaas commented 2 years ago

hi,

Can you tar up the folder /scratch/user/akiarieg/Tickrnaseq/c4902.trinity.reads.fa.out

and send it to me? I'll take a look.

Send to 'bhaas at broadinstitute.org'

thx,

~brian


--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas
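
Packaging the output folder for sharing could look roughly like the sketch below. The demo directory and archive name are assumptions made so the snippet runs anywhere; on the cluster, OUTDIR would point at the real c4902.trinity.reads.fa.out folder instead.

```shell
set -eu

# Stand-in for the real Trinity output folder (hypothetical demo data).
WORK="$(mktemp -d)"
OUTDIR="$WORK/c4902.trinity.reads.fa.out"
mkdir -p "$OUTDIR/chrysalis"
echo "placeholder" > "$OUTDIR/chrysalis/butterfly_commands"

# Archive the folder relative to its parent so the tarball unpacks
# into a single top-level directory.
ARCHIVE="$WORK/$(basename "$OUTDIR").tar.gz"
tar -czf "$ARCHIVE" -C "$(dirname "$OUTDIR")" "$(basename "$OUTDIR")"

# List the archive contents to confirm what will be sent.
tar -tzf "$ARCHIVE"
```

Archiving relative to the parent directory keeps the cluster-specific /scratch path out of the tarball.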

AlexGaithuma commented 2 years ago

Thanks @brianjohnhaas. I sent the file.

brianjohnhaas commented 2 years ago

Hi Alex,

It looks like you've got a very complex component there that's pushing Butterfly to its limits. I've tried running it with a lot of RAM and I'm getting stack overflows. If all the other components are finishing OK, then try running Trinity with --FORCE to have it wrap up what it could assemble. You can take that remaining component's reads.fa file and try assembling it with something else. Please keep this issue open and I'll continue to explore this to see if I can tackle it for a future software update.

best,

~brian
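
To see which components are the ones that failed, one option is to pull the -C argument out of each line of the failed-commands file. This is a minimal sketch, assuming the file holds one Butterfly command per line in the style of the log above; the demo file and the failed_components helper are hypothetical.

```shell
set -eu

# Demo input: one line in the style of the failed Butterfly command from the
# log above (a real run writes these to failed_butterfly_commands.<pid>.txt).
FAILED_CMDS="$(mktemp)"
cat > "$FAILED_CMDS" <<'EOF'
java -Xmx10G -Xms1G -Xss1G -XX:ParallelGCThreads=28 -jar /usr/local/bin/Butterfly/Butterfly.jar -N 100000 -L 12 -F 500 -C /scratch/user/akiarieg/Tickrnaseq/c4902.trinity.reads.fa.out/chrysalis/Component_bins/Cbin2/c0.graph --path_reinforcement_distance=25 --NO_EM_REDUCE
EOF

# Print the value that follows the -C option on each command line.
failed_components() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "-C") print $(i + 1) }' "$1" | sort -u
}

failed_components "$FAILED_CMDS"
```

Each printed path identifies a component whose reads can be set aside and tried with another assembler once the main run has been wrapped up with --FORCE.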


AlexGaithuma commented 2 years ago

Hi Brian,

Just to follow up on this: I investigated and found that this set of reads comes from a tandem repeat region of an intron (the data is tick RNA-seq data), which could help explain why it is problematic. All the reads match the Ixodes scapularis clone 01_E21 tandem repeat region (https://www.ncbi.nlm.nih.gov/nucleotide/GU318629.1?report=genbank&log$=nucltop&blast_rank=1&RID=TB1SRKYD013).

brianjohnhaas commented 2 years ago

Thanks, Alex. I saw that too -- the inchworm step reconstructed a couple of reasonably long contigs and I blast'd one of them at NCBI. I'm sure the repeat structure, coupled with any polymorphisms between repeat instances, is largely what's causing trouble for Butterfly here.
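
The "graph contains at least one cycle and is not a DAG" error in the log can be illustrated with a toy graph. This is not Butterfly's code: it is a small sketch using the coreutils tsort utility, with made-up node names, showing how a repeat that leads back to an already visited node leaves the graph without a valid topological order.

```shell
# Toy edge lists, one "from to" pair per line (node names are made up).
# In the second graph the edge C -> B models a repeat that loops back.
edges_without_repeat() { printf '%s\n' 'A B' 'B C' 'C D'; }
edges_with_repeat()    { printf '%s\n' 'A B' 'B C' 'C B' 'C D'; }

# A graph is a DAG if tsort can order it without reporting a loop.
# tsort reports loops on stderr, so an empty stderr means no cycle.
is_dag() {
  [ -z "$(tsort 2>&1 >/dev/null)" ]
}

if edges_without_repeat | is_dag; then echo "no repeat: DAG"; else echo "no repeat: cycle"; fi
if edges_with_repeat | is_dag; then echo "with repeat: DAG"; else echo "with repeat: cycle"; fi
```

The first graph sorts cleanly, while the second is reported as containing a loop, which is the same class of failure as the leftover edges listed after the topological sort in the error message.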

lachemontes commented 1 year ago

Hi Brian,

I have the same issue. My Trinity run is stuck at 99.9998%:

- retaining Trinity transcripts provided as input to salmon, w/o filtering (pre-salmon mode).
succeeded(469410) 99.9998% completed.

Additionally, I got the error that Butterfly failed.

This is the command line I ran:

module load bioinfo-tools trinity/2.9.1
module load bioinfo-tools jellyfish/2.2.6

Trinity --seqType fq --max_memory 110G \
 --samples_file /proj/snic2022-23-541/Ticks_project/Analysis/Trinity/STAR/files_Trinity.txt \
 --output ../Trinity_deNovo

I am working with RNA-seq data from another tick, similar to @AlexGaithuma; I have 128 GB of reads. Unfortunately, my job died because of the timeout. Chrysalis and inchworm seem OK.

Would you happen to have any suggestions to solve my problem?

Thanks!

brianjohnhaas commented 1 year ago

It's probably an endosymbiont or pathogen in the tick that Trinity is attempting to assemble. If you can figure out which read cluster is difficult to assemble (i.e., you can see the command that's running via 'ps -auxww | grep Butterfly'), then you can try tackling that set of reads separately to see what it is, or look at its current inchworm contigs to see what's there (just blast the long contigs at NCBI).

To get the assembly job to just finish up, you can kill the current job and then rerun it with the --FORCE option. It won't do any more assembling but rather just wrap up what it could assemble.

hope this helps,

~b
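
The two steps above could be sketched as follows. The butterfly_components and rerun_with_force helpers and the demo process line are hypothetical; the Trinity arguments are the ones from the command posted above, with --FORCE appended. The rerun is a dry run by default, so nothing is launched unless TRINITY is set.

```shell
set -eu

# Pull the -C <component> argument out of Butterfly command lines on stdin.
butterfly_components() {
  awk '/Butterfly\.jar/ { for (i = 1; i < NF; i++) if ($i == "-C") print $(i + 1) }' | sort -u
}

# On the cluster:  ps -auxww | butterfly_components
# Demo line (made-up process listing) so the sketch runs without Trinity:
printf '%s\n' 'user 4242 99.0 1.2 java -jar /path/to/Butterfly.jar -N 100000 -L 200 -C /tmp/demo/Cbin0/c7.graph' \
  | butterfly_components

# Dry run by default: the command is only printed via echo.
# Set TRINITY=Trinity in the environment to actually execute it.
TRINITY="${TRINITY:-echo Trinity}"

# Same command as the original run, plus --FORCE to wrap up what was assembled.
rerun_with_force() {
  $TRINITY --seqType fq --max_memory 110G \
    --samples_file /proj/snic2022-23-541/Ticks_project/Analysis/Trinity/STAR/files_Trinity.txt \
    --output ../Trinity_deNovo \
    --FORCE
}

rerun_with_force
```

Filtering on Butterfly.jar rather than the bare word Butterfly keeps the awk process's own command line out of the result.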


lachemontes commented 1 year ago

Thanks!

AlexGaithuma commented 1 year ago

@lachemontes I found that it is best to map to similar genomes first. Assemble the unmapped reads first, blast to confirm the non-tick contigs, and use that assembly to filter out non-tick genes.

lachemontes commented 1 year ago

Hey! Thanks, @AlexGaithuma. Which species' genome did you choose to map your reads to? I am working with Ixodes ricinus. If you are okay with sharing, what program are you using to map them? I use STAR. Would you happen to have a link to share with me on how to do what you suggest? I have been trying to assemble my transcriptome for a month and a half without success. Thank you very much.

AlexGaithuma commented 1 year ago

Hi @lachemontes, I use bbmap and map to several closely related genomes. First, build the reference indexes:

 bbmap.sh \
 threads=16 \
 build=1 \
 ref=Ref1.fasta

 bbmap.sh \
 threads=16 \
 build=2 \
 ref=Ref2.fasta

In your case, I would map to:

  1. Ixodes ricinus (assembly ASM97304v2)
  2. Ixodes scapularis (assembly ASM1692078v2)
  3. Ixodes persulcatus (assembly BIME_Iper_1.3)

Using bbmap, you retrieve the reads that did not map to the first genome and then map them to the second genome, and so on:

 # Map each sample's read pairs to the first reference (build=1).
 for FNAME in "$DIR"/data/*_1.fastq.gz
 do
 SAMPLE=$(basename "$FNAME" _1.fastq.gz)
 # File names follow the *_1/_2.fastq.gz pattern matched by the glob above.
 r1="$DIR/data/${SAMPLE}_1.fastq.gz"
 r2="$DIR/data/${SAMPLE}_2.fastq.gz"
 bbmap.sh \
 in="$r1" \
 in2="$r2" \
 build=1 \
 threads=8 \
 maxindel=200k \
 xs=us \
 sam=1.3 \
 -Xmx10g \
 outm="$WORKDIR/Ref1/${SAMPLE}.mapped.fq" \
 outu="$WORKDIR/Ref1/${SAMPLE}.unmapped.fq" \
 statsfile="$WORKDIR/Ref1/${SAMPLE}.mapstats.txt"
 done

Split the unmapped reads back into paired files using bbmap's reformat.sh script:

 # The unmapped reads from the step above are in one file per sample;
 # write them out as separate read 1 and read 2 files.
 for FNAME in "$DIR"/data/*_1.fastq.gz
 do
 SAMPLE=$(basename "$FNAME" _1.fastq.gz)
 reformat.sh \
 in="$WORKDIR/Ref1/${SAMPLE}.unmapped.fq" \
 out1="$WORKDIR/Ref1/${SAMPLE}.unmapped.1.fq" \
 out2="$WORKDIR/Ref1/${SAMPLE}.unmapped.2.fq"
 done

Then map the remaining unmapped reads to the second genome, and do the same for the third genome.
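
The chaining to the second and third references could be sketched as below, reusing the options and file layout from the commands above. The default WORKDIR and SAMPLE values and the chain_rounds helper are hypothetical, and the script is a dry run by default: it only prints the bbmap.sh and reformat.sh commands it would run.

```shell
set -eu

# Dry run by default: commands are printed, not executed.
# Set BBMAP=bbmap.sh and REFORMAT=reformat.sh to actually run them
# (the $WORKDIR/RefN output directories should exist beforehand).
BBMAP="${BBMAP:-echo bbmap.sh}"
REFORMAT="${REFORMAT:-echo reformat.sh}"
WORKDIR="${WORKDIR:-work}"      # hypothetical output root
SAMPLE="${SAMPLE:-sample1}"     # hypothetical sample name

# Reads that did not map to reference N-1 become the input for reference N.
chain_rounds() {
  for build in 2 3; do
    prev=$((build - 1))
    $BBMAP \
      in="$WORKDIR/Ref${prev}/${SAMPLE}.unmapped.1.fq" \
      in2="$WORKDIR/Ref${prev}/${SAMPLE}.unmapped.2.fq" \
      build="$build" threads=8 maxindel=200k -Xmx10g \
      outm="$WORKDIR/Ref${build}/${SAMPLE}.mapped.fq" \
      outu="$WORKDIR/Ref${build}/${SAMPLE}.unmapped.fq" \
      statsfile="$WORKDIR/Ref${build}/${SAMPLE}.mapstats.txt"
    # Split this round's unmapped reads into two files for the next round.
    $REFORMAT \
      in="$WORKDIR/Ref${build}/${SAMPLE}.unmapped.fq" \
      out1="$WORKDIR/Ref${build}/${SAMPLE}.unmapped.1.fq" \
      out2="$WORKDIR/Ref${build}/${SAMPLE}.unmapped.2.fq"
  done
}

chain_rounds
```

Printing the commands first makes it easy to check that each round reads the previous round's unmapped files before committing cluster time.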

Check the final unmapped reads. There should be far fewer of them. You can assemble them separately and blast the contigs to see whether any tick sequences remain. If there are, just map the reads to that assembly and retrieve them.

Finally, cat all the reads that map to tick sequences and assemble them.
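
Pooling the mapped reads could look like the sketch below. It assumes the Ref1 to Ref3 directory layout and the .mapped.fq naming from the mapping commands above; the tiny FASTQ records, the sample name and the .tick.fq output name are made up so the snippet runs on its own.

```shell
set -eu

# Demo data (hypothetical): one-read stand-ins for the per-reference
# mapped reads, so the sketch runs without any BBMap output present.
WORKDIR="$(mktemp -d)"
SAMPLE="sample1"
for ref in Ref1 Ref2 Ref3; do
  mkdir -p "$WORKDIR/$ref"
  printf '@%s_read/1\nACGT\n+\nIIII\n' "$ref" > "$WORKDIR/$ref/$SAMPLE.mapped.fq"
done

# Concatenate the reads that mapped to any of the tick references.
cat "$WORKDIR/Ref1/$SAMPLE.mapped.fq" \
    "$WORKDIR/Ref2/$SAMPLE.mapped.fq" \
    "$WORKDIR/Ref3/$SAMPLE.mapped.fq" > "$WORKDIR/$SAMPLE.tick.fq"

# FASTQ uses four lines per read, so the read count is lines / 4.
reads=$(( $(wc -l < "$WORKDIR/$SAMPLE.tick.fq") / 4 ))
echo "combined reads: $reads"
```

Because each read was either mapped or passed on as unmapped at every round, no read should appear twice in the pooled file.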

lachemontes commented 1 year ago

@AlexGaithuma, thank you so much for your suggestion. I will try it, and I hope this approach works for me! By the way, Ixodes ricinus (assembly ASM97304v2) is highly fragmented and has only 20% completeness for single-copy BUSCO genes, so I don't recommend using it for further analysis.