BoredMa opened this issue 6 years ago
Hi,
I think for the complete bwa mem command, you also need the input stream.
Here "-tool" parameter gives part of the command: "/usr/local/hk_soft/bwa-0.7.17/bwa mem" "-toolparam" parameter gives part of the command: "/biodata/bwa/reference.fasta" Together they formed a command "/usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta"
Here comes the tricky part: -fastq /biodata/bwa/ERR764952.fastq only reads the data into the Spark RDD. To send the data to bwa, we need to specify the input stream in the command, which is STDIN. So the parameter should be:
-toolparam "/biodata/bwa/reference.fasta /dev/stdin", so that the complete command is: "/usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta /dev/stdin".
Thanks for your reply. I did as you said, but I have another question.
As you said, "-fastq /biodata/bwa/ERR764952.fastq only reads the data into the Spark RDD. To send the data to bwa, we need to specify the input stream in the command, which is STDIN."
Then how can I use an external tool like dsk, which requires a -file parameter?
The command I used, shown below, produces an error:
./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/insert_335_B/insert_180* -tofasta -outfile /biodata/insert_335_B/output -tool "/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk" -toolparam " -file /dev/stdin -kmer-size 11 -abundance-min 1"
Can you give me an example of how to invoke an external tool that requires -file?
Hi BoredMa,
This is a special situation: dsk's code creates a special file handle called "Bank", which supports operations such as jumping back and forth in the file. Such a file handle cannot read from an input stream like /dev/stdin.
A workaround is to create a randomly named file in the script and write the RDD data into it, so that dsk can read directly from the local file system. Example:
Create a shell script, random_file.sh:
#!/bin/bash
# Created by rhinempi on 04/10/18.
#
# Sparkhit
#
# Copyright (c) 2015-2015
#
# Liren Huang <huanglr at cebitec.uni-bielefeld.de>
#
# SparkHit is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along
# with this program. If not, see <http://www.gnu.org/licenses>.

file=$RANDOM          # random name for the temporary file
cat > /tmp/$file.fa   # write the RDD data arriving on STDIN into the file
/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk -file /tmp/$file.fa -kmer-size 11 -abundance-min 1
rm -rf /tmp/$file.fa  # clean up the temporary file
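As a side note, a slightly more collision-safe variant of the same idea (a sketch, assuming GNU mktemp is available on the worker nodes; this is not from the original thread):

#!/bin/bash
# Use mktemp instead of $RANDOM so that two tasks running on the same
# node can never race for the same temporary file name.
tmp=$(mktemp /tmp/rdd_XXXXXX.fa)   # unique temporary file with a .fa suffix
cat > "$tmp"                       # drain the RDD partition from STDIN
/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk -file "$tmp" -kmer-size 11 -abundance-min 1
rm -f "$tmp"                       # clean up the temporary file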
In the Sparkhit command, simply invoke the random_file.sh script (the -kmer-size and -abundance-min options are now hard-coded inside the script, so they are not passed on the command line):
>./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/insert_335_B/insert_180* -tofasta -outfile /biodata/insert_335_B/output -tool "/usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh"
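One practical detail, inferred from the note below about copying bwa to all nodes (it is not stated explicitly in the thread): the script must be executable and present at the same path on every worker node, for example:

chmod +x /usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh
# hypothetical worker host names; replace with your own:
for host in node1 node2 node3; do
  scp /usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh $host:/usr/local/hk_soft/dsk-2.1.0-Linux/bin/
done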
Let me know if there are any more questions.
Liren
I ran the command as follows:
./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/bwa/ERR764952.fastq -outfile /biodata/bwa/output -tool "/usr/local/hk_soft/bwa-0.7.17/bwa mem" -toolparam "/biodata/bwa/reference.fasta"
but I got this error. Did I do anything wrong? I have copied the bwa binaries to all nodes, and the fastq and fasta files are in HDFS.
Caused by: java.lang.IllegalStateException: Subprocess exited with status 1. Command ran: /usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta
    at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:178)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
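A quick way to see bwa's own error message behind a PipedRDD "exited with status 1" (a debugging sketch; the local read sample and its path are illustrative assumptions, not from the thread) is to run the exact subprocess command on a worker node and feed it some reads on STDIN:

# 400 lines = 100 fastq records, enough to exercise the pipe:
head -n 400 ERR764952.fastq | /usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta /dev/stdin
# Run without the trailing /dev/stdin argument and bwa mem sees no reads file,
# prints its usage text, and exits non-zero, matching the status 1 reported above.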