BoredMa opened this issue 6 years ago
Hi,
I think for the complete bwa mem command, you also need the input stream.
Here "-tool" parameter gives part of the command: "/usr/local/hk_soft/bwa-0.7.17/bwa mem" "-toolparam" parameter gives part of the command: "/biodata/bwa/reference.fasta" Together they formed a command "/usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta"
Here comes the tricky part: -fastq /biodata/bwa/ERR764952.fastq only reads the data into the Spark RDD. To send the data to bwa, we need to specify the input stream in the command, which is STDIN. So the parameter should be:
-toolparam "/biodata/bwa/reference.fasta /dev/stdin", so that the complete command is: "/usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta /dev/stdin".
Thanks for your reply. I did as you said, but I have another question.
As you said, "-fastq /biodata/bwa/ERR764952.fastq only reads the data into the Spark RDD. To send the data to bwa, we need to specify the input stream in the command, which is STDIN."
Then how can I use an external tool like dsk, which requires a -file parameter?
The command I used, shown below, produces an error:
./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/insert_335_B/insert_180* -tofasta -outfile /biodata/insert_335_B/output -tool "/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk" -toolparam " -file /dev/stdin -kmer-size 11 -abundance-min 1"
Can you give me an example of how to invoke an external tool that requires -file?
Hi BoredMa,
This is a special situation: dsk's code creates a special file handle called "Bank", which supports operations such as jumping back and forth in the file. Such a file handle cannot read from an input stream like /dev/stdin.
A workaround is to create a randomly named file in the script and write the RDD data into it, so that dsk can read directly from the local file system. Example:
Create a shell script, random_file.sh:
#!/bin/bash
# Created by rhinempi on 04/10/18.
#
# Sparkhit
#
# Copyright (c) 2015-2015
#
# Liren Huang <huanglr at cebitec.uni-bielefeld.de>
#
# SparkHit is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along
# with this program. If not, see <http://www.gnu.org/licenses>.

file=$RANDOM          # random name for the temporary file
cat > /tmp/$file.fa   # write the RDD data arriving on STDIN into the file
/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk -file /tmp/$file.fa -kmer-size 11 -abundance-min 1
rm -rf /tmp/$file.fa  # clean up the temporary file
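As a side note, a slightly more collision-safe variant of the same idea (a sketch, assuming GNU mktemp is available on the worker nodes; this is not from the original thread):

#!/bin/bash
# Use mktemp instead of $RANDOM so that two tasks running on the same
# node can never race for the same temporary file name.
tmp=$(mktemp /tmp/rdd_XXXXXX.fa)   # unique temporary file with a .fa suffix
cat > "$tmp"                       # drain the RDD partition from STDIN
/usr/local/hk_soft/dsk-2.1.0-Linux/bin/dsk -file "$tmp" -kmer-size 11 -abundance-min 1
rm -f "$tmp"                       # clean up the temporary file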
In the Sparkhit command, simply invoke the random_file.sh script (the -kmer-size and -abundance-min options are now hard-coded inside the script, so they are not passed on the command line):
>./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/insert_335_B/insert_180* -tofasta -outfile /biodata/insert_335_B/output -tool "/usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh"
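One practical detail, inferred from the note below about copying bwa to all nodes (it is not stated explicitly in the thread): the script must be executable and present at the same path on every worker node, for example:

chmod +x /usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh
# hypothetical worker host names; replace with your own:
for host in node1 node2 node3; do
  scp /usr/local/hk_soft/dsk-2.1.0-Linux/bin/random_file.sh $host:/usr/local/hk_soft/dsk-2.1.0-Linux/bin/
done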
Let me know if there are any more questions.
Liren
I ran the command as follows:
./bin/sparkhit piper --master yarn --deploy-mode client --num-executors 5 --executor-cores 30 --executor-memory 10g -fastq /biodata/bwa/ERR764952.fastq -outfile /biodata/bwa/output -tool "/usr/local/hk_soft/bwa-0.7.17/bwa mem" -toolparam "/biodata/bwa/reference.fasta"
but I got this error. Did I do anything wrong? I have copied the bwa binaries to all nodes, and the fastq and fasta files are in HDFS.
Caused by: java.lang.IllegalStateException: Subprocess exited with status 1. Command ran: /usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta
    at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:178)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
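A quick way to see bwa's own error message behind a PipedRDD "exited with status 1" (a debugging sketch; the local read sample and its path are illustrative assumptions, not from the thread) is to run the exact subprocess command on a worker node and feed it some reads on STDIN:

# 400 lines = 100 fastq records, enough to exercise the pipe:
head -n 400 ERR764952.fastq | /usr/local/hk_soft/bwa-0.7.17/bwa mem /biodata/bwa/reference.fasta /dev/stdin
# Run without the trailing /dev/stdin argument and bwa mem sees no reads file,
# prints its usage text, and exits non-zero, matching the status 1 reported above.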