statgenetics / seqspark

SEQSpark documentation
https://statgenetics.github.io/seqspark/
Apache License 2.0
18 stars · 6 forks

hdfs not available in my cluster #3

Open trptyrphe11 opened 6 years ago

trptyrphe11 commented 6 years ago

Hi, I am using our HPC cluster, which is equipped with Spark, to run SEQSpark. During installation, SEQSpark itself installed successfully, but the databases were not downloaded because we don't have HDFS available. Is there an alternative way to download the dependency databases to run SEQSpark? Thank you.

zhangdi-devel commented 6 years ago

Hi,

We are working on allowing SEQSpark to run locally. It should be available very soon. 

In the meantime, could you try installing a single-server HDFS on your HPC?

    Here is the official one-page tutorial for installing Hadoop: http://hadoop.apache.org/docs/r2.9.0/hadoop-project-dist/hadoop-common/SingleCluster.html

You can install it without root privileges.
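For orientation, the pseudo-distributed setup in that tutorial essentially boils down to pointing `fs.defaultFS` at localhost and formatting the namenode. A minimal sketch (the port and paths are the tutorial's defaults, not SEQSpark requirements):

```xml
<!-- etc/hadoop/core-site.xml: minimal single-node (pseudo-distributed) config -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

After editing this file (and setting `JAVA_HOME` in `etc/hadoop/hadoop-env.sh`), run `bin/hdfs namenode -format` once and then `sbin/start-dfs.sh`. None of this needs root as long as Hadoop is unpacked under your home directory, though passwordless SSH to localhost must be configured.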

trptyrphe11 commented 6 years ago

I have manually downloaded the databases now. Can I run SEQSpark with local file paths instead of putting the files into HDFS? If so, does it compromise anything (speed, accuracy, etc.)? Thank you.

zhangdi-devel commented 6 years ago

Hi,

I think you can run the SEQSpark develop branch with local paths. Of course, the I/O speed will be limited by the underlying filesystem. In my experience, the performance of NFS, the shared filesystem usually used in HPC, is not very good.

trptyrphe11 commented 6 years ago

Yes, I am experimenting with pointing the file paths at local paths, and yes, our cluster uses NFS. I tried 8 cores with m_mem set to 12G but got a Java out-of-memory error. I finished the summarize-genotype step with 2 cores and 40G, but it is much slower. Do you have any recommendation on cores/memory for such a setting? Or is there a way to set the Java Xmx parameter in SEQSpark to cap its maximum memory use and avoid such out-of-memory errors? Thank you.

zhangdi-devel commented 6 years ago

Hi,

Please be aware that NFS is not a distributed filesystem, so although you can access the files from different nodes, the speed is still limited by the I/O of the NFS server. If you have a large dataset, e.g. 100 GB+, setting up a new Hadoop + Spark cluster might be a better idea.

In general, I recommend matching each executor core with 4 GB of memory. To configure memory for Spark, please visit this page:

     https://spark.apache.org/docs/2.1.2/configuration.html

You need to set up YARN or Mesos if you are not using Spark standalone mode.
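As a concrete sketch of the 4 GB-per-core rule of thumb: an 8-core executor would need roughly 32 GB. In `spark-defaults.conf` (or as equivalent `--conf` flags to `spark-submit`) that might look like this; the sizes below are illustrative, not SEQSpark-prescribed values:

```properties
# spark-defaults.conf — illustrative sizing: 8 cores x ~4 GB per executor
spark.executor.cores    8
spark.executor.memory   32g
spark.driver.memory     4g
```

Note that `spark.executor.memory` sets the executor JVM heap (`-Xmx`), which is usually the knob behind the "Java memory error" described above; there is no separate SEQSpark-specific Xmx setting needed.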

Di

trptyrphe11 commented 6 years ago

Thanks. What about data stored on S3 (say, running SEQSpark on Databricks)?

zhangdi-devel commented 6 years ago

Hi,

I personally do not have direct experience with S3. However, it is a distributed filesystem, and I think the performance depends on how well it is supported by Spark. I would be very glad if you tried it and shared your experience with us.

Di

trptyrphe11 commented 6 years ago

Thanks, Di. If I go that path I will share my experience with you.

trptyrphe11 commented 6 years ago

Hi Di, I got multiple errors when running the single-variant association step. I removed the QC chunk from the example in the manual but kept the rest the same. The error message I got is:

    18/04/03 10:19:00 INFO ds.Phenotype$: creating phenotype dataframe from simulated.tsv
    18/04/03 10:19:10 INFO worker.Import$: start import ...
    18/04/03 10:19:10 INFO worker.Import$: using all variants
    18/04/03 10:19:10 INFO worker.Import$: using filter: true
    18/04/03 10:19:10 INFO worker.Variants$: decompose multi-allelic variants
    18/04/03 10:19:10 INFO worker.QualityControl$: start quality control
    18/04/03 10:19:10 INFO worker.Genotypes$: start genotype QC
    18/04/03 10:19:10 INFO worker.Genotypes$: no need to perform genotype QC
    18/04/03 10:29:02 INFO worker.QualityControl$: 872217 variants before QC
    18/04/03 10:29:02 WARN internal.Logging$class: Lost task 3.0 in stage 2.0 (TID 1025, 172.16.22.38, executor 2): java.util.NoSuchElementException: key not found: SS_RawGeno
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.dizhang.seqspark.worker.QualityControl$$anonfun$3.apply(QualityControl.scala:135)
        at org.dizhang.seqspark.worker.QualityControl$$anonfun$3.apply(QualityControl.scala:135)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
        at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1012)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    18/04/03 10:29:02 ERROR internal.Logging$class: Task 9 in stage 2.0 failed 4 times; aborting job
    18/04/03 10:29:02 ERROR seqspark.SingleStudy$: Something went wrong, exit

zhangdi-devel commented 6 years ago

Hi,

It seems like an old bug that occurs when you don't perform any genotype-level QC.

Could you please pull the latest code and try again? 

P.S. I just merged the develop branch into master, so please use the master branch.

trptyrphe11 commented 6 years ago

The latest version works, thanks. Can you specify how to write the conf parameters when using an external database, e.g., setting maf.source to ExAC, or annotating with the CADD database for rare-variant analysis? Since I saved the databases in my own file system, shall I just change the path variable in the reference.conf file to something like /home/db/filename? Thank you.
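For illustration only, a local-path override along the lines the question describes might look like the HOCON fragment below. The key names (`annotation.db`, `CADD.path`, `ExAC.path`) are assumptions inferred from the question, not verified against the SEQSpark source; consult the shipped reference.conf for the actual keys.

```hocon
# reference.conf — hypothetical local-path override (key names assumed, not verified)
seqspark {
  annotation {
    db {
      # local filesystem paths instead of hdfs:// URIs
      CADD.path = "/home/db/filename"
      ExAC.path = "/home/db/filename"
    }
  }
}
```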