src-d / ml

sourced.ml is a library and a set of command-line tools for building and applying machine learning models on top of Universal Abstract Syntax Trees.

ValueError: RDD is empty #320

Open sakalouski opened 5 years ago

sakalouski commented 5 years ago

I installed the module as suggested and ran the command: srcml preprocrepos -m 50G,50G,50G -r siva --output ./test, where siva is the directory containing all the siva files. Changing the memory parameters makes no difference. My Spark is very old (1.3); could that be the reason? Is it runnable in the latest PySpark?

    /usr/local/lib64/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
      from ._conv import register_converters as _register_converters
    INFO:spark:Starting preprocess_repos-424fe007-f0db-48b7-863b-5a5b90ce5f63 on local[*]
    Ivy Default Cache set to: /home/b7066789/.ivy2/cache
    The jars for the packages stored in: /home/b7066789/.ivy2/jars
    :: loading settings :: url = jar:file:/home/b7066789/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    tech.sourced#engine added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found tech.sourced#engine;0.6.4 in central
        found io.netty#netty-all;4.1.17.Final in central
        found org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r in central
        found com.jcraft#jsch;0.1.54 in central
        found com.googlecode.javaewah#JavaEWAH;1.1.6 in central
        found org.apache.httpcomponents#httpclient;4.3.6 in central
        found org.apache.httpcomponents#httpcore;4.3.3 in central
        found commons-logging#commons-logging;1.1.3 in central
        found commons-codec#commons-codec;1.6 in central
        found org.slf4j#slf4j-api;1.7.2 in central
        found tech.sourced#siva-java;0.1.3 in central
        found org.bblfsh#bblfsh-client;1.8.2 in central
        found com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 in central
        found com.thesamet.scalapb#lenses_2.11;0.7.0-test2 in central
        found com.lihaoyi#fastparse_2.11;1.0.0 in central
        found com.lihaoyi#fastparse-utils_2.11;1.0.0 in central
        found com.lihaoyi#sourcecode_2.11;0.1.4 in central
        found com.google.protobuf#protobuf-java;3.5.0 in central
        found commons-io#commons-io;2.5 in central
        found io.grpc#grpc-netty;1.10.0 in central
        found io.grpc#grpc-core;1.10.0 in central
        found io.grpc#grpc-context;1.10.0 in central
        found com.google.code.gson#gson;2.7 in central
        found com.google.guava#guava;19.0 in central
        found com.google.errorprone#error_prone_annotations;2.1.2 in central
        found com.google.code.findbugs#jsr305;3.0.0 in central
        found io.opencensus#opencensus-api;0.11.0 in central
        found io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 in central
        found io.netty#netty-codec-http2;4.1.17.Final in central
        found io.netty#netty-codec-http;4.1.17.Final in central
        found io.netty#netty-codec;4.1.17.Final in central
        found io.netty#netty-transport;4.1.17.Final in central
        found io.netty#netty-buffer;4.1.17.Final in central
        found io.netty#netty-common;4.1.17.Final in central
        found io.netty#netty-resolver;4.1.17.Final in central
        found io.netty#netty-handler;4.1.17.Final in central
        found io.netty#netty-handler-proxy;4.1.17.Final in central
        found io.netty#netty-codec-socks;4.1.17.Final in central
        found com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 in central
        found io.grpc#grpc-stub;1.10.0 in central
        found io.grpc#grpc-protobuf;1.10.0 in central
        found com.google.protobuf#protobuf-java;3.5.1 in central
        found com.google.protobuf#protobuf-java-util;3.5.1 in central
        found com.google.api.grpc#proto-google-common-protos;1.0.0 in central
        found io.grpc#grpc-protobuf-lite;1.10.0 in central
        found org.rogach#scallop_2.11;3.0.3 in central
        found org.apache.commons#commons-pool2;2.4.3 in central
        found tech.sourced#enry-java;1.6.3 in central
        found org.xerial#sqlite-jdbc;3.21.0 in central
        found com.groupon.dse#spark-metrics;2.0.0 in central
        found io.dropwizard.metrics#metrics-core;3.1.2 in central
    :: resolution report :: resolve 1148ms :: artifacts dl 44ms
        :: modules in use:
        com.google.api.grpc#proto-google-common-protos;1.0.0 from central in [default]
        com.google.code.findbugs#jsr305;3.0.0 from central in [default]
        com.google.code.gson#gson;2.7 from central in [default]
        com.google.errorprone#error_prone_annotations;2.1.2 from central in [default]
        com.google.guava#guava;19.0 from central in [default]
        com.google.protobuf#protobuf-java;3.5.1 from central in [default]
        com.google.protobuf#protobuf-java-util;3.5.1 from central in [default]
        com.googlecode.javaewah#JavaEWAH;1.1.6 from central in [default]
        com.groupon.dse#spark-metrics;2.0.0 from central in [default]
        com.jcraft#jsch;0.1.54 from central in [default]
        com.lihaoyi#fastparse-utils_2.11;1.0.0 from central in [default]
        com.lihaoyi#fastparse_2.11;1.0.0 from central in [default]
        com.lihaoyi#sourcecode_2.11;0.1.4 from central in [default]
        com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from central in [default]
        com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 from central in [default]
        com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from central in [default]
        commons-codec#commons-codec;1.6 from central in [default]
        commons-io#commons-io;2.5 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
        io.grpc#grpc-context;1.10.0 from central in [default]
        io.grpc#grpc-core;1.10.0 from central in [default]
        io.grpc#grpc-netty;1.10.0 from central in [default]
        io.grpc#grpc-protobuf;1.10.0 from central in [default]
        io.grpc#grpc-protobuf-lite;1.10.0 from central in [default]
        io.grpc#grpc-stub;1.10.0 from central in [default]
        io.netty#netty-all;4.1.17.Final from central in [default]
        io.netty#netty-buffer;4.1.17.Final from central in [default]
        io.netty#netty-codec;4.1.17.Final from central in [default]
        io.netty#netty-codec-http;4.1.17.Final from central in [default]
        io.netty#netty-codec-http2;4.1.17.Final from central in [default]
        io.netty#netty-codec-socks;4.1.17.Final from central in [default]
        io.netty#netty-common;4.1.17.Final from central in [default]
        io.netty#netty-handler;4.1.17.Final from central in [default]
        io.netty#netty-handler-proxy;4.1.17.Final from central in [default]
        io.netty#netty-resolver;4.1.17.Final from central in [default]
        io.netty#netty-transport;4.1.17.Final from central in [default]
        io.opencensus#opencensus-api;0.11.0 from central in [default]
        io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 from central in [default]
        org.apache.commons#commons-pool2;2.4.3 from central in [default]
        org.apache.httpcomponents#httpclient;4.3.6 from central in [default]
        org.apache.httpcomponents#httpcore;4.3.3 from central in [default]
        org.bblfsh#bblfsh-client;1.8.2 from central in [default]
        org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r from central in [default]
        org.rogach#scallop_2.11;3.0.3 from central in [default]
        org.slf4j#slf4j-api;1.7.2 from central in [default]
        org.xerial#sqlite-jdbc;3.21.0 from central in [default]
        tech.sourced#engine;0.6.4 from central in [default]
        tech.sourced#enry-java;1.6.3 from central in [default]
        tech.sourced#siva-java;0.1.3 from central in [default]
        :: evicted modules:
        com.google.protobuf#protobuf-java;3.5.0 by [com.google.protobuf#protobuf-java;3.5.1] in [default]

    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   51  |   0   |   0   |   1   ||   50  |   0   |
    ---------------------------------------------------------------------

    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 50 already retrieved (0kB/18ms)
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    18/10/03 15:50:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    18/10/03 15:50:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
    18/10/03 15:50:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    INFO:engine:Initializing engine on siva
    INFO:ParquetSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> FieldsSelector -> ParquetSaver
    Traceback (most recent call last):
      File "/home/b7066789/.local/bin/srcml", line 11, in <module>
        sys.exit(main())
      File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/main.py", line 354, in main
        return handler(args)
      File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/utils/engine.py", line 87, in wrapped_pause
        return func(cmdline_args, *args, **kwargs)
      File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/cmd/preprocess_repos.py", line 24, in preprocess_repos
        .link(ParquetSaver(save_loc=args.output)) \
      File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/transformer.py", line 114, in execute
        head = node(head)
      File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/basic.py", line 292, in __call__
        rdd.toDF().write.parquet(self.save_loc)
      File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 58, in toDF
        return sparkSession.createDataFrame(self, schema, sampleRatio)
      File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 582, in createDataFrame
        rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
      File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 380, in _createFromRDD
        struct = self._inferSchema(rdd, samplingRatio)
      File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 351, in _inferSchema
        first = rdd.first()
      File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/rdd.py", line 1364, in first
        raise ValueError("RDD is empty")
    ValueError: RDD is empty
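The traceback shows the mechanics of the crash: toDF() triggers schema inference, schema inference calls rdd.first() to sample a row, and first() raises ValueError("RDD is empty") when the pipeline produced no rows at all. A minimal pure-Python sketch of that failure mode (FakeRDD is a hypothetical stand-in, not part of PySpark or sourced.ml):

```python
# Sketch of why an empty pipeline output crashes during toDF().
# FakeRDD is a made-up stand-in for a real RDD, used only to
# illustrate the behavior seen in the traceback above.

class FakeRDD:
    def __init__(self, rows):
        self._rows = list(rows)

    def first(self):
        # Mirrors pyspark.rdd.RDD.first, which raises exactly this
        # error when there is nothing to sample.
        if not self._rows:
            raise ValueError("RDD is empty")
        return self._rows[0]

    def is_empty(self):
        return not self._rows


def to_df(rdd):
    # Schema inference needs at least one row; this is the step that
    # blows up in pyspark/sql/session.py in the traceback.
    sample = rdd.first()
    return [sample]  # stand-in for a real DataFrame


empty = FakeRDD([])
try:
    to_df(empty)
except ValueError as exc:
    print(exc)  # prints: RDD is empty

# Checking for emptiness first turns the crash into a no-op:
if not empty.is_empty():
    df = to_df(empty)
```

In other words, the error is a symptom, not the root cause: something upstream (here, most likely the unsupported Spark 1.3 engine) produced zero rows before the Parquet write.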

vmarkovtsev commented 5 years ago

@zurk Can you please have a look?

zurk commented 5 years ago

@sakalouski, just to be sure, I tested srcml preprocrepos and it works with the proper Spark version, which is 2.2.1 (https://github.com/src-d/jgit-spark-connector#pre-requisites). So yes, I think the problem is the old Spark; we do not test against 1.x versions.
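Since the supported Spark range is narrow (2.2.x per the jgit-spark-connector pre-requisites, and 2.3.x is ruled out below), a fail-fast version gate at startup would surface the mismatch directly instead of as a confusing empty-RDD error deep in the pipeline. A sketch under those assumptions (the helper names are made up, not part of sourced.ml):

```python
# Hypothetical startup check: reject (Py)Spark versions outside the
# range this thread says is supported (2.x releases before 2.3).

def parse_version(version):
    """Turn a plain version string like '2.2.1' into a tuple (2, 2, 1)."""
    return tuple(int(part) for part in version.split(".")[:3])


def check_spark_version(version, minimum=(2, 0), too_new=(2, 3)):
    """Raise RuntimeError unless minimum <= version < too_new."""
    parsed = parse_version(version)
    # Tuple comparison handles mixed lengths: (2, 3, 2) > (2, 3).
    if not (minimum <= parsed < too_new):
        raise RuntimeError(
            "Spark %s is unsupported; install a 2.2.x release" % version)
    return parsed


check_spark_version("2.2.1")    # passes
# check_spark_version("1.3.0")  # would raise RuntimeError (too old)
# check_spark_version("2.3.2")  # would raise RuntimeError (API break)
```

The real version string would come from pyspark.__version__ at runtime; the check itself is plain tuple comparison, so it needs no Spark installed.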

sakalouski commented 5 years ago

Thank you! How about the most recent Spark version (2.3.2)?


vmarkovtsev commented 5 years ago

We don't support 2.3.x because they changed the API in a non-backward-compatible way.