[Open] sakalouski opened this issue 5 years ago
@zurk Can you please have a look?
@sakalouski, just to be sure, I tested `srcml preprocrepos` and it works with the proper Spark version, which is 2.2.1 (https://github.com/src-d/jgit-spark-connector#pre-requisites). So yes, I think the problem is the old Spark. We do not test against 1.x versions.
Thank you! How about the most recent Spark version (2.3.2)?
We don't support 2.3.x because they changed the API in a non-backward-compatible way.
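To make the supported range concrete: per the pre-requisites linked above, jgit-spark-connector targets the Spark 2.2.x line, so both 1.x and 2.3+ are out. A minimal sketch of a pre-flight version gate (the helper is hypothetical, not part of sourced-ml):

```python
def spark_version_supported(version: str) -> bool:
    """Return True only for the Spark 2.2.x line that jgit-spark-connector targets."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (2, 2)

# Checks against the versions discussed in this thread:
print(spark_version_supported("2.2.1"))  # True  -- the documented pre-requisite
print(spark_version_supported("1.3.0"))  # False -- too old
print(spark_version_supported("2.3.2"))  # False -- API changed incompatibly
```

In practice you would compare against the running `pyspark` version (e.g. `pyspark.__version__` or `sc.version`) before launching the pipeline.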
I have installed the module as suggested and ran the command: `srcml preprocrepos -m 50G,50G,50G -r siva --output ./test`, where `siva` is the directory containing all the siva files. The memory parameters do not change anything. My Spark is very old (1.3); could that be the reason? Is it runnable with the latest pyspark?
```
/usr/local/lib64/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:spark:Starting preprocess_repos-424fe007-f0db-48b7-863b-5a5b90ce5f63 on local[*]
Ivy Default Cache set to: /home/b7066789/.ivy2/cache
The jars for the packages stored in: /home/b7066789/.ivy2/jars
:: loading settings :: url = jar:file:/home/b7066789/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
tech.sourced#engine added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
  confs: [default]
  found tech.sourced#engine;0.6.4 in central
  found io.netty#netty-all;4.1.17.Final in central
  found org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r in central
  found com.jcraft#jsch;0.1.54 in central
  found com.googlecode.javaewah#JavaEWAH;1.1.6 in central
  found org.apache.httpcomponents#httpclient;4.3.6 in central
  found org.apache.httpcomponents#httpcore;4.3.3 in central
  found commons-logging#commons-logging;1.1.3 in central
  found commons-codec#commons-codec;1.6 in central
  found org.slf4j#slf4j-api;1.7.2 in central
  found tech.sourced#siva-java;0.1.3 in central
  found org.bblfsh#bblfsh-client;1.8.2 in central
  found com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 in central
  found com.thesamet.scalapb#lenses_2.11;0.7.0-test2 in central
  found com.lihaoyi#fastparse_2.11;1.0.0 in central
  found com.lihaoyi#fastparse-utils_2.11;1.0.0 in central
  found com.lihaoyi#sourcecode_2.11;0.1.4 in central
  found com.google.protobuf#protobuf-java;3.5.0 in central
  found commons-io#commons-io;2.5 in central
  found io.grpc#grpc-netty;1.10.0 in central
  found io.grpc#grpc-core;1.10.0 in central
  found io.grpc#grpc-context;1.10.0 in central
  found com.google.code.gson#gson;2.7 in central
  found com.google.guava#guava;19.0 in central
  found com.google.errorprone#error_prone_annotations;2.1.2 in central
  found com.google.code.findbugs#jsr305;3.0.0 in central
  found io.opencensus#opencensus-api;0.11.0 in central
  found io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 in central
  found io.netty#netty-codec-http2;4.1.17.Final in central
  found io.netty#netty-codec-http;4.1.17.Final in central
  found io.netty#netty-codec;4.1.17.Final in central
  found io.netty#netty-transport;4.1.17.Final in central
  found io.netty#netty-buffer;4.1.17.Final in central
  found io.netty#netty-common;4.1.17.Final in central
  found io.netty#netty-resolver;4.1.17.Final in central
  found io.netty#netty-handler;4.1.17.Final in central
  found io.netty#netty-handler-proxy;4.1.17.Final in central
  found io.netty#netty-codec-socks;4.1.17.Final in central
  found com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 in central
  found io.grpc#grpc-stub;1.10.0 in central
  found io.grpc#grpc-protobuf;1.10.0 in central
  found com.google.protobuf#protobuf-java;3.5.1 in central
  found com.google.protobuf#protobuf-java-util;3.5.1 in central
  found com.google.api.grpc#proto-google-common-protos;1.0.0 in central
  found io.grpc#grpc-protobuf-lite;1.10.0 in central
  found org.rogach#scallop_2.11;3.0.3 in central
  found org.apache.commons#commons-pool2;2.4.3 in central
  found tech.sourced#enry-java;1.6.3 in central
  found org.xerial#sqlite-jdbc;3.21.0 in central
  found com.groupon.dse#spark-metrics;2.0.0 in central
  found io.dropwizard.metrics#metrics-core;3.1.2 in central
:: resolution report :: resolve 1148ms :: artifacts dl 44ms
  :: modules in use:
  com.google.api.grpc#proto-google-common-protos;1.0.0 from central in [default]
  com.google.code.findbugs#jsr305;3.0.0 from central in [default]
  com.google.code.gson#gson;2.7 from central in [default]
  com.google.errorprone#error_prone_annotations;2.1.2 from central in [default]
  com.google.guava#guava;19.0 from central in [default]
  com.google.protobuf#protobuf-java;3.5.1 from central in [default]
  com.google.protobuf#protobuf-java-util;3.5.1 from central in [default]
  com.googlecode.javaewah#JavaEWAH;1.1.6 from central in [default]
  com.groupon.dse#spark-metrics;2.0.0 from central in [default]
  com.jcraft#jsch;0.1.54 from central in [default]
  com.lihaoyi#fastparse-utils_2.11;1.0.0 from central in [default]
  com.lihaoyi#fastparse_2.11;1.0.0 from central in [default]
  com.lihaoyi#sourcecode_2.11;0.1.4 from central in [default]
  com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from central in [default]
  com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 from central in [default]
  com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from central in [default]
  commons-codec#commons-codec;1.6 from central in [default]
  commons-io#commons-io;2.5 from central in [default]
  commons-logging#commons-logging;1.1.3 from central in [default]
  io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
  io.grpc#grpc-context;1.10.0 from central in [default]
  io.grpc#grpc-core;1.10.0 from central in [default]
  io.grpc#grpc-netty;1.10.0 from central in [default]
  io.grpc#grpc-protobuf;1.10.0 from central in [default]
  io.grpc#grpc-protobuf-lite;1.10.0 from central in [default]
  io.grpc#grpc-stub;1.10.0 from central in [default]
  io.netty#netty-all;4.1.17.Final from central in [default]
  io.netty#netty-buffer;4.1.17.Final from central in [default]
  io.netty#netty-codec;4.1.17.Final from central in [default]
  io.netty#netty-codec-http;4.1.17.Final from central in [default]
  io.netty#netty-codec-http2;4.1.17.Final from central in [default]
  io.netty#netty-codec-socks;4.1.17.Final from central in [default]
  io.netty#netty-common;4.1.17.Final from central in [default]
  io.netty#netty-handler;4.1.17.Final from central in [default]
  io.netty#netty-handler-proxy;4.1.17.Final from central in [default]
  io.netty#netty-resolver;4.1.17.Final from central in [default]
  io.netty#netty-transport;4.1.17.Final from central in [default]
  io.opencensus#opencensus-api;0.11.0 from central in [default]
  io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 from central in [default]
  org.apache.commons#commons-pool2;2.4.3 from central in [default]
  org.apache.httpcomponents#httpclient;4.3.6 from central in [default]
  org.apache.httpcomponents#httpcore;4.3.3 from central in [default]
  org.bblfsh#bblfsh-client;1.8.2 from central in [default]
  org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r from central in [default]
  org.rogach#scallop_2.11;3.0.3 from central in [default]
  org.slf4j#slf4j-api;1.7.2 from central in [default]
  org.xerial#sqlite-jdbc;3.21.0 from central in [default]
  tech.sourced#engine;0.6.4 from central in [default]
  tech.sourced#enry-java;1.6.3 from central in [default]
  tech.sourced#siva-java;0.1.3 from central in [default]
  :: evicted modules:
  com.google.protobuf#protobuf-java;3.5.0 by [com.google.protobuf#protobuf-java;3.5.1] in [default]
:: retrieving :: org.apache.spark#spark-submit-parent
  confs: [default]
  0 artifacts copied, 50 already retrieved (0kB/18ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/03 15:50:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/03 15:50:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
18/10/03 15:50:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
INFO:engine:Initializing engine on siva
INFO:ParquetSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> FieldsSelector -> ParquetSaver
Traceback (most recent call last):
  File "/home/b7066789/.local/bin/srcml", line 11, in <module>
    sys.exit(main())
  File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/main.py", line 354, in main
    return handler(args)
  File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/utils/engine.py", line 87, in wrapped_pause
    return func(cmdline_args, *args, **kwargs)
  File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/cmd/preprocess_repos.py", line 24, in preprocess_repos
    .link(ParquetSaver(save_loc=args.output)) \
  File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/transformer.py", line 114, in execute
    head = node(head)
  File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/basic.py", line 292, in __call__
    rdd.toDF().write.parquet(self.save_loc)
  File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 58, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 582, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 380, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 351, in _inferSchema
    first = rdd.first()
  File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/rdd.py", line 1364, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty
```
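For what it's worth, the final error is easy to reproduce in isolation: `toDF()` infers the DataFrame schema by sampling the RDD, schema inference starts with `first()`, and pyspark's `RDD.first()` raises `ValueError("RDD is empty")` when nothing reached it, here presumably because no rows survived the pipeline up to `ParquetSaver`. A Spark-free sketch of that behaviour (the `first` helper below is a stand-in for `pyspark.rdd.RDD.first`, not the real implementation):

```python
def first(iterable):
    """Mimic pyspark.RDD.first(): return the first element, or raise as rdd.py does."""
    for item in iterable:
        return item
    raise ValueError("RDD is empty")

print(first([{"path": "a.py"}]))  # a non-empty "RDD" yields its first row

try:
    first([])  # an empty "RDD", as in the traceback above
except ValueError as err:
    print(err)  # RDD is empty
```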