giaosudau closed this issue 4 years ago
Is `sqlContext` here a `MemSQLContext` or a plain `SQLContext`? Also, what is the reported schema of `result`, and the schema of the table `dev_output.aggregate` (if it already exists)?
`sqlContext` here is a plain `SQLContext`. I import

import org.apache.spark.sql.memsql.SparkImplicits._

to get `saveToMemSQL`.
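A rough sketch of what that import enables, assuming the 1.x connector implicits; the exact argument shape varies by connector version, and `result` is the grouped DataFrame discussed below:

// Sketch only: with the implicits in scope the DataFrame gains a
// saveToMemSQL method. "dev_output" / "aggregate" match the table in
// this issue, but the parameter list differs across 1.x versions.
result.saveToMemSQL("dev_output", "aggregate")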
The schema of the table `aggregate` is:
CREATE TABLE `aggregate` (
`date` bigint(20) DEFAULT NULL,
`networkId` bigint(20) DEFAULT NULL,
`creativeId` bigint(20) DEFAULT NULL,
`sectionId` bigint(20) DEFAULT NULL,
`zoneId` bigint(20) DEFAULT NULL,
`formatId` int(11) DEFAULT NULL,
`templateId` bigint(20) DEFAULT NULL,
`advertiserId` bigint(20) DEFAULT NULL,
`campaignId` bigint(20) DEFAULT NULL,
`paymentModel` int(11) DEFAULT NULL,
`adDefault` int(11) NOT NULL DEFAULT '0',
`websiteId` bigint(20) NOT NULL DEFAULT '0',
`placementId` int(11) NOT NULL DEFAULT '0',
`topicId` int(11) NOT NULL DEFAULT '0',
`interestId` int(11) NOT NULL DEFAULT '0',
`inMarket` int(11) NOT NULL DEFAULT '0',
`locationId` bigint(20) DEFAULT NULL,
`osId` int(11) DEFAULT NULL,
`browserId` int(11) DEFAULT NULL,
`deviceTypeId` int(11) DEFAULT NULL,
`deviceModelId` int(11) DEFAULT NULL,
`genderId` int(11) DEFAULT NULL,
`ageId` int(11) NOT NULL DEFAULT '0',
`impression` bigint(20) NOT NULL DEFAULT '0',
`trueImpression` bigint(20) NOT NULL DEFAULT '0',
`click` bigint(20) NOT NULL DEFAULT '0',
`revenue` double DEFAULT NULL,
`proceeds` double DEFAULT NULL,
`spent` double DEFAULT NULL,
`memsql_insert_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY `memsql_insert_time` (`memsql_insert_time`)
/*!90618 , SHARD KEY () */
)
It works if I save to a Parquet file, then read it back and call `saveToMemSQL`.
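A rough sketch of that workaround, assuming Spark 1.5-era APIs; the Parquet path is hypothetical:

// Materializing to Parquet cuts the lineage of the grouped plan, so the
// later saveToMemSQL reads from a fresh scan of the file instead.
result.write.parquet("/tmp/aggregate_tmp") // hypothetical path
val reloaded = sqlContext.read.parquet("/tmp/aggregate_tmp")
reloaded.saveToMemSQL("dev_output", "aggregate") // argument shape varies by connector version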
What happens if you call `result.collect()` instead of `result.saveToMemSQL`? The stack is coming from here, which is using Spark's partition iterator.
It's OK if I call collect or save to a Parquet file. I think the error is because the way you allocate memory for the task is not correct, and the error occurs when the task has no memory to release. You can try it yourself with a simple query using GROUP BY.
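A minimal sketch of that suggested repro; the source table and its columns are hypothetical, and the saveToMemSQL call is hedged as above:

// "logs" and its columns are made up; per this report, any grouped
// aggregate written with saveToMemSQL should trigger the failure.
val result = sqlContext.sql(
  "SELECT networkId, SUM(impression) AS impression FROM logs GROUP BY networkId")
result.saveToMemSQL("dev_output", "aggregate")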
Memory allocation is handled by Spark; `saveToMemSQL` doesn't change that.
`saveToMemSQL` retrieves rows from the DataFrame at the partition level. The implementation can be distilled into something like this:
result.foreachPartition(partition => {
  for (row <- partition) {
    println(row) // normally this row is inserted into MemSQL
  }
})
I'll try to repro this, but in the meantime could you try the above as well?
Any update on this issue? @choochootrain We are hitting this issue too, and we are using `SQLContext`, not `MemSQLContext`.
@giaosudau How did you overcome this issue?
@shashankgowdal Like I said, it's a bug. You should avoid the GROUP BY clause, or save to another storage format before storing to MemSQL (as in the Parquet workaround sketched above).
@giaosudau Did you try the above as well? The stack is entirely in Spark land, so I'm curious whether it manifests without any MemSQL code.
Hi all, I am using it differently but end up with the same issue:
grouped_df = df.groupBy(['city', 'month']).agg({'*': 'count', 'cost': 'sum'})
Is there any solution for this yet?
@rendybjunior I'm curious what happens when you do

grouped_df.foreachPartition(partition => {
  for (row <- partition) {
    println(row) // normally this is inserted into memsql
  }
})

This is the same DataFrame operation that `saveToMemSQL` performs, but printing rows instead of inserting them with JDBC.
Please test with the new beta version of our connector. https://github.com/memsql/memsql-spark-connector/tree/3.0.0-beta
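The 3.0 connector replaces the implicits with Spark's data source API; a rough sketch, with placeholder connection values and option names as of the 3.0 README:

import org.apache.spark.sql.SaveMode

// Placeholder endpoint/credentials; set these for your cluster.
spark.conf.set("spark.datasource.memsql.ddlEndpoint", "memsql-master:3306")
spark.conf.set("spark.datasource.memsql.user", "admin")

result.write
  .format("memsql")
  .mode(SaveMode.Overwrite)
  .save("dev_output.aggregate") // database.table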
It's OK with a simple query, but when I add more fields to calculate and group by, it throws an error.
I am using Spark 1.5.2 and MemSQL 1.3.2.