opencb / hpg-bigdata

This repository implements converters and tools for working with NGS data in HPC or Hadoop clusters
Apache License 2.0

REST SparkSQL query: returned variants are duplicated by the number of explodes #60

Closed dapregi closed 8 years ago

dapregi commented 8 years ago

The SparkSQL query obtained from the REST service returns duplicated variants. This seems to happen because of the explodes in the SQL query.

In this case there should be just one returned variant with id 'rs587604674'.
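
For context, here is a minimal sketch of the kind of Spark SQL query that can produce this behaviour. The exact query built by the REST service is not shown in this issue; the Parquet path, the view name `vcf` and the field `annotation.conservation` below are assumptions for illustration only:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplodeDuplicationExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explode-duplication")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical Parquet file with the variant dataset; the real REST
        // service loads its own data.
        Dataset<Row> variants = spark.read().parquet("/path/to/variants.parquet");
        variants.createOrReplaceTempView("vcf");

        // LATERAL VIEW + explode over the (assumed) annotation.conservation
        // array adds a 'cons' column and emits one output row per array
        // element, so the same variant id can come back more than once.
        Dataset<Row> result = spark.sql(
                "SELECT * FROM vcf "
                + "LATERAL VIEW explode(annotation.conservation) t AS cons "
                + "WHERE id = 'rs587604674'");
        result.show();
    }
}
```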

jtarraga commented 8 years ago

Using LATERAL VIEW and the explode function (to query inside array/map structures) has two effects:

1) Additional columns are created in the original dataset (one column per view), e.g. the column 'cons' in the previous query:

```
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|         id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length|type| hgvs|             studies|          annotation|                cons|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
```

2) The output dataset can contain 'duplicated' rows, as mentioned. The content of the 'original' columns is identical in these 'duplicated' rows, but they differ in the 'additional' columns, in our example 'cons':

```
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|         id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length|type| hgvs|             studies|          annotation|                cons|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|rs587604674|   []|        22|16064870|16064870|        C|        T|     +|null|     1| SNP|Map()|[[hgva@hsapiens_g...|[22,16064870,C,T,...|[0.38699999451637...|
|rs587604674|   []|        22|16064870|16064870|        C|        T|     +|null|     1| SNP|Map()|[[hgva@hsapiens_g...|[22,16064870,C,T,...|[0.10199999809265...|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
```

Content of the additional 'cons' column for each row:

```
[[0.3869999945163727,phastCons,]]
[[0.10199999809265137,phylop,]]
```
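
These values can be inspected without truncation with something like the following, assuming the query result is held in a `Dataset<Row>` named `result`:

```java
// Show the full content of the exploded 'cons' column next to the variant id
result.select("id", "cons").show(false);
```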

It can be fixed by post-processing the output dataset:

1) Removing the additional columns using the function drop: `dataset.drop(column_name)`
2) Removing the 'duplicated' rows using the function dropDuplicates on a column with unique values (in variant datasets, the column "id"): `dataset.dropDuplicates("id")`
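
A minimal sketch of that post-processing, assuming the query result is held in a `Dataset<Row>` named `result` and that the column added by the LATERAL VIEW is called `cons`:

```java
// 1) Drop the column added by the explode;
// 2) keep a single row per variant id.
Dataset<Row> cleaned = result
        .drop("cons")
        .dropDuplicates("id");
cleaned.show();
```

Since the 'original' columns are identical across the duplicated rows, it does not matter which of them dropDuplicates keeps.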